General XMetaL Discussion


  • Derek Read

    XML Document Encodings

    Participants 1
    Replies 2
    Last Activity 13 years, 8 months ago

    Background:
    I have written this topic for people confused by encodings. Hopefully this will help you understand why Unicode is important and why in most cases you should use it.

    “XMetaL”:
    References to the term “XMetaL” herein refer to “XMetaL Author” together with products that use it at the core (includes “XMetaL Author Enterprise”) in addition to “XMAX”.

    In Brief:
    Unless you have legacy or 3rd party software that is unable to handle UTF-8 encoding you should not normally need to concern yourself with encodings at all (and may wish to avoid confusion by not reading any further). By default (when no encoding is specifically set in a document) XMetaL saves that document in the most universally compatible encoding (supported by the vast majority of XML processors). That encoding is UTF-8. XMetaL only saves in another encoding when specifically instructed to do so (covered later).

    Official Documentation:
    For the most current information please refer to the XMetaL Author help. While I am writing this document the most relevant help topic is: “Character encoding and Unicode support”.

    Encodings in XML:
    Section [url=http://www.w3.org/TR/2008/REC-xml-20081126/#charsets]2.2 Characters[/url] in the W3C Recommendation for Extensible Markup Language (XML) 1.0 (Fifth Edition) states:

    All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode…

    So, any other encoding supported by a particular XML processor is optional and is not guaranteed to be supported by any other XML processor. To state this another way, only support for UTF-8 and UTF-16 is guaranteed (though almost all software, including most operating systems, will typically also support ASCII, and by design the first 128 code points of Unicode are the same as ASCII, so UTF-8 encodes them as the same single bytes).
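
    The relationship between ASCII and UTF-8 described above can be checked with a short sketch. This is only an illustration in Python (not XMetaL code), and the sample strings are arbitrary:

```python
# Illustration only: the sample strings are arbitrary.
ascii_text = "<para>Hello, XML!</para>"

# Every character here is below code point 128, so the ASCII and UTF-8
# byte sequences are identical.
utf8_bytes = ascii_text.encode("utf-8")
ascii_bytes = ascii_text.encode("ascii")
same_bytes = utf8_bytes == ascii_bytes

# A character outside ASCII takes more than one byte in UTF-8
# and cannot be encoded as ASCII at all.
nbsp = "\u00a0"  # NO-BREAK SPACE
nbsp_utf8 = nbsp.encode("utf-8")
try:
    nbsp.encode("ascii")
    ascii_can_encode_nbsp = True
except UnicodeEncodeError:
    ascii_can_encode_nbsp = False
```

    This is why a file containing only ASCII characters is byte-for-byte the same whether it is saved as ASCII or as UTF-8.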

    Encodings Supported by XMetaL:
    XMetaL is a compliant XML processor and supports both the UTF-8 and UTF-16 encodings. For compatibility with legacy software, and because it also supports SGML editing, XMetaL also supports “ASCII” and “ISO-8859-1” (aka: Latin1).

    When a document is opened in XMetaL all characters are stored internally (in memory) as UTF-16. It is only upon saving that XMetaL encodes the document with a given encoding type (one of the 4 supported types listed above). The encoding of the saved document is controlled by the value of the encoding attribute present in the XML declaration at the beginning of any given XML document (visible only in PlainText view):

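    Saved as UTF-8: <?xml version="1.0"?> or <?xml version="1.0" encoding="UTF-8"?>
    Saved as UTF-16: <?xml version="1.0" encoding="UTF-16"?>
    Saved as ASCII: <?xml version="1.0" encoding="US-ASCII"?>
    Saved as ISO-8859-1: <?xml version="1.0" encoding="ISO-8859-1"?>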
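
    The rule that a missing encoding attribute means UTF-8 can be sketched as follows. This is a simplified Python illustration, not how XMetaL is implemented; the function name declared_encoding is hypothetical, and the pattern assumes the declaration has already been read as text:

```python
import re

# Matches an XML declaration and optionally captures the encoding value.
_DECL = re.compile(
    r'^<\?xml\s+version\s*=\s*["\'][^"\']+["\']'
    r'(?:\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\'])?'
)

def declared_encoding(xml_text: str, default: str = "UTF-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    match = _DECL.match(xml_text)
    if match and match.group(1):
        return match.group(1)
    return default

no_attribute = declared_encoding('<?xml version="1.0"?><doc/>')
latin1 = declared_encoding('<?xml version="1.0" encoding="ISO-8859-1"?><doc/>')
no_declaration = declared_encoding("<doc/>")
```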

    Though four different encoding types are supported, various values may be used to identify them. In order to make XMetaL as compatible as possible with 3rd party software, values that mean the same thing (aliases) may be used. For example, as far as XMetaL is concerned any of the following values identifies ISO-8859-1: CP819, CSISOLATIN1, IBM819, ISO-8859-1, ISO_8859-1, ISO_8859-1:1987, ISO-IR-100, LATIN1, L1 (refer to XMetaL Author's help topic “Character encoding and Unicode support” for the complete list). The actual encoding of the document will be the same no matter which value (alias) you select. However, whether 3rd party software recognizes the value you choose will depend on that software. In some cases an application will not even read this value and may instead attempt to determine the encoding by inspecting the bytes contained in the document.
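
    To illustrate that aliases name the same encoding, here is a small Python sketch. It uses Python's own codec registry as a stand-in (an assumption for the example; XMetaL's alias table is the one documented in its help), and it happens to recognize the same names listed above:

```python
import codecs

# Alias names copied from the list above.
aliases = ["CP819", "CSISOLATIN1", "IBM819", "ISO-8859-1", "ISO_8859-1",
           "ISO_8859-1:1987", "ISO-IR-100", "LATIN1", "L1"]

# Every alias resolves to one codec...
canonical_names = {codecs.lookup(name).name for name in aliases}

# ...and produces the same single byte for the character U+00E9.
encoded_forms = {"\u00e9".encode(name) for name in aliases}
```

    Whichever alias is written in the declaration, the bytes on disk are the same; only the label differs.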

    Saving Character Entities:
    XML allows characters to be written in a document as character entity references instead of as the characters themselves. There are two types of character entity reference: named and numbered. Typically, named character entities are defined in a DTD, though they may also be defined in the internal subset contained within the doctype declaration inside the XML document. Numbered entities come in two types, decimal (for example &#160;) and hexadecimal (for example &#xA0;).
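
    As a small illustration of the three forms, the following Python sketch parses the same character written as a decimal reference, a hexadecimal reference, and a named entity declared in the internal subset. The entity name nbsp and the sample documents are made up for the example:

```python
import xml.etree.ElementTree as ET

# The same character (NO-BREAK SPACE, U+00A0) written three ways.
decimal_doc = "<p>a&#160;b</p>"
hex_doc = "<p>a&#xA0;b</p>"
named_doc = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>a&nbsp;b</p>'

decimal_text = ET.fromstring(decimal_doc).text
hex_text = ET.fromstring(hex_doc).text
named_text = ET.fromstring(named_doc).text
```

    After parsing, all three documents contain the identical character; the reference form exists only in the serialized file.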

    As stated previously, XMetaL always stores characters internally as the characters themselves using UTF-16, and a particular encoding is applied only upon saving (this includes conversion to UTF-8). A few clients have asked over the years how XMetaL can be made to save character entity references to disk. This is possible, but not when a document is saved as UTF-8 or UTF-16: those encodings can represent every character defined in Unicode, so there is normally no reason to write a character as an entity reference (please also see “Predefined XML Entities” and “Scripting” below).

    If a character you require is not defined in Unicode please take that up with the Unicode Consortium (they have a process for this) or perhaps define it yourself (for internal use only) in one of the character blocks the Unicode Consortium has set aside for “private use”.

    As stated previously, by definition there are no compliant XML processors without support for UTF-8 or UTF-16. However, you may have legacy software or an operating system that needs to process or store your documents and cannot be updated to support Unicode. At the same time you may need to insert characters that are only defined in Unicode (examples: Russian, Japanese, Chinese, special typographic symbols, etc.), which this legacy system can only handle when character entity references are used. In this case you should save your files using either ASCII or ISO-8859-1 encoding. The encoding you select will depend on whether your legacy software supports just ASCII or the slightly larger ISO-8859-1. [url=http://en.wikipedia.org/wiki/ASCII]ASCII[/url] consists of 128 characters while [url=http://en.wikipedia.org/wiki/Iso-8859-1]ISO-8859-1[/url] defines 256 characters.
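
    The effect of choosing ASCII or ISO-8859-1 can be sketched in Python. This only illustrates the general idea (characters outside the target encoding become numeric references); it is not XMetaL's save code, and the sample text is arbitrary:

```python
# Sample text: U+00E9 exists in ISO-8859-1 but not in ASCII;
# U+20AC (EURO SIGN) exists in neither.
sample = "caf\u00e9 \u20ac"

# "xmlcharrefreplace" writes a decimal character reference for every
# character the target encoding cannot represent.
as_ascii = sample.encode("ascii", errors="xmlcharrefreplace")
as_latin1 = sample.encode("iso-8859-1", errors="xmlcharrefreplace")
```

    Here as_ascii needs two references while as_latin1 needs only one, which is the practical difference between the two legacy encodings.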

    Predefined XML Entities:
    XML defines 5 named character entities (&lt;, &gt;, &amp;, &apos; and &quot;) that do not need to be declared in your DTD. All XML processors must recognize these. They are defined in section [url=http://www.w3.org/TR/2008/REC-xml-20081126/#syntax]2.4 Character Data and Markup[/url] of the W3C Recommendation for Extensible Markup Language (XML) 1.0. XMetaL handles the escaping of these characters automatically in TagsOn and Normal views.
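
    For readers who handle XML source outside XMetaL, this is roughly what that escaping looks like. The sketch uses Python's standard library; the sample string is made up:

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; the two quote characters
# are passed in explicitly.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}

raw = "Tom & \"Jerry\" <cat's>"
escaped = escape(raw, QUOTE_ENTITIES)

# Reverse mapping to get the original text back.
restored = unescape(escaped, {"&quot;": '"', "&apos;": "'"})
```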

    The PlainText View Exception:
    When you save a document while viewing it in PlainText view (typically used only by very XML-savvy authors) what is written to disk should be exactly what you see on screen. This means whitespace and character entities are left as you see them. This differs from the other two editing views (TagsOn and Normal): when saving from those views, Pretty Printing (if enabled in the CTM file) is applied and all character entities are converted to the corresponding characters, provided those characters are defined by the encoding declared for the document (for all of the reasons explained earlier). Note that an organization (and many CMS integrations) may disable PlainText view using the API Application.DisablePlainTextView(), so this view may not always be available.

    Scripting:
    The entire XML source document is available to macros via the ActiveDocument.Xml property. This, in conjunction with the fact that the standard Save, SaveAs and SaveAll functions in XMetaL can be overridden using the “File operations” event macros (refer to the XMetaL Developer Programmer's Guide) means that if for some reason you need to save a document in a certain encoding (UTF-16, UTF-8 or ISO-8859-1) while at the same time preserving some or all character entities it should (in theory) be possible to do so. You would need to parse the string returned by ActiveDocument.Xml to replace the characters you wish to save with character entities, then you would need to write that new string out with a 3rd party control such as the Windows File System Object. I cannot think of a good reason to do this (which I believe is backed up by all the information above). However, the possibility is there.
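
    The string-rewriting step described above might look something like the following. This is a Python sketch of the idea only, not an XMetaL macro: the function name preserve_as_references is hypothetical, the sample string stands in for whatever ActiveDocument.Xml returns, and the file is written to a temporary directory:

```python
import os
import tempfile

def preserve_as_references(xml_source: str, chars_to_preserve: str) -> str:
    """Replace only the listed characters with decimal character references."""
    table = {ord(ch): f"&#{ord(ch)};" for ch in chars_to_preserve}
    return xml_source.translate(table)

# Stand-in for the document source; in a macro this string would come
# from the editor rather than a literal.
xml_source = ('<?xml version="1.0" encoding="UTF-8"?>\n'
              "<p>10\u00a0km \u2013 caf\u00e9</p>")

# Keep NO-BREAK SPACE and EN DASH as references; leave everything else alone.
rewritten = preserve_as_references(xml_source, "\u00a0\u2013")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.xml")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(rewritten)
    with open(path, "rb") as f:
        saved_bytes = f.read()
```

    The selected characters reach the file as references while the remaining text is written as ordinary UTF-8.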

    Additional References:
    [url=http://www.w3.org/TR/xml/]W3C XML Recommendation[/url] (current version)
    [url=http://www.unicode.org]The Unicode Consortium[/url]
    [url=http://www.unicode.org/standard/standard.html]The Unicode Standard[/url]


    rogiez

    Reply to: XML Document Encodings

    After reading this I'm still wondering about one thing:

    When opening an XML document in Plain Text view I see special characters such as non-breaking spaces as code (&#160;). But after switching to Tags On view and back to Plain Text the characters are replaced by a space. This is confusing to me, because if I save and look at the saved file in a text editor the code hasn't been changed, and I expect Plain Text view to show all special characters as code.

    Is there a way to set this in the option menu or change this in a settings file?

    I'm using XMetaL(R) Author Enterprise 6.0 Service Pack 1.

    Regards, Rogiez


    Derek Read

    Reply to: XML Document Encodings

    rogiez: I assume the encoding specified in your XML file is ASCII (the product supports the following names for that encoding in case you are using some other name: US-ASCII, ANSI_X3.4-1968, ANSI_X3.4-1986, ASCII, CP367, CSASCII, IBM367, ISO_646.IRV:1991, ISO646-US, ISO-IR-6, US).

    That is the only encoding supported by XMetaL Author that might cause the character U+00A0 (Unicode Name: “NO-BREAK SPACE”) to be saved as a numeric character entity in the XML source. The other three supported encodings (LATIN1, UTF-8 and UTF-16) all define that character and so when you save to those encodings the character is simply written out as itself, not as a numeric character entity reference.

    I'm not sure how to explain this without a long explanation, so unfortunately…

    If you use XMetaL Author to save a document using ASCII encoding and it contains characters not defined in ASCII (any character with a Unicode code point above 127; ASCII defines only 128 characters, and they are the same as the first 128 in Unicode), those characters are written as numeric character entity references (such as &#160;). You will see the references if you open the file in an editor that does not render them as single characters (such as Notepad), but not in software that does (most web browsers, for example).

    In XMetaL*:

    • If you open any document directly into Plain Text you will see something similar to opening the file in Notepad or other simple text editors.
    • If you open such a document directly into Tags On or Normal view an “encoding import” is performed.
    • When you switch from Plain Text view into one of these other two views the same “import” is performed. Essentially this means that switching from Plain Text view into Tags On or Normal view is identical to opening the file from disk directly into these two other views.

    So, what do I mean by “encoding import”? In order to provide all the functionality necessary for editing (which includes scripting through the 1200+ APIs the product supports in Tags On and Normal views) XMetaL Author converts the XML source into a common encoding, and that is UTF-16. At this point, when a document is viewed in Tags On or Normal view, each character is rendered using the corresponding glyph in the font specified for its element, provided that font has one (the font is specified in the CSS file you have created for your customization, or in the case of DITA the CSS files we ship).

    When you save a document to disk a similar “encoding export” is done that converts the internal representation of the XML source from UTF-16 into the encoding you have specified, and if you have specified ASCII then any character above Unicode code point 127 will appear as a character entity reference on disk. An “encoding export” is not done when you switch into Plain Text view from the other two views. That's probably the root of your issue. There is no way to alter this behaviour with a setting.

    I suspect your work flow is like this:
    1) Switch to Plain Text view for an already opened document or open a document directly into Plain Text view. The document's encoding is set to “ASCII” (or one of the variants for ASCII listed above).
    2) The document contains numeric character entities such as &#x00A0; or you enter the XML code for them manually.
    3) You switch into either Tags On or Normal view. In this view the characters appear as “normal” spaces because the font being used renders them as such, and almost all fonts simply redirect references to &#x00A0; (ie: decimal character 160) to the regular space glyph (which is &#x0020; or decimal character 32). It is at this point that the “encoding import” has been performed.
    4) You switch back to Plain Text view and see that these characters are represented using a single character that appears to be a normal space (under the covers it is actually Unicode code point 00A0 aka decimal 160). This is also due to the font (though in this case the font for Plain Text view is specified in Tools > Options and not via CSS).
    5) You save the document. The “encoding export” into ASCII is done. You open the file with a text editor and see &#160; wherever a NO-BREAK SPACE was entered.
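
    The import and export steps in this work flow can be imitated with a few lines of Python. This is only an analogy using Python's standard XML library, not a description of XMetaL internals, and the sample document is made up:

```python
import xml.etree.ElementTree as ET

# What is on disk: an ASCII-encoded file containing a character reference.
on_disk = b'<?xml version="1.0" encoding="US-ASCII"?>\n<p>10&#x00A0;km</p>'

# "Encoding import": parsing turns the reference into the real character.
root = ET.fromstring(on_disk)
in_memory = root.text

# "Encoding export" to ASCII: the character has no ASCII form, so it is
# written back out as a numeric character reference.
exported_ascii = ET.tostring(root, encoding="us-ascii")

# Export to UTF-8: the character is simply written as itself (two bytes).
exported_utf8 = ET.tostring(root, encoding="utf-8")
```

    Between import and export the document holds only the character itself, which is why a view of the in-memory document shows something that looks like a space rather than the reference.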

    *XMetaL here may be: XMetaL Author Essential, XMetaL Author Enterprise or XMAX.
