Background: I wrote this topic for people who are confused by encodings. I hope it helps you understand why Unicode is important and why, in most cases, you should use it.
"XMetaL":References to the term "XMetaL" herein refer to "XMetaL Author" together with products that use it at the core (includes "XMetaL Author Enterprise") in addition to "XMAX".In Brief:Unless you have legacy or 3rd party software that is unable to handle UTF-8 encoding you should not normally need to concern yourself with encodings at all (and may wish to avoid confusion by not reading any further). By default (when no encoding is specifically set in a document) XMetaL saves that document in the most universally compatible encoding (supported by the vast majority of XML processors). That encoding is UTF-8. XMetaL only saves in another encoding when specifically instructed to do so (covered later).
Official Documentation: For the most current information please refer to the XMetaL Author help. At the time of writing, the most relevant help topic is "Character encoding and Unicode support".
Encodings in XML: Section 2.2 Characters of the W3C Recommendation for Extensible Markup Language (XML) 1.0 (Fifth Edition) states:
All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode...
So, any other encoding supported by a particular XML processor is optional and is not guaranteed to be supported by any other XML processor. To state this another way, only support for UTF-8 and UTF-16 is guaranteed. (Almost all software, including most operating systems, will typically also support ASCII, and by design the first 128 Unicode characters are the same as ASCII and are encoded in UTF-8 as the same single bytes.)
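To make the ASCII overlap concrete, here is a small Python sketch. Python is used only for illustration and is not part of XMetaL. It shows that pure ASCII text produces identical bytes in ASCII and UTF-8, while a character outside the ASCII range does not.

```python
# Illustration only: compare the bytes produced by different encodings.
ascii_text = "XML"
ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")

# For characters in the ASCII range the two encodings give the same bytes.
same_for_ascii = ascii_bytes == utf8_bytes

# Outside the ASCII range the encodings differ.
e_acute = "\u00e9"                            # LATIN SMALL LETTER E WITH ACUTE
e_acute_utf8 = e_acute.encode("utf-8")        # two bytes: C3 A9
e_acute_utf16 = e_acute.encode("utf-16-le")   # two bytes: E9 00 (no byte order mark)

print(same_for_ascii, e_acute_utf8.hex(), e_acute_utf16.hex())
```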
Encodings Supported by XMetaL: XMetaL is a compliant XML processor and supports both the UTF-8 and UTF-16 encodings. For compatibility with legacy software, and because it also supports SGML editing, XMetaL also supports "ASCII" and "ISO-8859-1" (aka Latin1).
When a document is opened in XMetaL all characters are stored internally (in memory) as UTF-16. It is only upon saving that XMetaL encodes the document with a given encoding type (one of the 4 supported types listed above). The encoding of the saved document is controlled by the value of the encoding attribute present in the XML declaration at the beginning of any given XML document (visible only in PlainText view):
Saved as UTF-8: <?xml version="1.0"?>
or <?xml version="1.0" encoding="UTF-8"?>
Saved as UTF-16: <?xml version="1.0" encoding="UTF-16"?>
Saved as ASCII: <?xml version="1.0" encoding="US-ASCII"?>
Saved as ISO-8859-1: <?xml version="1.0" encoding="ISO-8859-1"?>
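As a rough illustration of how software might read these declarations, the following Python sketch returns the declared encoding and falls back to UTF-8 when none is given. The function name declared_encoding is hypothetical, and this is a simplification rather than XMetaL's implementation or a full XML parser: it only checks for a byte order mark and then looks for an ASCII-compatible declaration.

```python
import re

# Matches an XML declaration and captures the optional encoding name.
_XML_DECL = re.compile(
    rb"""^<\?xml\s+version\s*=\s*["'][^"']*["']"""
    rb"""(?:\s+encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["'])?"""
)

def declared_encoding(head: bytes) -> str:
    """Guess the encoding from the first bytes of an XML file (hypothetical helper)."""
    # A UTF-16 byte order mark settles the question before any declaration is read.
    if head.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16"
    # Skip a UTF-8 byte order mark if one is present.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    match = _XML_DECL.match(head)
    if match and match.group(1):
        return match.group(1).decode("ascii")
    # No declaration, or no encoding in it: assume UTF-8.
    return "UTF-8"

print(declared_encoding(b'<?xml version="1.0"?><doc/>'))
print(declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><doc/>'))
```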
Though four different encoding types are supported, various values may be used to identify them. In order to make XMetaL as compatible as possible with 3rd party software, values that mean the same thing (aliases) may be used. For example, as far as XMetaL is concerned, any of the following values identifies ISO-8859-1: CP819, CSISOLATIN1, IBM819, ISO-8859-1, ISO_8859-1, ISO_8859-1:1987, ISO-IR-100, LATIN1, L1 (refer to XMetaL Author's help topic "Character encoding and Unicode support" for the complete list). The actual encoding of the document will be the same no matter which value (alias) you select. However, whether 3rd party software recognizes the value you choose will depend on that software. In some cases an application will not even read this value and may instead attempt to determine the encoding by examining the characters (or byte sequences) contained in the document.
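The idea that several alias names select one and the same encoding can be seen in other software too. The Python sketch below uses Python's own alias table, which is separate from XMetaL's, with a subset of the names listed above that Python also happens to recognise. It is meant only to show that the bytes written do not depend on which alias is used.

```python
import codecs

# A subset of the aliases listed above that Python's codec registry also knows.
ALIASES = [
    "CP819", "CSISOLATIN1", "IBM819", "ISO-8859-1",
    "ISO_8859-1", "ISO-IR-100", "LATIN1", "L1",
]

# Every alias resolves to the same underlying codec...
canonical_names = {codecs.lookup(name).name for name in ALIASES}

# ...and therefore produces the same bytes for the same text.
encoded_forms = {"caf\u00e9".encode(name) for name in ALIASES}

print(canonical_names, encoded_forms)
```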
Saving Character Entities: XML allows characters to be written in a document as character entity references. There are two types of character entity reference: named and numbered. Named character entities are typically defined in a DTD, though they may also be defined in the internal subset contained within the doctype declaration inside the XML document. Numbered references come in two forms, decimal (for example &#233;) and hexadecimal (for example &#xE9;).
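The following Python sketch shows the three forms side by side, using the standard library's ElementTree parser rather than XMetaL. The entity name eacute is declared in the internal subset purely for this example. All three references resolve to the same character once the document is parsed.

```python
import xml.etree.ElementTree as ET

# One named entity (declared in the internal subset), one decimal and one
# hexadecimal numbered reference, all for U+00E9.
doc = (
    '<!DOCTYPE p [<!ENTITY eacute "&#233;">]>'
    "<p>&eacute; &#233; &#xE9;</p>"
)

# After parsing, the references have been replaced by the character itself.
text = ET.fromstring(doc).text
print(text)
```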
As stated previously, XMetaL always stores characters internally as the characters themselves, using UTF-16, and applies a particular encoding only upon saving (this includes conversion to UTF-8). A few clients have asked over the years how XMetaL can be made to save character entity references to disk. This is possible, but not when a document is saved as UTF-8 or UTF-16: those encodings can represent every Unicode character directly, so writing a character entity reference in place of the character simply does not make sense (please also see "Predefined XML Entities" and "Scripting" below).
If a character you require is not defined in Unicode please take that up with the Unicode Consortium (they have a process for this) or perhaps define it yourself (for internal use only) in one of the character blocks the Unicode Consortium has set aside for "private use".
As stated previously, every compliant XML processor supports UTF-8 and UTF-16. However, you may have legacy software or an operating system that needs to process or store your documents and cannot be updated to support Unicode. At the same time you may need to insert characters that are defined only in Unicode (examples: Russian, Japanese, Chinese, special typographic symbols, etc.), which this legacy system can handle only when character entity references are used. In this case you should save your files using either ASCII or ISO-8859-1 encoding, so that characters the chosen encoding cannot represent are saved as character entity references. The encoding you select will depend on whether your legacy software supports just ASCII or the slightly larger ISO-8859-1.
ASCII consists of 128 characters while ISO-8859-1 defines 256 characters.
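The general technique of falling back to numbered character references when an encoding cannot represent a character can be sketched in Python with the standard "xmlcharrefreplace" error handler. This illustrates the idea only and makes no claim about how XMetaL implements it. Note how "é" needs a reference in ASCII but fits in a single byte in ISO-8859-1, while the euro sign needs a reference in both.

```python
text = "caf\u00e9 \u20ac"  # "café" followed by a space and the euro sign

# Characters the target encoding cannot represent become &#NNN; references.
as_ascii = text.encode("ascii", errors="xmlcharrefreplace")
as_latin1 = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# UTF-8 can represent every Unicode character, so no references are needed.
as_utf8 = text.encode("utf-8")

print(as_ascii)
print(as_latin1)
print(as_utf8)
```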
Predefined XML Entities: XML defines 5 named character entities that do not need to be declared in your DTD: &amp;, &lt;, &gt;, &apos; and &quot;. All XML processors must recognize these. They are defined in section 2.4 Character Data and Markup of the W3C Recommendation for Extensible Markup Language (XML) 1.0. XMetaL handles the escaping of these characters automatically in TagsOn and Normal views.
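For readers who generate XML outside XMetaL, this Python sketch shows the same escaping done by hand with the standard library. The sample string is made up for the example. Escaping and then parsing returns the original text, which is the round trip an editor performs for you automatically.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "AT&T <\"quoted\"> it's"

# escape() handles &, < and > by default; the quote characters are passed
# in explicitly so that all five predefined entities are used.
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})

# Any XML parser turns the predefined entities back into characters.
round_trip = ET.fromstring(f"<p>{escaped}</p>").text

print(escaped)
print(round_trip)
```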
The PlainText View Exception: When you save a document while viewing it in PlainText view (typically used only by very XML-savvy authors), what is written to disk should be exactly what you see on screen. This means whitespace and character entities are left as you see them. The other two editing views (TagsOn and Normal) behave differently: when you save from those views, Pretty Printing is applied (if enabled in the CTM file) and all character entities are converted to the corresponding characters in the encoding declared for the document (for all of the reasons explained earlier). Note that an organization (and many CMS integrations) may disable PlainText view using the API Application.DisablePlainTextView(), so this view may not always be available.
Scripting: The entire XML source document is available to macros via the ActiveDocument.Xml property. In addition, the standard Save, SaveAs and SaveAll functions in XMetaL can be overridden using the "File operations" event macros (refer to the XMetaL Developer Programmer's Guide). Together these mean that if for some reason you need to save a document in a certain encoding (UTF-16, UTF-8 or ISO-8859-1) while preserving some or all character entities, it should (in theory) be possible to do so. You would need to process the string returned by ActiveDocument.Xml to replace the characters you wish to save with character entities, and then write that new string out with an external component such as the Windows File System Object. I cannot think of a good reason to do this (which I believe is backed up by all the information above). However, the possibility is there.
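The string processing step described above can be sketched as follows. This is a Python analogue for illustration only: a real XMetaL macro would be written in a scripting language the product supports and would obtain the string from ActiveDocument.Xml. The function name with_char_refs, the sample document and the choice of the no-break space as the character to preserve are all assumptions made for the example.

```python
import os
import tempfile

def with_char_refs(xml_source: str, keep_as_refs: set) -> str:
    """Replace the chosen characters with hexadecimal character references."""
    return "".join(
        f"&#x{ord(ch):X};" if ch in keep_as_refs else ch
        for ch in xml_source
    )

# Stand-in for the string a macro would read from ActiveDocument.Xml.
source = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<p>10\u00a0\u20ac caf\u00e9</p>"
)

# Keep only the no-break space (U+00A0) as a reference; other characters
# are written directly in the target encoding.
converted = with_char_refs(source, {"\u00a0"})

# Write the result as UTF-8, then read the raw bytes back to inspect them.
path = os.path.join(tempfile.mkdtemp(), "sample.xml")
with open(path, "w", encoding="utf-8", newline="") as handle:
    handle.write(converted)
with open(path, "rb") as handle:
    saved_bytes = handle.read()

print(saved_bytes)
```

Only the characters named in keep_as_refs are rewritten, so the rest of the document is still stored as ordinary UTF-8.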
Additional References:
W3C XML Recommendation (current version)
The Unicode Consortium
The Unicode Standard