General XMetaL Discussion

  • AryehSanders

    XMetaL 5.5 does not correctly detect UTF-8? Anyone know if new versions do?

    Participants 8
    Replies 9
    Last Activity 10 years, 5 months ago

    Using XMetaL Author Essential 5.5.0.266, a file that contains only an NBSP, encoded as the UTF-8 byte sequence C2 A0, shows up with an A with a circumflex (Â). This is against the XML spec and seemingly against the XMetaL help file. In the section on character encodings, it says: “If the encoding cannot be determined, then the default is used: UTF-8 for XML documents and ANSI (Latin-1) for SGML documents.” XMetaL does identify the encoding correctly if the file starts with either a BOM or an XML declaration, although the XML specification doesn't require either when UTF-8 is used.
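
    The symptom can be reproduced outside XMetaL. The following is a minimal sketch (Python is used only for illustration, and the variable names are hypothetical) that decodes the same two bytes under three encodings. It shows that the A with a circumflex appears whenever C2 A0 is read as Latin-1 or Windows-1252 instead of UTF-8.

```python
import unicodedata

# The two bytes from the test file: the UTF-8 encoding of U+00A0.
raw = b"\xc2\xa0"

as_utf8 = raw.decode("utf-8")      # one character: NO-BREAK SPACE
as_latin1 = raw.decode("latin-1")  # two characters: A-circumflex, then NBSP
as_cp1252 = raw.decode("cp1252")   # same result as Latin-1 for these bytes

for label, text in (("utf-8", as_utf8), ("latin-1", as_latin1), ("cp1252", as_cp1252)):
    names = [unicodedata.name(ch) for ch in text]
    print(label, names)
```

    Read as UTF-8, the bytes yield a single NO-BREAK SPACE. Read as Latin-1 or Windows-1252, they yield LATIN CAPITAL LETTER A WITH CIRCUMFLEX followed by NO-BREAK SPACE, which matches what XMetaL displays.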

    Other users are experiencing this with XMetaL Author Enterprise 5.5.  Is there a setting somewhere to fix this? Is this fixed in newer versions of XMetaL?

    Thanks,
    Aryeh

    Derek Read

    The sequence C2 A0 is a proper UTF-8 byte sequence for the Unicode NO-BREAK SPACE character. Perhaps you are confusing that byte sequence with the Unicode code point “U+00A0” or an XML character reference such as &#160;?

    I suspect the other XML parser you are using to check this is incorrectly assuming your file is encoded as Windows CP1252 or Latin1/ISO-8859-1 (that can happen quite easily for Windows applications). Or perhaps you are saving from another application and it is not correctly encoding the file? You may wish to add encoding="UTF-8" to your XML declaration to give it an extra hint. XML parsers must be able to read UTF-8 so if it is a proper parser I think that should give it enough information.
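
    The suggestion to add encoding="UTF-8" to the XML declaration can be sketched as a small pre-processing step. This is only an illustration under stated assumptions: the function name ensure_xml_declaration is hypothetical, the input is assumed to already be UTF-8, and nothing here is part of XMetaL or any XMetaL API.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"
DECLARATION = b'<?xml version="1.0" encoding="UTF-8"?>\n'

# "<?xml" followed by whitespace, so that "<?xml-stylesheet" is not mistaken
# for an XML declaration.
_HAS_DECLARATION = re.compile(rb"<\?xml\s")


def ensure_xml_declaration(data: bytes) -> bytes:
    """Return data with an explicit UTF-8 XML declaration at the start.

    Assumption: data is already encoded as UTF-8. A leading UTF-8 BOM is
    dropped when a declaration has to be added, because the declaration
    then states the encoding explicitly.
    """
    body = data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data
    if _HAS_DECLARATION.match(body):
        return data  # already declared, leave the file untouched
    return DECLARATION + body


fixed = ensure_xml_declaration(b"<p>\xc2\xa0</p>")
print(fixed.decode("utf-8"))
```

    Files that already start with a declaration are returned unchanged, so the step can be applied repeatedly without stacking declarations.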

    Can you let me know what this other software is?

    Derek Read

    In case it helps, here is a page that lists the byte sequences for this character in various encodings, including UTF-8:
    http://www.fileformat.info/info/unicode/char/00a0/index.htm

    Table 3.1B on the following page may also help somewhat:
    http://www.unicode.org/versions/corrigendum1.html

    The character in question falls into the range “U+0080..U+07FF”. Find that in Table 3.1B then follow along to the right.
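
    The row for that range describes a two-byte form: a lead byte 110yyyyy followed by a trail byte 10xxxxxx. As a worked example, here is a minimal sketch of that bit layout (the function name encode_two_byte is hypothetical and handles only this one range):

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range U+0080..U+07FF is handled")
    lead = 0xC0 | (code_point >> 6)     # 110yyyyy: the upper five bits
    trail = 0x80 | (code_point & 0x3F)  # 10xxxxxx: the lower six bits
    return bytes([lead, trail])


nbsp = encode_two_byte(0xA0)
print(" ".join(f"{byte:02X}" for byte in nbsp))  # prints "C2 A0"
```

    For U+00A0 the upper bits are 00010 and the lower bits are 100000, which gives the bytes C2 A0.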

    AryehSanders

    That's exactly the problem.  C2 A0 is supposed to be NBSP, but that's not how XMetaL interprets it.

    Derek Read

    Please post a copy of the file XMetaL is opening incorrectly or submit a support case to XMetaL Support.
    Then I can let you know why it thinks the file is not UTF-8 and is treating it as Latin1.

    Let me know which software was used to create the file as well so we can have a look at it if possible.

    AryehSanders

    I should add that I've been using a hex editor to be sure there are no other encoding issues.  I'm attaching the file that I'm using.  The contents should be interpreted as NBSP, but they aren't.

    I generated the test file with perl/Cygwin:  perl -e 'print "\xc2\xa0"' > test.xml
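
    For anyone without Perl or Cygwin, an equivalent test file can be produced as in the following sketch. The helper name write_nbsp_test_file is hypothetical, and the file is written to a temporary directory only so that the example cleans up after itself.

```python
import tempfile
from pathlib import Path


def write_nbsp_test_file(path: Path) -> bytes:
    """Write the two bytes C2 A0 to path and return what was stored."""
    path.write_bytes(b"\xc2\xa0")  # binary write: no newline, BOM or declaration
    return path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    data = write_nbsp_test_file(Path(tmp) / "test.xml")
    print(data.hex())  # prints "c2a0"
```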

    The case where this caused an issue for us is a C# program that stripped the XML declaration from files in SharePoint. Since the encoding is UTF-8, in principle, that shouldn't have caused a problem, but…

    At any rate, we really would also like to know if it works properly in newer versions, since we were discussing an upgrade anyway.

    Thanks for your help.

    Derek Read

    Encoding handling has not been modified since the 2.1 release (when proper support for Unicode was added) so there won't be any difference between the 5, 6 and 7 releases.

    I'll have a look and see if I can figure out what's going on with this file.

    Derek Read

    I see the issue. The document is a partial document (it does not contain an XML declaration or DOCTYPE declaration) so XMetaL is making a best guess and it is guessing Latin1. XMetaL uses the XML declaration to detect encoding and in this case it is guessing incorrectly (there really is no way for it to know exactly).

    Adding an XML declaration to the file should solve the issue as XMetaL Author will then be able to use the

    AryehSanders

    Our actual documents do have a valid DOCTYPE. This file is just a minimal test case. XMetaL handles the characters the same way with a DOCTYPE and without. For a file with neither a BOM nor an XML declaration, the XML spec (http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding) says “it is a fatal error… for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.” So Latin1 isn't really an option. The spec does not require a BOM or an XML declaration, even though it does recommend the XML declaration.
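
    The rule quoted above can be sketched as a simplified detection routine. This is an illustration only, not XMetaL's implementation and not a complete version of the spec's autodetection appendix: the function name guess_xml_encoding is hypothetical, and it assumes the encoding declaration, if present, is written in an ASCII-compatible encoding.

```python
import re

# Matches an encoding pseudo-attribute inside a leading XML declaration.
_ENCODING_DECL = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity from its first bytes.

    Order of checks: BOM, then encoding declaration, then the UTF-8
    default. UTF-32 and EBCDIC detection are deliberately left out.
    """
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    match = _ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither a BOM nor an encoding declaration: the entity must be UTF-8.
    return "utf-8"


print(guess_xml_encoding(b"\xc2\xa0"))
```

    Under this rule the bare C2 A0 test file falls through to the final return and is treated as UTF-8, with or without a DOCTYPE.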

    Thanks again.

    Derek Read

    I guess that sounds right. I'll pass this along to development so they can look into it and prioritize it.
