General XMetaL Discussion
AryehSanders April 18, 2012 at 7:46 am
XMetaL 5.5 does not correctly detect UTF-8? Anyone know if new versions do?
Participants: 8, Replies: 9, Last Activity: 10 years, 9 months ago
Using XMetaL Author Essential 5.5.0.266, a file that simply contains NBSP, where NBSP is actually the UTF-8 sequence for NBSP (in hex, C2 A0), shows up with an A with a circumflex. This is against the XML spec and seemingly against the XMetaL help file. In the section on character encodings, it says: “If the encoding cannot be determined, then the default is used: UTF-8 for XML documents and ANSI (Latin-1) for SGML documents.” It does correctly identify the encoding if the file starts with either a BOM or an XML declaration, although the XML specification doesn't require either if UTF-8 is used.
Other users are experiencing this with XMetaL Author Enterprise 5.5. Is there a setting somewhere to fix this? Is this fixed in newer versions of XMetaL?
Thanks,
Aryeh
Derek Read April 18, 2012 at 5:12 pm
Reply to: XMetaL 5.5 does not correctly detect UTF-8? Anyone know if new versions do?
The sequence C2 A0 is a proper UTF-8 byte sequence for the Unicode NO-BREAK SPACE character. Perhaps you are confusing that byte sequence with the Unicode code point “U+00A0” or XML entity
I suspect the other XML parser you are using to check this is incorrectly assuming your file is encoded as Windows CP1252 or Latin1/ISO-8859-1 (that can happen quite easily for Windows applications). Or perhaps you are saving from another application and it is not correctly encoding the file? You may wish to add encoding="UTF-8" to your XML declaration to give it an extra hint. XML parsers must be able to read UTF-8, so if it is a proper parser I think that should give it enough information.
Can you let me know what this other software is?
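To make the symptom concrete, here is a minimal Python sketch (not from the thread) that decodes the same two bytes under the encodings mentioned above. It only illustrates the general mismatch and says nothing about how XMetaL itself is implemented.

```python
# The two bytes under discussion: the UTF-8 encoding of U+00A0 (NO-BREAK SPACE).
raw = bytes([0xC2, 0xA0])

as_utf8 = raw.decode("utf-8")      # one character: U+00A0
as_latin1 = raw.decode("latin-1")  # two characters: U+00C2 (A with circumflex) + U+00A0
as_cp1252 = raw.decode("cp1252")   # same result as Latin-1 for these two bytes

# Print the code points each interpretation produces.
print([f"U+{ord(c):04X}" for c in as_utf8])
print([f"U+{ord(c):04X}" for c in as_latin1])
```

A reader that assumes Latin-1 or CP1252 therefore shows an extra "Â" in front of the no-break space, which matches the A with a circumflex reported in the first post.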
Derek Read April 18, 2012 at 6:10 pm
In case it helps, here is a page that lists the byte sequences for this character in various encodings, including UTF-8:
http://www.fileformat.info/info/unicode/char/00a0/index.htm
Table 3.1B on the following page may also help somewhat:
http://www.unicode.org/versions/corrigendum1.html
The character in question falls into the range “U+0080..U+07FF”. Find that in Table 3.1B then follow along to the right.
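As a worked example of that range, the following Python sketch builds the two-byte UTF-8 form (110yyyyy 10xxxxxx) by hand for U+00A0. The function name `encode_two_byte` is hypothetical and exists only for illustration.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as the two bytes 110yyyyy 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point outside the two-byte UTF-8 range")
    lead = 0b11000000 | (code_point >> 6)           # upper 5 bits go in the lead byte
    trail = 0b10000000 | (code_point & 0b00111111)  # lower 6 bits go in the trail byte
    return bytes([lead, trail])


nbsp = encode_two_byte(0x00A0)
print(nbsp.hex())  # c2a0
```

For U+00A0 the upper bits are 00010 and the lower bits are 100000, which gives the bytes C2 A0 discussed in this thread.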
AryehSanders April 18, 2012 at 7:02 pm
That's exactly the problem. C2 A0 is supposed to be NBSP, but that's not how XMetaL interprets it.
Derek Read April 18, 2012 at 7:04 pm
Please post a copy of the file XMetaL is opening incorrectly or submit a support case to XMetaL Support.
Then I can let you know why it thinks the file is not UTF-8 and is treating it as Latin1. Let me know which software was used to create the file as well so we can have a look at it if possible.
AryehSanders April 18, 2012 at 7:08 pm
I should add that I've been using a hex editor to be sure there are no other encoding issues. I'm attaching the file that I'm using. The contents should be interpreted as NBSP, but they aren't.
I generated the test file with perl/Cygwin: perl -e 'print "\xc2\xa0"' > test.xml
The case where this caused an issue for us is a C# program that stripped the XML declaration from files in SharePoint. Since the encoding is UTF-8, in principle, that shouldn't have caused a problem, but…
At any rate, we really would also like to know if it works properly in newer versions, since we were discussing an upgrade anyway.
Thanks for your help.
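For anyone who wants to reproduce the three cases described in this thread without Perl, here is a rough Python sketch. The file names and the temporary output directory are made up for illustration, and like the original test file these are minimal byte-level test cases rather than well-formed XML documents.

```python
import codecs
import tempfile
from pathlib import Path

NBSP_UTF8 = b"\xc2\xa0"  # UTF-8 bytes for U+00A0

# Hypothetical file names, one per case described in the thread.
variants = {
    "bare.xml": NBSP_UTF8,                    # same bytes as the perl one-liner
    "bom.xml": codecs.BOM_UTF8 + NBSP_UTF8,   # starts with a UTF-8 byte order mark
    "decl.xml": b'<?xml version="1.0" encoding="UTF-8"?>\n' + NBSP_UTF8,
}

out_dir = Path(tempfile.mkdtemp())
for name, data in variants.items():
    (out_dir / name).write_bytes(data)
    print(name, data.hex())
```

According to the first post, the second and third variants are read correctly and only the bare one is misinterpreted.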
Derek Read April 18, 2012 at 8:37 pm
Encoding handling has not been modified since the 2.1 release (when proper support for Unicode was added) so there won't be any difference between the 5, 6 and 7 releases.
I'll have a look and see if I can figure out what's going on with this file.
Derek Read April 18, 2012 at 8:54 pm
I see the issue. The document is a partial document (it does not contain an XML declaration or DOCTYPE declaration) so XMetaL is making a best guess and it is guessing Latin1. XMetaL uses the XML declaration to detect encoding and in this case it is guessing incorrectly (there really is no way for it to know exactly).
Adding an XML declaration to the file should solve the issue as XMetaL Author will then be able to use the
AryehSanders April 18, 2012 at 10:30 pm
Our actual documents do have a valid DOCTYPE. This file is just a minimal test case. XMetaL handles the characters the same way with a DOCTYPE and without. If there is neither an XML declaration nor a BOM, the XML spec (http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding) says “it is a fatal error… for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.” Latin1 isn't really an option. The spec does not require a BOM or an XML declaration, even though it does recommend the XML declaration.
Thanks again.
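The rule quoted from the spec can be sketched roughly as follows in Python. This is a simplified illustration, not XMetaL's code and not the full detection procedure from the XML specification: the helper name `guess_xml_encoding` is hypothetical, and the sketch does not handle UTF-16 or other non-ASCII-compatible input that lacks a BOM.

```python
import codecs
import re

# Matches encoding="..." inside a leading XML declaration (ASCII-compatible input only).
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def guess_xml_encoding(data: bytes) -> str:
    """Hypothetical helper: check the BOM, then the encoding declaration, else UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither a BOM nor an encoding declaration: the spec leaves UTF-8 as the only option.
    return "utf-8"


print(guess_xml_encoding(b"\xc2\xa0"))
print(guess_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
```

Under this reading, the bare two-byte test file falls through to the final branch and is treated as UTF-8, never as Latin1.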
Derek Read April 19, 2012 at 1:04 am
I guess that sounds right. I'll pass this along to development so they can look into it and prioritize it.