
  • Derek Read

    Using xml:lang Values to Control Spell Checking


    XMetaL Author Enterprise 6.0
    (should also be supported in XMetaL Author Essential 6.0 when that product is released)

    Instead of providing default settings in the xmetal60.ini file we decided to leave it up to clients to decide on these values (that's actually a good thing). However, apparently we also missed documenting this.

    Beware: this is going to be a long post …

    Request: I've spent a few hours writing this up and I think it covers most things at this point, but feedback is very welcome. 2009/12/15: I've made some changes directly to the original post after getting feedback from Richard Ishida (it should be easier to read all in one place rather than jumping back and forth between comments).

    Background / Legacy Code
    The values that XMetaL Author recognizes for spell checking default to legacy values that the product uses internally. These values were invented before xml:lang existed (actually before XML existed due to the fact that it originally came from another product). In most cases they do not match any of the RFC values most people would wish to use with xml:lang. Some of the more common ones happen to match (like EN) but this is just by chance and quite a few others do not.

    Standard xml:lang Language Codes
    The W3C XML Recommendation defines basic rules for xml:lang (how it must be declared in your DTD or Schema). Also related to this are the standards ISO-639-1, ISO-639-2, RFC4646, RFC4647 and RFC5646 (the last one makes 4646 obsolete, but not 4647). Also related is BCP47, which is the reference preferred by the W3C. BCP47 is a concatenation of several RFCs (currently RFC5646 and RFC4647) and, though long, puts everything in one place.

    Basically, ISO-639-1 consists of two-letter language codes (that many people may recognize) and ISO-639-2 uses three-letter codes. The RFCs describe how the full code should be constructed. Codes may include language, region, script, 'variants' and other things, and the RFCs include rules on letter casing and separator characters like “-”. If you need to read one document please read BCP47.
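
    To make the casing conventions a bit more concrete, here is a minimal sketch (not part of XMetaL, and the function name is made up for illustration). It assumes a simple tag of the form language[-script][-region] and applies the usual BCP47 casing conventions: lowercase language, title-case script, uppercase region. Variants, extensions and private-use subtags are not handled properly here, they are just lowercased.

```javascript
// Hypothetical helper: normalize the casing of a simple language tag.
// Only language[-script][-region] is handled; anything else is lowercased.
function normalizeLangTag(tag) {
    var parts = String(tag).split("-");
    var out = [];
    for (var i = 0; i < parts.length; i++) {
        var p = parts[i];
        if (i === 0) {
            // language subtag, e.g. "en" or "fra"
            out.push(p.toLowerCase());
        } else if (/^[A-Za-z]{4}$/.test(p)) {
            // script subtag, e.g. "Latn"
            out.push(p.charAt(0).toUpperCase() + p.slice(1).toLowerCase());
        } else if (/^([A-Za-z]{2}|[0-9]{3})$/.test(p)) {
            // region subtag, e.g. "CA" or "419"
            out.push(p.toUpperCase());
        } else {
            out.push(p.toLowerCase());
        }
    }
    return out.join("-");
}
```

    For example, normalizeLangTag("EN-us") gives "en-US". Keep in mind the spell checker itself ignores casing, so this only matters for other tools in your chain.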

    We've tried to design our spell checking support for xml:lang to be as flexible as possible. This means you may opt to specify any “standard” value or you may use other values (perhaps from an industry or other standard you may wish to follow), and you may specify multiple values in the INI file for a particular spell checking language (keeping in mind that the value for the xml:lang attribute in the XML source itself can only have one value and will therefore either match one INI setting or none).

    This means you should decide which values you will use based on all of your requirements, from external tools, XSLT transforms, specifications, etc, first. Then configure XMetaL Author's spell checker to understand the values you are working with. This is the approach I would recommend: let your requirements drive the values you use, but whenever possible stick to the most current W3C and associated standards.

    Table of INI Variables Supported by the Spell Checker for xml:lang
    Following is the complete list of currently supported spell checking languages. It includes the INI variable name (prefixed with “WT”) that controls the values you wish to have recognized for the xml:lang attribute, the English name for the language, and the corresponding ISO-639-1 and ISO-639-2 value(s) that I think would most commonly be used for that language by most people working with xml:lang.

    The values listed here for ISO639-1 and ISO-639-2 are suggestions only, though they were taken directly from those specs. Be sure you consult with other people in your organization before deciding on exact values as other tools and processes may have specific requirements.

    INI Variable Name         English Name for the Language  ISO-639-1  ISO-639-2
    ------------------------  -----------------------------  ---------  ----------------
    WT_AFRIKAANS              Afrikaans                      af         afr
    WT_CATALAN                Catalan                        ca         cat
    WT_CZECH                  Czech                          cs         ces, cze
    WT_DANISH                 Danish                         da         dan
    WT_DUTCH                  Dutch                          nl         dut, nld
    WT_ENGLISH                English                        en         eng
    WT_FRENCH                 French                         fr         fra, fre
    WT_GALICIAN               Galician                       gl         glg
    WT_GERMAN                 German                         de         ger, deu
    WT_GREEK                  Greek                          el         gre, ell
    WT_ISLANDIC               Icelandic                      is         ice, isl
    WT_ITALIAN                Italian                        it         ita
    WT_NORWEGIAN              Norwegian                      no         nor
    WT_PORTUGUESE             Portuguese                     pt         por
    WT_RUSSIAN                Russian                        ru         rus
    WT_SLOVAK                 Slovak                         sk         slo, slk
    WT_SESOTHO                Sesotho (Sotho, South Sotho)   st         sot
    WT_SPANISH                Spanish                        es         spa
    WT_SWEDISH                Swedish                        sv         swe
    WT_SETSWANA               Setswana (Tswana)              tn         tsn
    WT_TURKISH                Turkish                        tr         tur
    WT_XHOSA                  Xhosa                          xh         xho
    WT_ZULU                   Zulu                           zu         zul
    WT_ENGLISH_AUSTRALIAN     Australian English             en-au      eng-AU
    WT_ENGLISH_CANADIAN       Canadian English               en-ca      eng-CA
    WT_ENGLISH_BRITISH        British English                en-gb      eng-GB
    WT_ENGLISH_US             United States English          en-us      eng-US
    WT_FRENCH_CANADIAN        Canadian French                fr-ca      fra-CA, fre-CA
    WT_GERMAN_SWISS           Swiss German                   de-ch      deu-CH, ger-CH
    WT_PORTUGUESE_BRASIL      Brazilian Portuguese           pt-br      por-BR
    WT_SPANISH_AMERICAN       American Spanish               es-us      spa-US
    WT_NO_LINGUISTIC_CONTENT  Do Not Spell Check                        zxx

    Note: Where two ISO-639-2 codes are listed for a language (Czech, Dutch, French, German, Greek, Icelandic, Slovak) both codes are considered synonyms.
    Note: Ancient Greek (before the year 1454) is “grc” and is not supported by the spell checker.
    Note: WT_NO_LINGUISTIC_CONTENT treats content as a non-spellcheckable language.

    INI Settings Examples
    The values listed below for ISO639-1 (two letter codes) and ISO-639-2 (three letter codes) are suggestions only, though they were taken directly from those specs. Be sure you consult with other people in your organization before deciding on exact values as other tools and processes may have specific requirements.

    If the xml:lang code (the value portion of the INI variable) does not include the particular value you need just replace the existing one, or append your additional value to the end after adding a semicolon.

    In the dialects section, two letter country codes are appended to the language code to make up “dialects” which are specific regional variances in languages, however (again) these values are here as examples only and it is up to you to decide what is correct for your organization's purposes.



    Note(1): If your xml:lang value's language code is not listed in the INI file then the fallback functionality of the spell checker is to use the default language as selected in the spell checker's Options dialog (set from within the main spell checker dialog, launched via F7).

    Note(2): Letter casing (uppercase vs lowercase) is ignored with regard to xml:lang (ie: “EN-US”, “en-us” and “en-US” are considered equivalent).

    Note(3): zxx has been recommended to represent text that should not be interpreted as a standard human language. When it is used as set above XMetaL will skip over any element with xml:lang set to this value and not spell check it at all. This is useful for sections of programming code or perhaps other uses. As with all the other values here you may configure WT_NO_LINGUISTIC_CONTENT to whatever you like if “zxx” does not meet your needs (provided the value meets the xml:lang attribute value rules in the W3C XML Recommendation).

    Note(4): Regardless of any settings in the INI file, when an xml:lang attribute value is set to an empty string, such as xml:lang="", that element will be skipped and not spell checked. From the point of view of the spell checker this behavior is essentially equivalent to Note(3) above, though the meaning is different (“no language” as opposed to “non human language”). However, XMetaL Author purposely makes it difficult for users to set an attribute value to an empty string using the Attribute Inspector, so to do this you must either have implemented special code in your XMetaL Author customization to allow users to accomplish this, or you must set the value using PlainText view.

    Note(5): The values you use in the INI file should be unique to each setting. If you specify the same value in more than one INI variable unexpected behavior will occur. Please don't ask what the behavior might be, just avoid doing this.

    Note(6): Do not specify the same INI variable multiple times. This should not be an issue as far as XMetaL is concerned, but you may not see the results you expect in this case. Again, please don't ask what the behavior might be, just avoid doing this.
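
    The notes above can be modelled with a small sketch. This is not XMetaL's actual code, just an illustration of the rules: values are separated by semicolons, matching ignores letter casing (Note 2), unknown values fall back to the default language (Note 1), an empty string means "skip" (Note 4), and a value that appears under two different variables is flagged (Note 5). The "NAME = value;value" line syntax and both function names are assumptions made for the example.

```javascript
// Build a lookup table from lines such as "WT_FRENCH = fr;fra;fre".
// The line syntax is assumed for illustration only.
function buildLangTable(iniLines) {
    var table = {};
    var duplicates = [];
    for (var i = 0; i < iniLines.length; i++) {
        var eq = iniLines[i].indexOf("=");
        if (eq < 0) { continue; }
        var name = iniLines[i].substring(0, eq).replace(/^\s+|\s+$/g, "");
        var values = iniLines[i].substring(eq + 1).split(";");
        for (var j = 0; j < values.length; j++) {
            // keys are stored lowercased so that matching ignores casing
            var key = values[j].replace(/^\s+|\s+$/g, "").toLowerCase();
            if (key === "") { continue; }
            if (table.hasOwnProperty(key) && table[key] !== name) {
                // Note(5): the same value is listed under two variables
                duplicates.push(key);
            } else {
                table[key] = name;
            }
        }
    }
    return { table: table, duplicates: duplicates };
}

// Returns the matching variable name, the default, or null for "skip".
function resolveSpellLanguage(xmlLang, table, defaultName) {
    if (xmlLang === "") { return null; }            // Note(4): empty value is skipped
    var key = String(xmlLang).toLowerCase();        // Note(2): casing is ignored
    return table.hasOwnProperty(key) ? table[key] : defaultName;  // Note(1): fallback
}
```

    Running your planned settings through something like buildLangTable() before deploying them is a cheap way to catch the duplicate-value situation described in Note(5).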

    The Shipped xmetal60.ini File
    The following setting is included with the xmetal60.ini file:
    This can be safely removed if desired. It should be removed if you will be specifying your own WT_ENGLISH_BRITISH settings elsewhere in the INI file to be sure there are no conflicts. Note however, that the default internal (legacy) code of “EN-UK” will be recognized if this variable is not present and set to another value.

    How the Auto-Switching Works
    Whether you use the spell checking dialog (F7) or the 6.0 release's new “check spelling while typing” option (see Tools > Options), aka “red squiggles”, XMetaL Author switches the spell checker to the language specified in the xml:lang attribute when entering an element containing PCDATA (text).

    If that element in turn has a child element with a different xml:lang value the spell checker changes to that corresponding child element's value. When no xml:lang value is set for an element it inherits the value of the parent element or nearest ancestor (standard xml:lang rules).

    If such an element has no ancestors with an xml:lang value set then the default value for spell checking (as set in the spell checker's Options dialog) is used.

    So, assuming you have all the settings above in your INI file and your default language is set to “English-US” in the spell checker's Options dialog, when entering a given element with one of the following xml:lang values the spell checker should do the following:

    • xml:lang is not set -> XMetaL begins walking up the document tree checking for parent elements with an xml:lang value set (and uses the nearest). If it fails to find any then the value as set in the spell checker Options dialog (in this case “English-US”) is used.
    • xml:lang="en" -> All English spellings (US, UK, CA and AU) are considered correct (both “colour” and “color” are considered correct).
    • xml:lang="en-US" -> English-US is used (ie: “color” is correct, “colour” is incorrect)
    • xml:lang="en-CA" -> English-CA is used (ie: “colour” is correct, “color” is incorrect)
    • xml:lang="" -> no spell checking is performed (element is skipped)
    • xml:lang="zxx" -> no spell checking is performed (element is skipped)
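
    The lookup described in this section can be sketched roughly as follows. This is only an illustration, not the product's implementation: plain objects of the form { lang: ..., parent: ... } stand in for document elements, both function names are made up, and I am assuming that an empty xml:lang="" is inherited by descendants the same way any other value is (standard xml:lang rules).

```javascript
// Walk up from the element to find the nearest explicit xml:lang value.
// Nodes are hypothetical stand-ins: { lang: string or undefined, parent: node or null }.
function effectiveLang(node) {
    for (var n = node; n; n = n.parent) {
        if (typeof n.lang === "string") {
            return n.lang;          // nearest explicit value wins, even if it is ""
        }
    }
    return null;                    // nothing set on the element or any ancestor
}

// Decide what the spell checker should do for an element.
function spellCheckAction(node, defaultLanguage, skipValue) {
    var lang = effectiveLang(node);
    if (lang === null) {
        // no xml:lang anywhere: use the language from the Options dialog
        return { check: true, language: defaultLanguage };
    }
    if (lang === "" || lang.toLowerCase() === skipValue.toLowerCase()) {
        // xml:lang="" or the configured "no linguistic content" value
        return { check: false, language: null };
    }
    return { check: true, language: lang };
}
```

    With defaultLanguage set to "English-US" and skipValue set to "zxx" this reproduces the list above: an element with no xml:lang under a parent with "en-CA" is checked as "en-CA", and anything at or below "zxx" is skipped.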




    Reply to: Using xml:lang Values to Control Spell Checking

    This sounds very cool. Is there a way (ideally without adding a phony xml:lang on the element) to specify a list of elements that should, by default, be skipped in spell checking? That way you could configure it not to spell check code listings and code-like things. This isn't as urgent since it now has the red squiggly style spell checking which is less obtrusive, but still would be nice.



    Derek Read

    Reply to: Using xml:lang Values to Control Spell Checking


    There are new APIs in 6.0 that should allow you to do this. When I have some time to properly describe that I will create a new forum post that covers this, unless we release XMetaL Developer 6.0 and associated docs before that.



    Reply to: Using xml:lang Values to Control Spell Checking

    First, this is a great step forward, and kudos to Justsystems for implementing it.  I have just a few points I'd like to raise:

    1. RFC 5646 obsoletes RFC4646 whether you choose to follow it or not ;-).  On the other hand, it *doesn't* obsolete RFC4647; that's still current.  Actually, at the W3C we prefer to refer to these specs using the label BCP 47.  That covers both the language tag syntax spec (RFC 5646) and the matching spec (RFC 4647), and always refers to the most up-to-date version of each.

    2. It doesn't seem quite strongly enough stated for my taste that, if you aren't dealing with legacy situations, you should use language tags as defined in BCP47 as it says in the XML spec.  By implication, this means that you should use the IANA Language Subtag Registry to look up subtags, not use ISO code lists.  This is important, because the IANA registry provides only one subtag per language, whereas the ISO codes sometimes offer two or three possibilities.  The IANA registry also goes *way* beyond the list of codes offered by ISO 639-1/2, due to the inclusion of around 7,000 ISO 639-3 codes. Using the codes as defined in BCP47 and the IANA registry increases the interoperability of the data.

    3.  I would suggest that you label the two rightmost columns in the table of languages above as BCP47 and Legacy, respectively.  (Note that the region codes are not actually part of ISO 639.)

    4. I assume that WT_GERMAN accepts spellings for either Swiss German or (National) German (eg. it fails to recognise incorrect omission of eszett characters).  It may be worth adding a note to that effect to the WT_GERMAN line, since otherwise people may assume that de is sufficient for spell-checking normal German, when actually it isn't.

    5. It may also be better to clarify the intended usage of zxx, which is *not* actually the same as xml:lang="", although the effect for spell checking is the same (ie. skip the text).

    Hope that helps,


    Derek Read

    Reply to: Using xml:lang Values to Control Spell Checking


    Thanks for the great feedback.

    Because our software is very often only one piece of a larger installation (which often includes CMS systems, work flow management systems, translation memory and management systems, post processing systems, and systems that perform transformations to various file formats, and perhaps other things) the goal here was to show how to enable the XMetaL spell checker to take advantage of xml:lang values to support spell checking.

    I'll leave education in usage of xml:lang up to the experts, and there is no shortage of information on this topic, including your posts here, the W3C XML Recommendation and the specs it links to, and many books on XML including my favourite “Charles Goldfarb's XML Handbook” (Charles Goldfarb and Paul Prescod 4th Edition: ISBN 0-13-065198-2; 5th Edition: 0-13-049765-7).

    Regarding specific points…

    1. The main point here (which you understood) was that, for any given client using XMetaL, various factors come into play that may make it difficult to stick with the latest specs (and ultimately, the XML source is often for internal use only with files to be consumed externally being transformations based on this internal format). I agree though — whenever possible it makes sense to follow current specs.

    2/3. I'll have a look at these suggestions and make some changes.

    4. 'German National' is a special case. The first time German National is used you are prompted to select from one of three options (all XMetaL versions 4.x up to and including 6.0):

    • New spelling (Fluss)
    • Old spelling (Fluß)
    • Allow both

    Swiss German allows the new spelling method only.

    5. This is a good point as well. Approaching it strictly from an xml:lang usage point of view there really should be a difference (when used correctly), but (as you say) from the point of view of our spell checker there isn't any difference. The net result for the spell checker will be to skip over these elements.



    Reply to: Using xml:lang Values to Control Spell Checking

    Hi Derek,
    Could you point me to the new APIs in 6.0 that would allow me to skip spell checking based on element attributes?



    Derek Read

    Reply to: Using xml:lang Values to Control Spell Checking

    The APIs are not documented in the Programmer's Guide (6.0) yet, but here's an example you can try with the Journalist demo.
    It is still possible these APIs may change a little bit in the future (part of the reason we haven't documented them yet).

    [code]// Disable Spell Checking for Certain Nodes
    //*********************************************************
    function spellService() {
        //create the spell checker service
    }
    spellService.prototype.shouldSpellCheck = function(node) {
        //spell check every node…
        var spellCheck = 1;
        //…unless it triggers one of the following tests
        //node is an element
        if (node.nodeType == 1) {
            //element name = ProgramListing
            if (node.nodeName == "ProgramListing") {
                //do not spell check
                spellCheck = 0;
            }
            //node's parent's attribute called "Style" has a value equal to "Bullet"
            //(only test this when the parent is itself an element)
            var parent = node.parentNode;
            if (parent && parent.nodeType == 1 && parent.getAttribute("Style") == "Bullet") {
                //do not spell check
                spellCheck = 0;
            }
        }
        return spellCheck;
    }
    var spServ = new spellService();
    ActiveDocument.SetSpellCheckerService(spServ);
    [/code]
    Note that because the journalist.mcr already has an “On_Document_Open_Complete” event macro you will want to incorporate this into that same section of the MCR file. With that in place try the following XML file that uses the journalist.dtd:

    [code]
    spellchecked spellchecked spellcheckednotspellchecked spellchecked
    spellchecked notspellchecked
    [/code]
    This example is somewhat contrived because the Journalist demo only has an Id attribute for most elements that allow PCDATA. So, in this example a node that is directly inside an element with the attribute “Style” set to “Bullet” is skipped (ie: ), while child elements of that node are not skipped (they are spell checked). You will need to design your logic based on your own elements, attributes and their relationships of course. Hopefully you want to just skip entire elements, or have implemented something similar to xml:lang, as that should make the logic fairly straightforward. Ideally the amount of code in here should be kept to a minimum to make things run as fast as possible.



    Reply to: Using xml:lang Values to Control Spell Checking

    Excellent. This is exactly what I've wanted. Yes, I'll want to skip entire elements based on element names and in some cases, attribute values, (e.g. elements explicitly flagged as localize=”no” or turning spell checking on for elements explicitly flagged localize=”yes” that would otherwise be skipped, such as ).

    I'll give it a shot and let you know how it goes.




    Reply to: Using xml:lang Values to Control Spell Checking

    It works beautifully. Moreover, the performance gained by freeing XMetaL from the need to draw lines under so many words more than offsets any performance lost by doing the test. When I first started using XMetaL 6.0 with some real documents, the performance was noticeably worse than before unless I turned off the interactive spell checking. Now the performance is back to normal even with interactive spell checking on.

    This will be great too in that it rewards the writer for doing semantic markup and l10n prep. We use a localize attribute to indicate whether the contents of an element should be translated. We programmatically add the localize attribute to certain elements, but the writer can override the default behavior by manually adding localize=”yes” or localize=”no” to an element.
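
    For anyone who wants to try something similar, the decision logic can be sketched roughly like this. It is a simplified illustration, not our production code: the function name and the skip list are made up, and it assumes the only inputs are the element name and the value of the localize attribute. It returns 1 or 0 to match the values used in the shouldSpellCheck example earlier in this thread.

```javascript
// Hypothetical list of element names that are skipped unless overridden.
var SKIP_BY_DEFAULT = { "ProgramListing": true };

// Returns 1 to spell check the element, 0 to skip it.
function shouldSpellCheckElement(nodeName, localizeValue) {
    // an explicit localize attribute always wins over the default
    if (localizeValue === "no") { return 0; }
    if (localizeValue === "yes") { return 1; }
    // otherwise fall back to the per-element default
    return SKIP_BY_DEFAULT.hasOwnProperty(nodeName) ? 0 : 1;
}
```

    Inside shouldSpellCheck(node) this would be called for element nodes with node.nodeName and node.getAttribute("localize"), keeping the per-node work down to a couple of comparisons.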




    Reply to: Using xml:lang Values to Control Spell Checking

    I have a few questions.

    1. Is all of the information still accurate for XMAX v7?

    2. We've noticed the following behavior when playing with the xml:lang attribute:

    xml:lang=”en” accepts “color” and “colour” as valid words.
    xml:lang=”en-US” accepts “color” and stops on “colour” (expected)
    xml:lang=”en-CA” does not seem to work. It doesn't even stop on “sfsdfsdfsf”.
    When the xml:lang attribute is omitted entirely, it accepts “colour” but stops on “color” (is this en-CA?)

    xml:lang=”fr-CA” does not work either. All words are ignored.
    xml:lang=”fr” seems to work for the french dictionary, but is it Canadian French?

    We need to be able to specify en-CA and fr-CA languages for the spell checker. How can we accomplish this?
    (Also, is there an .ini file when using XMAX? I could not find one.)

    3. Is there an API function that we can call instead of setting the xml:lang attribute (since our current DTD does not allow it)?



    Derek Read

    Reply to: Using xml:lang Values to Control Spell Checking

    The information I originally posted in this message is not accurate for XMAX.

    I'm checking with our dev team to see what can be offered for your situation, but the behaviour you are seeing is expected primarily because this configuration is done with INI settings (for XMetaL Author) but XMAX does not have an INI file.

    Some APIs have been added to recent releases of XMAX to allow you to configure it to support some features that are set using INI values in XMetaL Author, but these do not cover the spell checking INI settings.

    Essentially, the reason you are seeing the behaviour in XMAX is due to the fact that internally Writing Tools was created to support made-up language codes that mostly do not match standard xml:lang codes (Writing Tools predates xml:lang by a decade or so). When XMetaL Author communicates with Writing Tools it uses these odd codes, but the solution for exposing the correct codes to the outside (via xml:lang) requires correct values to be set in the INI file.

    The codes that Writing Tools supports internally do not include “en-ca”. The made-up codes include “CE” (presumably for Canadian English) and “en-oz” (Australian English). These are adjusted in the default XMetaL Author INI file to “en-ca” and . “fr-ca” is also not there and instead Writing Tools recognizes “CF” (not “fr-cf” just “CF”). Again, this is adjusted in the INI file for XMetaL Author.

    What they probably should have done was to hard code more standard xml:lang values into XMetaL Author itself as defaults (which would then be inherited by the XMAX code). The INI settings would still allow people to set their own values if desired but then at least the defaults would be normal xml:lang values.

    At this point I'm not sure what we're going to do, but I suspect we should do some cleanup in addition to adding specific support for this to XMAX (if still required after this cleanup).

    Workaround (?)

    I can think of a fairly elaborate workaround that might work now, but I'm not sure if you want to go to these lengths (this is a pretty hacky workaround). It would be possible to add xml:lang to your document's DTD without modifying the actual DTD using the event On_DTD_Open_Complete. The method is addAttribute() and depending on how you want to do it you might even add the xml:lang as a fixed attribute with a set value, probably to the root element (assuming your docs contain one language). If the value was set to “CE” or “CF” I think that might work. Setting it to be a “fixed” attribute should mean that you do not need to change the markup. If it were set to “implied” then you'd need to change the actual XML markup (then probably undo that before saving so that the document remains valid for the rest of your XML software chain).


    Derek Read

    Reply to: Using xml:lang Values to Control Spell Checking

    For anyone that wants to use proper xml:lang values with XMAX today (in the case where the current value for a particular language is non-standard): launch the spell checker, then go to Options > Language. Clicking the Add button lets you add a new language code using the xml:lang value you want. You can then select that code and choose a language file to use with it.

    These codes are limited to two letter codes only.

    Pretty sure this won't help mrpaul as the additional requirement there is that you don't want to modify the DTD, and the current DTD doesn't have xml:lang, so the scripting solution is probably easiest in that case.


    Derek Read

    Reply to: Using xml:lang Values to Control Spell Checking

    I'm not yet getting what I expected in my results when adding xml:lang programmatically via On_DTD_Open_Complete so that workaround might not be possible. Once I figure out for sure I'll post something here.


    Derek Read

    Reply to: Using xml:lang Values to Control Spell Checking

    OK, so here's the deal. The “internal” values for Canadian English and Canadian French aren't what I thought they were (my long post a couple of posts ago is still basically correct except for these two values). They are “en-ce” and “fr-cf”.

    So, if you want to use the scripting workaround mentioned previously and…

    …the document you are loading into XMAX is entirely Canadian French then run this script in On_DTD_Open_Complete:

    [code]// XMetaL Script Language JSCRIPT:
    var docType = ActiveDocument.doctype;
    [/code]

    …the document you are loading into XMAX is entirely Canadian English then run this script in On_DTD_Open_Complete:
    [code]// XMetaL Script Language JSCRIPT:
    var docType = ActiveDocument.doctype;
    [/code]



    Reply to: Using xml:lang Values to Control Spell Checking


    Some of our documents have mixed languages. Mainly, I need to be able to switch between “en-ce” and “fr-cf” depending on the current position of the cursor (we have a custom spell checking module that loops through the document using a range and calls XMAX spell checking functionality). In our XML, the different language sections have a lang=1 or lang=2 attribute that I can use to determine which spell checker dictionary to use.

    My question is: in which XMAX call does the value of xml:lang get read to determine if the word is spelled correctly or not? Is it inside the call to ActiveDocument.IsSpellingCorrect()? Is it a bad idea to continuously change the value of the document's doctype on the fly?

    I tried testing this but using the example you provided, I am getting an “attribute exists” exception from XMAX the 2nd time it tries to set the value (since the code that checks if the word is spelled correctly loops as the range changes its selection).

    Also, the first time it's added, when I view the ActiveDocument.Document.xml value, I cannot see the set xml:lang attribute. Is that normal?

    Can I call docType.addAttribute with an element that isn't the root (in the 1st param)? How should I go about this?

    Thank you.

    EDIT: Just to add to this, if I check docType.hasAttribute(“Root”, “xml:lang”) before adding it, it will properly return false the 1st time, then true the 2nd time as to not re-add it. However, when I get to a section in the XML that is in a different language, I'd like to update this value. What's weird is that docType.attributes is empty and there doesn't seem to be a “updateAttribute” or other similar function. And again, if I re-call docType.addAttribute, it crashes with the error: “attribute exists.”. I tried playing with the intDeclType param in addAttribute (I noticed you set this to sqDTFIXED or 3 in your example) but to no avail.
    The good news is that when the xml:lang attribute is set to the value “en-ce” or “fr-cf”, the proper dictionaries are used! I just need to be able to update this dynamically.

