Author Topic: search and replace for special characters  (Read 4656 times)
Katriel Reichman
Member

Posts: 9


WWW
« on: September 07, 2010, 04:49:53 AM »

I need to automatically find and replace multiple consecutive line breaks in DITA topics.  Is it possible to add a search for multiple consecutive line breaks to an entry in a UWL?  If so, how?  Will it work only in code view or also in tag view?

Why we care: Multiple consecutive line breaks in the middle of an element in the source don't display in XMetaL in tag view or in normal view, and they render just fine in help and PDF, but they cause fuzzy matches for translation (and, in some cases, quality problems in the translation).

Here is an example:
<p>TWO: This is an example 

with two consecutive carriage returns in this sentence before the word "with".</p>


Thanks in advance,
Katriel -- check out  The DITA Project: http://methodm.com/blog/
Derek Read
Program Manager (XMetaL)
Administrator
Member

Posts: 2621



WWW
« Reply #1 on: September 07, 2010, 04:49:57 PM »

You will not be able to fix this with the Spell Check functionality.

I'm not sure I understand how multiple carriage returns would be inserted between elements, and I can't seem to reproduce this with a standard installation without any 3rd party software installed (i.e., no CMS integrations, etc.).

XMetaL Author does have a feature that can "pretty print" documents and this is configured to be on by default for DITA documents in XMetaL Author Enterprise. Perhaps that is part of your problem? This feature should not normally insert multiple line breaks in a row however.

Pressing the Enter key within a <p> element (in Tags On or Normal view) and inside most other elements (except those where white-space is significant such as inside <codeblock>) would not insert a carriage return. Various things may occur, with the most common being that the element would either be split (as with <p> elements) or the next most likely element would be inserted (which is often the same element when allowed), so I don't think that could be the issue.

You could try turning pretty printing off using the macro we provide for this purpose to see if that helps. Enable the Macros toolbar then look for "DITA Configuration: Turn OFF Pretty-Printing". Note that turning off pretty printing will not remove carriage returns from existing documents, it just stops additional pretty printing from being added.
Katriel Reichman
Member

Posts: 9


WWW
« Reply #2 on: September 07, 2010, 10:25:37 PM »

The extra line breaks come from content pasted from Microsoft Office applications and from our legacy conversion.
Derek Read
Program Manager (XMetaL)
Administrator
Member

Posts: 2621



WWW
« Reply #3 on: September 08, 2010, 12:08:39 PM »

There are going to be a bunch of ways to resolve this. I suppose ideally we'd alter the product to try to handle this particular case, so if you can let me know what needs to be in the Word doc to trigger the multiple carriage returns when pasting I will submit that and ask that we try to handle it automatically in a future release.

In the meantime I'll see if I can come up with something that gets around the issue. Probably a script. I assume getting some kind of solution quickly (so you can get on with things) is more important to you at this point than how it is done, and perhaps even more important than completely automating and hiding the process? ie: If you need to manually run a script that might be OK.
Derek Read
Program Manager (XMetaL)
Administrator
Member

Posts: 2621



WWW
« Reply #4 on: September 08, 2010, 12:17:39 PM »

The other important thing, one that will help me improve the speed of my script, is to know if this issue always and only affects certain elements. Perhaps it is always within <p> for example?
Derek Read
Program Manager (XMetaL)
Administrator
Member

Posts: 2621



WWW
« Reply #5 on: September 08, 2010, 01:11:22 PM »

Here are a few solutions that might work. I'm listing them from what I think is easiest to what I think is hardest to get working...

1) Turn on Pretty Printing for DITA documents. If it is off then your double carriage returns will be preserved (which is what seems to be occurring now). When it is turned on (which is the default) then the pretty printing feature should correct the issue automatically when you save. To turn this on for DITA run the macro called "DITA Configuration: Turn ON Pretty-Printing" in the Macros toolbar list.



2) If you must have pretty printing turned off, but also need to remove these duplicate carriage returns from within <p> elements you can add the attached MCR file to your <XMetaL Install Path>\Author\Startup folder then restart the software. The file is attached as demo_doubleCarriageReturnRemover_v1.zip so you will need to unzip it first.

Be sure to test this on non-production files (ie: make sure you have backup copies) until you trust it.

To uninstall this macro simply remove the MCR file.

Legal (for the attached MCR file):
* Licensed Materials - Property of JustSystems, Canada, Inc.
*
* (c) Copyright JustSystems Canada, Inc. 2010
* All rights reserved.
*
*-------------------------------------------------------------------
* The sample contained herein is provided to you "AS IS".
*
* It is furnished by JustSystems Corporation as a simple example and has not been
* thoroughly tested under all conditions. JustSystems Canada, Inc., therefore, cannot
* guarantee its reliability, serviceability or functionality.
*
* This sample may include the names of individuals, companies, brands and products
* in order to illustrate concepts as completely as possible. All of these names are
* fictitious and any similarity to the names and addresses used by actual persons or
* business enterprises is entirely coincidental.
*---------------------------------------------------------------------


Usage:
This MCR file will add a new macro called "DITA Workaround: Double Carriage Return Remover" to the list of macros. Before you save a document run this script. It will remove duplicate carriage returns from all <p> elements in the document. This could be extended to cover other elements, but given your initial sample I'm assuming at this point that the issue only affects <p>.

Note that the script actually replaces any and all sequences of two or more white-space characters (carriage returns, tabs, or regular spaces) with a single regular space (ie: your standard space bar space, U+0020).

It also affects the content of all children of <p> elements. Given your example, and even with most other cases where <p> does contain child elements, such as <b> for example, this should not be an issue.

However, if one of the children is <codeblock> and that element contains multiple carriage returns in a row that you wish to keep they will also be replaced. Handling all the other possible cases would probably take quite a bit of investigation and more coding.
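
For anyone who wants to see the behaviour described above outside of XMetaL, here is a rough Python sketch of the same idea. It is not the code inside the attached MCR file (that is an XMetaL macro), just an illustration written under a few assumptions: the function name collapse_whitespace_in_p and the optional skip parameter are made up for this example, and the input is assumed to be a well-formed XML string. Also note that ElementTree does not keep the DOCTYPE declaration when it reserializes, so treat this as a way to understand the logic rather than something to run over production DITA files.

```python
import re
import xml.etree.ElementTree as ET

# Two or more consecutive spaces, tabs, carriage returns or line feeds.
WS_RUN = re.compile(r"[ \t\r\n]{2,}")


def collapse_whitespace_in_p(xml_text, skip=()):
    """Collapse white-space runs inside every <p> element.

    `skip` is a hypothetical extra: tag names (e.g. "codeblock") whose
    content is left untouched. With the default empty tuple, children
    such as <codeblock> are collapsed too, as described in the post.
    """
    root = ET.fromstring(xml_text)

    def clean(elem):
        if elem.tag in skip:
            return  # leave this element's own content alone
        if elem.text:
            elem.text = WS_RUN.sub(" ", elem.text)
        for child in elem:
            clean(child)
            # child.tail is the text after the child, still inside elem
            if child.tail:
                child.tail = WS_RUN.sub(" ", child.tail)

    for p in root.iter("p"):
        clean(p)
    return ET.tostring(root, encoding="unicode")
```

Called on a string like the TWO example from the first post, the two line breaks (plus the trailing space before them) come back as a single space. Passing skip=("codeblock",) is one possible way to sidestep the <codeblock> caveat, at the cost of deciding up front which elements count as white-space sensitive.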

As this is a quick and dirty fix I have not attempted to take all possibilities into account. That is what we would need to try to do if we were to try to handle this at the root cause, which would likely be to try to clean up the Word content before it makes it into the document and at that point we're looking at a complete development cycle including proper testing, etc.



3) If you can identify the exact markup or styling in Word that triggers this issue then perhaps modifying the Word document before copying and pasting is another option. Word has its own API (similar to XMetaL Author) so that could be used to automate this process if there are lots of legacy documents.



4) Try saving the document from Word as HTML then opening it in a browser and copying and pasting from there. Just a hunch, but the process of Word exporting to HTML might just cause it to fix things up for you. What you get will probably vary widely depending on your Word version so keep that in mind if there are multiple people doing this.



5) Add an additional processing step after saving (probably XSLT) that removes these carriage returns. Perhaps your translation company offers translation memory software? If they do then it might have features that will allow for "normalizing" this type of thing.
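
Before wiring in any post-save step like option 5, it may help to know which saved topics are actually affected. The following is a minimal, read-only Python sketch for that (not XSLT, and not anything shipped with XMetaL). The function name find_break_runs and its tag parameter are hypothetical, and it assumes the topic has already been read into a well-formed XML string. The XML parser normalizes CR and CRLF line endings to a single line feed, so the pattern only needs to look for line feeds.

```python
import re
import xml.etree.ElementTree as ET

# Two line breaks in a row, allowing only spaces or tabs between them.
BREAK_RUN = re.compile(r"\n[ \t]*\n")


def find_break_runs(xml_text, tag="p"):
    """Return the text of each <tag> element containing consecutive line breaks."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter(tag):
        # itertext() joins the element's own text with that of its children
        content = "".join(elem.itertext())
        if BREAK_RUN.search(content):
            hits.append(content)
    return hits
```

A single line break followed by indentation (the kind pretty printing produces) does not match, so only paragraphs with genuinely consecutive breaks are reported. Since nothing is written back, the files themselves are never modified.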

* demo_doubleCarriageReturnRemover_v1.zip (0.59 KB - downloaded 322 times.)
« Last Edit: September 08, 2010, 01:16:35 PM by Derek Read »
Katriel Reichman
Member

Posts: 9


WWW
« Reply #6 on: September 11, 2010, 02:23:57 PM »

Derek -- this is great. Thank you.  The problem seems to come from content copied and pasted from Outlook messages as well as Word and we will prefer the programmatic solution that you offered, with the caveat that it might affect <codeblock> as well.  We'll let you know how this goes.

Katriel http://methodm.com/blog/