General XMetaL Discussion
Consigliere July 5, 2017 at 10:39 am
Converting legacy Word documents to DITA (or RTF to XML?) – XMetaL 12 AE
I've been struggling with trying to convert Word documents into DITA. I have a number of methods available, but none have really worked out for me. I'm using XMetaL 12 Author Enterprise.
I have tried to convert Word documents into DITA with a program called X-ICE, which is advertised as a sort of extension of XMetaL. However, there is next to no information on it online. There's the user manual online – which happens to be more complete than the manual it comes with – but no one has so much as commented on the program on any forums, let alone shared tips or tricks.
My problem with X-ICE is that it requires you to create specific rules to “scan” Word documents with. In some cases that goes well enough, but in others the documents' styling is such a patchwork that rules like that won't work. And even if the text is styled consistently enough that X-ICE can go through it, its output is not easy to customize. It constantly runs into problems, such as not closing tags, or closing them but leaving the cursor inside the tag – things that are fairly small, but will throw a gigantic monkey wrench in the works. And EVEN if this problem were solved, it would arise all over again once documents with differently funky layouts pop up. So all in all, X-ICE doesn't sound like a very stable solution.
There is another converter program called Convertoo, but it's even worse than X-ICE in certain ways. While X-ICE's shortcomings could arguably be called user-based on some level, Convertoo simply won't work with half the Word documents I try to convert.
I've even tried looking up alternative methods, such as converting Word into RTF and then somehow converting that into DITA/XML, but that was a fairly fruitless endeavour too. I tried using Paul Tremblay's rtf2xml program, but the attempt fell short at installation.
So my question is: Does XMetaL have any form of assistance regarding the conversion of Word documents into DITA/XML? Oxygen apparently does it natively, but does XMetaL have anything similar? How do people usually deal with the need to bring a ton of legacy documents into DITA?
I've tried copying and pasting the contents of an RTF file to a Generic Topic, and although it did keep the tables, images, and other elements intact, it did have its own problems. And correcting hundreds of files manually doesn't sound too plausible…
Derek Read July 10, 2017 at 6:44 pm
Yes, as you have noted, there is functionality included with the DITA authoring solution in XMetaL Author Enterprise that lets you copy from Word and paste into a DITA document. It attempts to convert HTML on the Windows clipboard into DITA, so it actually works whenever there is any HTML on the clipboard, which means you can copy from Word, a browser, or any application that puts HTML on the clipboard.
Results will be mixed depending on how the original source Word document was marked up, which also influences what Word puts on the clipboard. Different versions of Word encode documents in different ways and they also end up putting different things on the clipboard.
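To illustrate why the clipboard content matters so much, here is a minimal Python sketch of an HTML-to-DITA mapping. It is not how XMetaL implements its paste conversion; the `TAG_MAP` table, the `HtmlToDitaSketch` class, and the `html_fragment_to_dita` function are all hypothetical names made up for this example. The sketch only shows the general idea: known HTML tags are mapped to DITA elements, and anything else (such as Word's `span` wrappers and style attributes) is dropped while its text is kept.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Hypothetical mapping from HTML tags to DITA elements.
# This is an assumption for illustration, not XMetaL's actual rule set.
TAG_MAP = {
    "p": "p",
    "ul": "ul",
    "ol": "ol",
    "li": "li",
    "b": "b",
    "strong": "b",
    "i": "i",
    "em": "i",
}


class HtmlToDitaSketch(HTMLParser):
    """Collects DITA-like markup for the subset of tags listed in TAG_MAP."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes (class, style, ...) are discarded on purpose.
        if tag in TAG_MAP:
            self.parts.append(f"<{TAG_MAP[tag]}>")

    def handle_endtag(self, tag):
        if tag in TAG_MAP:
            self.parts.append(f"</{TAG_MAP[tag]}>")

    def handle_data(self, data):
        # Text is always kept, escaped so it is safe inside XML.
        self.parts.append(escape(data))


def html_fragment_to_dita(html):
    """Convert an HTML fragment to a DITA-like fragment (sketch only)."""
    parser = HtmlToDitaSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    word_like = '<p class="MsoNormal">Hello <span style="x"><strong>world</strong></span></p>'
    print(html_fragment_to_dita(word_like))
```

A converter like this can only work with what it is given: if the source application writes unusual or inconsistent HTML, the output inherits those problems. Note that this sketch does not validate its output against a DITA DTD and does not repair malformed input.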
One notable example is where two separate lists have been joined together to form what appears to be styled as a single list in Word (i.e., 1. 2. + 1. 2. looks like 1. 2. 3. 4. in Word). In this case Word may put two different lists on the clipboard (1. 2. followed by a different 1. 2.) and this results in two lists being created in the DITA document (there is no way to add missing information that Word does not provide). These are the kinds of things that all Word to DITA conversion solutions are going to run into, to different degrees and for different reasons, depending on their approach.
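For the split-list case specifically, part of the manual fix-up could be scripted after the paste. The following is a minimal Python sketch, assuming the pasted result has been saved as well-formed XML and that two `ol` elements sitting directly next to each other were meant to be one list. That assumption is a heuristic and will be wrong for lists that really are separate, so the result still needs review. The function name `merge_adjacent_ol` is hypothetical and is not part of XMetaL.

```python
import xml.etree.ElementTree as ET


def merge_adjacent_ol(parent):
    """Merge each <ol> into the <ol> immediately before it (heuristic).

    Two lists are only merged when they are direct siblings and no text
    sits between them. Nested elements are processed recursively.
    """
    previous = None
    for child in list(parent):
        merge_adjacent_ol(child)
        if (
            child.tag == "ol"
            and previous is not None
            and previous.tag == "ol"
            and not (previous.tail or "").strip()
        ):
            # Move the list items into the earlier list and drop the empty one.
            previous.extend(list(child))
            previous.tail = child.tail
            parent.remove(child)
        else:
            previous = child


if __name__ == "__main__":
    body = ET.fromstring(
        "<body><ol><li>1</li><li>2</li></ol><ol><li>3</li><li>4</li></ol><p>x</p></body>"
    )
    merge_adjacent_ol(body)
    print(ET.tostring(body, encoding="unicode"))
```

Lists separated by a paragraph or by any text are left alone, which keeps the heuristic conservative. When writing a real topic back to disk, the DOCTYPE declaration would need to be preserved separately, since `xml.etree.ElementTree` does not retain it.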
I don't think there is any perfect solution, mostly due to the wide variety of proprietary Word formatting that has to be dealt with and the amount of time people are willing to spend trying to deal with it all. So, at present, no matter which solution you choose, I think there is going to be some manual fixing up to do. The one benefit of using the XMetaL copy and paste feature is that the resulting document should be valid. The main drawback is that there is no batch capability.