Ask Ben: Parsing Microsoft Word XML Into A Useful ColdFusion XML Document

Posted July 24, 2007 at 8:46 PM

Tags: ColdFusion, Ask Ben

Hey, do you know any way to Parse word XML data? I am trying to extract name, address, job info, etc., from a resume that was converted to XML.

My first response to this person was basically "Hell No". Dealing with Microsoft Word in any way, from an XML point of view, is basically a surefire way to end up killing yourself (or anyone else who happens to be around you). If you've ever looked at a Microsoft Word XML document, your eyes immediately hurt, and I have even heard of cases where people start to bleed from their left ear. It is the craziest, longest, most convoluted XML ever imagined. When I look at it, I imagine some guys at Microsoft snickering to themselves that people on the outside actually have to deal with this.

After I responded to this person, I couldn't quite let it go. See, as ugly as it is, the Microsoft Word XML is still valid XML (terrifying, I know), which means that ColdFusion can parse it. In fact, if we save this fairly small document as XML:

[Image: Microsoft Word Document XML Document]

And, then we simply read it in using CFFile and parse it using XmlParse():


    <!--- Read in the Microsoft Word XML file. --->
    <cffile
        action="read"
        file="#ExpandPath( './document.xml' )#"
        variable="strMSWordData"
        />

    <!---
        Parse the Microsoft XML data into a ColdFusion
        XML document.
    --->
    <cfset xmlDoc = XmlParse( strMSWordData ) />

    <!--- Output the XML document. --->
    <cfdump
        var="#xmlDoc#"
        label="Full Microsoft Word XML Data"
        />

... we end up getting a CFDump like this:

[Image: Microsoft Word XML Document In ColdFusion XML Without Any Preprocessing]

As you can see (sort of), the XML does parse properly and does return an XML document. But, even zoomed out as far as Firefox would let me (CTRL+-), I couldn't get more than a fraction of the XML document on my screen (notice the scroll bars). So, even though it does parse, I just feel like it doesn't give us anything useful.

But, again, it is valid XML. And as such, I thought maybe we could come up with a way to clean it up and put it into a form that we could actually use. After studying the Microsoft Word XML document followed by another hour of struggling with some regular expressions, this is what I came up with:


    <!--- Read in the Microsoft Word XML file. --->
    <cffile
        action="read"
        file="#ExpandPath( './document.xml' )#"
        variable="strMSWordData"
        />

    <!---
        Strip out all line breaks and tabs to make our regular
        expressions easier to handle and read (when we
        have no line breaks, we can use the (.) operator
        as the wild card).
    --->
    <cfset strMSWordData = strMSWordData.ReplaceAll(
        "[\r\n\t]+",
        " "
        ) />

    <!--- Strip out all name spaces and tag attributes. --->
    <cfset strMSWordData = strMSWordData.ReplaceAll(
        "(?i)(</?)(?:[^:\s>]+:)?(\w+).*?(/?>)",
        "$1$2$3"
        ) />

    <!---
        Strip out processing directives (everything that
        comes before the root wordDocument tag).
    --->
    <cfset strMSWordData = strMSWordData.ReplaceFirst(
        "(?i).+?(<worddocument)",
        "$1"
        ) />

    <!---
        Strip out all "other" tags (anything that is not
        wordDocument, body, or p).
    --->
    <cfset strMSWordData = strMSWordData.ReplaceAll(
        "(?i)</?(?!worddocument\b|body\b|p\b)\w+.*?/?>",
        ""
        ) />

    <!---
        Parse the Microsoft XML data into a ColdFusion
        XML document.
    --->
    <cfset xmlDoc = XmlParse( strMSWordData ) />

    <!--- Output the XML document. --->
    <cfdump
        var="#xmlDoc#"
        label="CLEANED Microsoft Word XML Data"
        />

Here, I am stripping out as many of the tags as I possibly can while still maintaining some sort of document form. This is the CFDump output I get from this cleaned up ColdFusion XML document:

[Image: Microsoft Word XML Document In ColdFusion XML With Preprocessing / Cleaning]
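
To make the four regular expression passes above a bit more concrete, here is a rough sketch that runs the same patterns against a tiny sample. It is written in plain Java rather than ColdFusion, on the grounds that ReplaceAll() and ReplaceFirst() are the underlying java.lang.String methods, so the patterns are identical. The WordXmlCleaner class, its method names, the hand-written sample markup, and the urn:example:wordml namespace are all made up for illustration; real Word output is far larger and may well behave differently.

```java
// Hypothetical helper: the four regex passes from the ColdFusion code,
// expressed as plain java.lang.String calls.
class WordXmlCleaner {

    // Pass 1: collapse line breaks and tabs into single spaces so that (.)
    // can match across what used to be separate lines.
    static String flatten(String xml) {
        return xml.replaceAll("[\\r\\n\\t]+", " ");
    }

    // Pass 2: keep only "<" or "</", the local tag name, and ">" or "/>",
    // dropping the namespace prefix and any attributes.
    static String stripNamespacesAndAttributes(String xml) {
        return xml.replaceAll("(?i)(</?)(?:[^:\\s>]+:)?(\\w+).*?(/?>)", "$1$2$3");
    }

    // Pass 3: drop everything that precedes the root <wordDocument> tag
    // (the XML declaration and any processing instructions).
    static String stripProlog(String xml) {
        return xml.replaceFirst("(?i).+?(<worddocument)", "$1");
    }

    // Pass 4: remove every tag whose name is not wordDocument, body, or p.
    // The text between removed tags is left in place.
    static String stripOtherTags(String xml) {
        return xml.replaceAll("(?i)</?(?!worddocument\\b|body\\b|p\\b)\\w+.*?/?>", "");
    }

    // Runs the passes in the same order as the ColdFusion code.
    static String clean(String xml) {
        return stripOtherTags(stripProlog(stripNamespacesAndAttributes(flatten(xml))));
    }

    public static void main(String[] args) {
        // Hand-written stand-in for Word XML; the namespace URI is a placeholder.
        String sample =
            "<?xml version=\"1.0\"?>\n"
            + "<w:wordDocument xmlns:w=\"urn:example:wordml\">\n"
            + "<w:body>\n"
            + "<w:p w:id=\"1\"><w:pPr><w:jc w:val=\"left\"/></w:pPr>"
            + "<w:r><w:t>Jane Doe</w:t></w:r></w:p>\n"
            + "</w:body>\n"
            + "</w:wordDocument>";

        System.out.println(clean(sample));
    }
}
```

For this sample, the four passes boil everything down to `<wordDocument> <body> <p>Jane Doe</p> </body> </wordDocument>`. Note that the text inside the removed tags survives; only the tags themselves are dropped.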
 
 
 

As you can see, this is MUCH easier to at least read than that massive - I need a PhD just to wrap my head around it - Microsoft Word XML document. But, of course, easier to read doesn't necessarily mean easier to use. Frankly, I think that Microsoft Word does not produce consistent enough XML to really create anything usable. At least with this simplified format you can rely on some powers of string parsing / manipulation to help you get things done.
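
As one example of leaning on that simplified format, here is a small sketch that pulls the text of each remaining p element out of a cleaned document. It uses the JDK's built-in DOM parser in plain Java (the regex calls in the ColdFusion code are java.lang.String methods, so Java is a close neighbor). The ParagraphExtractor class and the sample resume lines are invented for illustration, and it assumes the cleaned string is well-formed XML, which regex stripping of real Word output may not always guarantee.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: lists the text of each non-empty <p> element.
class ParagraphExtractor {

    // Parses the cleaned XML and returns the trimmed text of every <p>
    // element, skipping paragraphs that contain no text.
    static List<String> paragraphs(String cleanedXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(cleanedXml)));

            NodeList nodes = doc.getElementsByTagName("p");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent().trim();
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
            return result;
        } catch (Exception e) {
            // Surfaces malformed input as an unchecked exception.
            throw new IllegalArgumentException("Could not parse cleaned XML", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample in the cleaned wordDocument / body / p shape.
        String cleaned =
            "<wordDocument> <body> <p>Jane Doe</p> <p>123 Main Street</p>"
            + " <p/> <p>Software Developer</p> </body> </wordDocument>";

        for (String line : paragraphs(cleaned)) {
            System.out.println(line);
        }
    }
}
```

This only gets you a flat list of paragraph strings. Deciding which one is the name, the address, or the job info is still up to whatever string parsing you layer on top.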

So, long story short, not a solution, but maybe a step in the right direction when it comes to dealing with Microsoft Word XML in ColdFusion. And, of course, since Microsoft Word XML is SOOO inconsistent (if I opened up MS Word and rewrote the above document, it would probably not produce the same XML), who knows if these regular expressions would even work again.


Reader Comments

Jul 25, 2007 at 9:03 AM

Ben, I hear what you are saying but it's a good attempt anyway. Hopefully, Microsoft will standardize their crap at some point in the future. Keep the cool stuff coming.


Apr 23, 2008 at 3:36 PM

Thanks, this XML into CF works well. Any way to know which fields are which instead of all just named XMLText?


Apr 23, 2008 at 3:39 PM

@Darrin,

I am not sure. I think whatever Microsoft puts in there is gonna be unpredictable.

