Ask Ben: Parsing Microsoft Word XML Into A Useful ColdFusion XML Document

Posted July 24, 2007 at 8:46 PM by Ben Nadel

Tags: ColdFusion, Ask Ben

Hey, do you know any way to parse Word XML data? I am trying to extract name, address, job info, etc., from a resume that was converted to XML.

My first response to this person was basically "Hell No". Dealing with Microsoft Word in any way, from an XML point of view, is basically a surefire way to end up killing yourself (or anyone else who happens to be around you). If you've ever looked at a Microsoft Word XML document, immediately your eyes hurt and I have even heard of cases where people start to bleed from their left ear. It is the craziest, longest, most convoluted XML ever imagined. When I look at it, I imagine some guys at Microsoft snickering to themselves that people on the outside actually have to deal with this.

After I responded to this person, I couldn't quite let it go. See, as ugly as it is, the Microsoft Word XML is still valid XML (terrifying, I know), which means that ColdFusion can parse it. In fact, if we saved this fairly small document as XML:


 
 
 

 
[Image: Microsoft Word Document XML Document]

And, then we simply read it in using CFFile and parse it using XmlParse():

    <!--- Read in the Microsoft Word XML file. --->
    <cffile
        action="read"
        file="#ExpandPath( './document.xml' )#"
        variable="strMSWordData"
        />

    <!---
        Parse the Microsoft XML data into a ColdFusion
        XML document.
    --->
    <cfset xmlDoc = XmlParse( strMSWordData ) />

    <!--- Output the XML document. --->
    <cfdump
        var="#xmlDoc#"
        label="Full Microsoft Word XML Data"
        />

... we end up getting a CFDump like this:


 
 
 

 
[Image: Microsoft Word XML Document In ColdFusion XML Without Any Preprocessing]

As you can see (sort of), the XML does parse properly and does return an XML document. But, even zoomed out as much as Firefox would let me (CTRL+-), I couldn't get more than just a fraction of the XML document on my screen (notice the scroll bars). So, even though it does parse, I just feel like it doesn't give us anything useful.
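
If the full dump is too big to take in, one option is to summarize the parsed document rather than dump all of it. The following is only a rough sketch, assuming the same ./document.xml file as above; arrAllNodes is just an illustrative variable name, and the numbers you get will depend entirely on the document Word produced.

```cfm
<!--- Read and parse the same file as above. --->
<cffile
	action="read"
	file="#ExpandPath( './document.xml' )#"
	variable="strMSWordData"
	/>

<cfset xmlDoc = XmlParse( strMSWordData ) />

<!--- Gather every element node in the document using XPath. --->
<cfset arrAllNodes = XmlSearch( xmlDoc, "//*" ) />

<!--- Output a short summary instead of the full dump. --->
<cfoutput>
	Root element: #xmlDoc.XmlRoot.XmlName#<br />
	Top-level children: #ArrayLen( xmlDoc.XmlRoot.XmlChildren )#<br />
	Total elements: #ArrayLen( arrAllNodes )#
</cfoutput>
```

This doesn't make the document any easier to work with, but it does give a quick sense of how much markup is wrapped around the actual text.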

But, again, it is valid XML. And as such, I thought maybe we could come up with a way to clean it up and put it into a form that we could actually use. After studying the Microsoft Word XML document followed by another hour of struggling with some regular expressions, this is what I came up with:

    <!--- Read in the Microsoft Word XML file. --->
    <cffile
        action="read"
        file="#ExpandPath( './document.xml' )#"
        variable="strMSWordData"
        />

    <!---
        Strip out all line breaks to make our regular
        expressions easier to handle and read (when we
        have no line breaks, we can use the (.) operator
        as the wild card).
    --->
    <cfset strMSWordData = strMSWordData.ReplaceAll(
        "[\r\n\t]+",
        " "
        ) />

    <!--- Strip out all name spaces and tag attributes. --->
    <cfset strMSWordData = strMSWordData.ReplaceAll(
        "(?i)(</?)(?:[^:\s>]+:)?(\w+).*?(/?>)",
        "$1$2$3"
        ) />

    <!---
        Strip out processing directives (everything that
        comes before the root wordDocument tag).
    --->
    <cfset strMSWordData = strMSWordData.ReplaceFirst(
        "(?i).+?(<worddocument)",
        "$1"
        ) />

    <!---
        Strip out all "other" tags (anything that is not
        wordDocument, body, or p).
    --->
    <cfset strMSWordData = strMSWordData.ReplaceAll(
        "(?i)</?(?!worddocument\b|body\b|p\b)\w+.*?/?>",
        ""
        ) />

    <!---
        Parse the Microsoft XML data into a ColdFusion
        XML document.
    --->
    <cfset xmlDoc = XmlParse( strMSWordData ) />

    <!--- Output the XML document. --->
    <cfdump
        var="#xmlDoc#"
        label="CLEANED Microsoft Word XML Data"
        />
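
To make the effect of each expression easier to follow, here is a small sketch that runs the same three replacements on a tiny fragment. The fragment is hypothetical (hand-written to look like Word XML, not taken from a real file), and the strSample / strStep variable names are only for illustration.

```cfm
<!--- A tiny, made-up fragment in the style of Word XML. --->
<cfset strSample = '<?xml version="1.0"?><w:wordDocument xmlns:w="urn:example"><w:body><w:p w:rsidR="00A1"><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:wordDocument>' />

<!--- Step 1: drop name space prefixes and tag attributes. --->
<cfset strStep1 = strSample.ReplaceAll(
	"(?i)(</?)(?:[^:\s>]+:)?(\w+).*?(/?>)",
	"$1$2$3"
	) />

<!--- Step 2: drop everything before the root tag. --->
<cfset strStep2 = strStep1.ReplaceFirst(
	"(?i).+?(<worddocument)",
	"$1"
	) />

<!--- Step 3: drop every tag except wordDocument, body, and p. --->
<cfset strStep3 = strStep2.ReplaceAll(
	"(?i)</?(?!worddocument\b|body\b|p\b)\w+.*?/?>",
	""
	) />

<!--- Escape the markup so the tags are visible in the browser. --->
<cfoutput>
	Step 1: #HTMLEditFormat( strStep1 )#<br />
	Step 2: #HTMLEditFormat( strStep2 )#<br />
	Step 3: #HTMLEditFormat( strStep3 )#<br />
</cfoutput>

<!---
	For this fragment, step 3 should leave:
	<wordDocument><body><p>Hello</p></body></wordDocument>
--->
```

Note that only the tags are removed; any text that lived inside a stripped tag stays behind, which is why the cleaned document can still contain text that was never part of a paragraph.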

Here, I am stripping out as many of the tags as I possibly can while still maintaining some sort of document form. This is the CFDump output I get from this cleaned up ColdFusion XML document:


 
 
 

 
[Image: Microsoft Word XML Document In ColdFusion XML With Preprocessing / Cleaning]

As you can see, this is MUCH easier to at least read than that massive - I need a PhD just to wrap my head around it - Microsoft Word XML document. But, of course, easier to read doesn't necessarily mean easier to use. Frankly, I think that Microsoft Word does not produce consistent enough XML to really create anything usable. At least with this simplified format you can rely on some powers of string parsing / manipulation to help you get things done.
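
Because the cleaned-up document no longer has name spaces, a plain XPath query is another option alongside string parsing. This is a minimal sketch, not a resume parser: the inline XML is hypothetical sample data shaped like the cleaned output, and arrParagraphs / arrLines are illustrative names.

```cfm
<!--- Hypothetical cleaned document, shaped like the output above. --->
<cfset xmlDoc = XmlParse(
	"<wordDocument><body><p>Jane Doe</p><p></p><p>123 Main Street</p></body></wordDocument>"
	) />

<!--- With the name spaces gone, "//p" finds every paragraph node. --->
<cfset arrParagraphs = XmlSearch( xmlDoc, "//p" ) />

<!--- Collect the non-empty paragraph text into a simple array. --->
<cfset arrLines = ArrayNew( 1 ) />

<cfloop index="intI" from="1" to="#ArrayLen( arrParagraphs )#">

	<cfset strText = Trim( arrParagraphs[ intI ].XmlText ) />

	<cfif Len( strText )>
		<cfset ArrayAppend( arrLines, strText ) />
	</cfif>

</cfloop>

<!--- Output the paragraph text, one array entry per paragraph. --->
<cfdump
	var="#arrLines#"
	label="Paragraph Text"
	/>
```

Deciding which line is the name and which is the address is still up to you. Nothing in the markup labels them, so that part remains guesswork based on position or pattern matching.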

So, long story short, not a solution, but maybe a step in the right direction when it comes to dealing with Microsoft Word XML in ColdFusion. And, of course, Microsoft Word XML is SOOO inconsistent (if I opened up MS Word and re-wrote the above document, it would probably not produce the same XML) that who knows if these regular expressions would even work again.



Reader Comments

Jul 25, 2007 at 9:03 AM

Ben, I hear what you are saying but it's a good attempt anyway. Hopefully, Microsoft will standardize their crap at some point in the future. Keep the cool stuff coming.


Apr 23, 2008 at 3:36 PM

Thanks, this XML into CF works well. Any way to know which fields are which instead of all just named XMLText?


Apr 23, 2008 at 3:39 PM

@Darrin,

I am not sure. I think whatever Microsoft puts in there is gonna be unpredictable.


Jun 10, 2010 at 9:22 PM

I am using the above code to read both core.xml and custom.xml, the metadata for Office documents.

However, I am wondering how I could update or add new values in the XML?

On the other hand, is there better/updated code around that can successfully read/update the metadata of Office documents?

