Ask Ben: Parsing Microsoft Word XML Into A Useful ColdFusion XML Document

By Ben Nadel

Published 2007-07-24 in Ask Ben, ColdFusion

Hey, do you know any way to Parse word XML data? I am trying to extract name, address, job info, etc., from a resume that was converted to XML.

My first response to this person was basically "Hell No". Dealing with Microsoft Word in any way, from an XML point of view, is basically a surefire way to end up killing yourself (or anyone else who happens to be around you). If you've ever looked at a Microsoft Word XML document, your eyes immediately hurt, and I have even heard of cases where people start to bleed from their left ear. It is the craziest, longest, most convoluted XML ever imagined. When I look at it, I imagine some guys at Microsoft snickering to themselves that people on the outside actually have to deal with this.

After I responded to this person, I couldn't quite let it go. See, as ugly as it is, the Microsoft Word XML is still valid XML (terrifying, I know), which means that ColdFusion can parse it. In fact, if we saved this fairly small document as XML:

And, then we simply read it in using CFFile and parse it using XmlParse():

<!--- Read in the Microsoft Word XML file. --->
<cffile
	action="read"
	file="#ExpandPath( './document.xml' )#"
	variable="strMSWordData"
	/>


<!---
	Parse the Microsoft XML data into a ColdFusion
	XML document.
--->
<cfset xmlDoc = XmlParse( strMSWordData ) />


<!--- Output the XML document. --->
<cfdump
	var="#xmlDoc#"
	label="Full Microsoft Word XML Data"
	/>

... we end up getting a CFDump like this:

Microsoft Word XML Document In ColdFusion XML Without Any Preprocessing

As you can see (sort of), the XML does parse properly and does return an XML document. But, even zoomed out as far as Firefox would let me (CTRL+-), I couldn't get more than a fraction of the XML document on my screen (notice the scroll bars). So, even though it does parse, I just feel like it doesn't give us anything useful.

But, again, it is valid XML. And as such, I thought maybe we could come up with a way to clean it up and put it into a form that we could actually use. After studying the Microsoft Word XML document followed by another hour of struggling with some regular expressions, this is what I came up with:

<!--- Read in the Microsoft Word XML file. --->
<cffile
	action="read"
	file="#ExpandPath( './document.xml' )#"
	variable="strMSWordData"
	/>

<!---
	Replace each run of line breaks and tabs with a single
	space to make our regular expressions easier to handle
	and read (when we have no line breaks, we can use the
	(.) operator as the wild card).
--->
<cfset strMSWordData = strMSWordData.ReplaceAll(
	"[\r\n\t]+",
	" "
	) />

<!--- Strip out all name spaces and tag attributes. --->
<cfset strMSWordData = strMSWordData.ReplaceAll(
	"(?i)(</?)(?:[^:\s>]+:)?(\w+).*?(/?>)",
	"$1$2$3"
	) />

<!--- Strip out processing directives. --->
<cfset strMSWordData = strMSWordData.ReplaceFirst(
	"(?i).+?(<worddocument)",
	"$1"
	) />

<!--- Strip out all "other" tags. --->
<cfset strMSWordData = strMSWordData.ReplaceAll(
	"(?i)</?(?!worddocument\b|body\b|p\b)\w+.*?/?>",
	""
	) />


<!---
	Parse the Microsoft XML data into a ColdFusion
	XML document.
--->
<cfset xmlDoc = XmlParse( strMSWordData ) />


<!--- Output the XML document. --->
<cfdump
	var="#xmlDoc#"
	label="CLEANED Microsoft Word XML Data"
	/>

Here, I am stripping out as many of the tags as I possibly can while still maintaining some sort of document form. This is the CFDump output I get from this cleaned up ColdFusion XML document:

Microsoft Word XML Document In ColdFusion XML With Preprocessing / Cleaning
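
To make the four replace steps a little less mysterious, here is a minimal sketch that runs the same expressions against a tiny fragment. The fragment is hand-written and hypothetical (it only imitates the w:wordDocument / w:body / w:p / w:r / w:t nesting; a real Word export is far larger), and strSample is just an illustrative variable name:

```cfml
<!---
	Hypothetical, hand-written sample. It only imitates the
	nesting of a real Microsoft Word XML document.
--->
<cfsavecontent variable="strSample">
<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
	<w:body>
		<w:p>
			<w:pPr><w:jc w:val="left"/></w:pPr>
			<w:r><w:t>Hello world</w:t></w:r>
		</w:p>
	</w:body>
</w:wordDocument>
</cfsavecontent>

<!--- Step 1: collapse line breaks and tabs into single spaces. --->
<cfset strSample = strSample.ReplaceAll(
	"[\r\n\t]+",
	" "
	) />

<!--- Step 2: drop name space prefixes and attributes (w:p becomes p). --->
<cfset strSample = strSample.ReplaceAll(
	"(?i)(</?)(?:[^:\s>]+:)?(\w+).*?(/?>)",
	"$1$2$3"
	) />

<!--- Step 3: drop everything in front of the root tag. --->
<cfset strSample = strSample.ReplaceFirst(
	"(?i).+?(<worddocument)",
	"$1"
	) />

<!--- Step 4: drop every tag that is not wordDocument, body, or p. --->
<cfset strSample = strSample.ReplaceAll(
	"(?i)</?(?!worddocument\b|body\b|p\b)\w+.*?/?>",
	""
	) />

<!--- Show the cleaned markup as text, then parse it. --->
<cfoutput>#HtmlEditFormat( strSample )#</cfoutput>

<cfset xmlSample = XmlParse( strSample ) />
```

With this input, only the wordDocument, body, and p tags should survive. The run and text tags are removed, but the text they wrapped stays behind inside the paragraph.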

As you can see, this is MUCH easier to at least read than that massive (I need a PhD just to wrap my head around it) Microsoft Word XML document. But, of course, easier to read doesn't necessarily mean easier to use. Frankly, I think that Microsoft Word does not produce consistent enough XML to really create anything usable. At least with this simplified format you can rely on some powers of string parsing / manipulation to help you get things done.
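
As a rough sketch of what that string work might start from, assuming xmlDoc is the cleaned document from the code above, XmlSearch() can gather the remaining p nodes so that each paragraph can be examined as plain text. The variable names (arrParagraphs, strText) are mine, and what you do with each line (looking for a name, an address, and so on) depends entirely on the resume:

```cfml
<!--- Gather every paragraph node left in the cleaned document. --->
<cfset arrParagraphs = XmlSearch( xmlDoc, "//p" ) />

<cfoutput>

	<cfloop
		index="intI"
		from="1"
		to="#ArrayLen( arrParagraphs )#">

		<!--- Get the paragraph text without the padding spaces. --->
		<cfset strText = Trim( arrParagraphs[ intI ].XmlText ) />

		<!--- Skip paragraphs that held only formatting. --->
		<cfif Len( strText )>
			#intI#: #HtmlEditFormat( strText )#<br />
		</cfif>

	</cfloop>

</cfoutput>
```

The XPath is case sensitive, so "//p" relies on the lowercase p tag that the second replace step leaves behind.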

So, long story short, not a Solution, but maybe a step in the right direction when it comes to dealing with Microsoft Word XML in ColdFusion. And, of course, Microsoft Word XML is SOOO inconsistent (if I opened up MS Word and re-wrote the above document, it would probably not produce the same XML) that who knows if these regular expressions would even work again.
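
One possible way to lean less on the regular expressions is to leave the document untouched and let XPath ignore the name space prefixes with local-name(). This is only a sketch: it assumes the visible text lives in w:t elements inside w:body, as it does in Word 2003's XML format, and xmlRaw and arrText are illustrative names:

```cfml
<!--- Read and parse the untouched Microsoft Word XML file. --->
<cffile
	action="read"
	file="#ExpandPath( './document.xml' )#"
	variable="strMSWordData"
	/>

<cfset xmlRaw = XmlParse( strMSWordData ) />

<!---
	Match on the local tag name only, so the name space
	prefix (w:) does not need to be stripped first.
--->
<cfset arrText = XmlSearch(
	xmlRaw,
	"//*[ local-name() = 'body' ]//*[ local-name() = 't' ]"
	) />

<!--- Output each run of text on its own line. --->
<cfoutput>
	<cfloop
		index="intI"
		from="1"
		to="#ArrayLen( arrText )#">

		#HtmlEditFormat( arrText[ intI ].XmlText )#<br />

	</cfloop>
</cfoutput>
```

This returns individual runs of text rather than whole paragraphs, so a single sentence may come back in several pieces if Word split it across runs.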

Short link: https://bennadel.com/859

Reader Comments

Boyan Kostadinov Jul 25, 2007 at 9:03 AM

Ben, I hear what you are saying but it's a good attempt anyway. Hopefully, Microsoft will standardize their crap at some point in the future. Keep up the cool stuff coming.