Ben Nadel

XmlSearch() Ignores CDATA Sections In ColdFusion XPath

By Ben Nadel on
Tags: ColdFusion

I ran into a really weird problem today. I was working on an event-based XML parser (along the lines of the SAX parser, but way dumbed down) and couldn't seem to get CDATA sections to be returned in my XPath. CDATA is an escape notation that ensures that a block of text is not parsed as if it were XML, but rather utilized as plain text. It is denoted using the following syntax:

<![CDATA[ ...your character data here... ]]>

CDATA text and standard node text values are supposed to be one and the same; they both result in node text and should not be distinguishable. And, if you try to CFDump out a ColdFusion XML document, you will see that that appears to be true. Let's create an XML document that has some CDATA text:

  • <!---
  • Create an XML document in which some of the text is
  • created with inline text and some is created with the
  • use of CDATA sections.
  • --->
  • <cfxml variable="xmlData">
  •  
  • <girl>
  • <name>
  • Sarah Vivenzio
  • </name>
  • <age>
  • 27
  • </age>
  • <description>
  • <![CDATA[
  • She is totally hot! I mean way hot! Probably
  • one of the more attractive people that I have
  • ever had the pleasure of meeting.
  • ]]>
  • </description>
  • </girl>
  •  
  • </cfxml>
  •  
  •  
  • <!--- Dump out the ColdFusion XML document. --->
  • <cfdump
  • var="#xmlData#"
  • label="XmlData With CDATA Section"
  • />

As you can see, the Name and Age nodes have standard, inline text while the Description node has CDATA text. When we CFDump out this ColdFusion XML document, we get the following:

[Figure: ColdFusion XML Document Containing Both Inline Text And CDATA Text]

When you CFDump out a ColdFusion XML document that has CDATA text, there is no distinction - all node text appears in the node XmlText attributes. This is how it should be as CDATA is just a notation, not a distinct type of node.

Now, let's try to grab all the text nodes that are grandchildren of the root Girl node:

  • <!---
  • Now, search for all text nodes in the document that
  • are nested within children of the Girl node.
  • --->
  • <cfset arrTextNodes = XmlSearch(
  • xmlData,
  • "/girl/*/text()"
  • ) />
  •  
  • <!--- Dump out text node array. --->
  • <cfdump
  • var="#arrTextNodes#"
  • label="Text Nodes via XPath"
  • />

This should grab the text nodes for Name, Age, and Description; however, when we CFDump out our array of nodes, we get the following:

[Figure: Text Nodes Returned By XPath And XmlSearch() In ColdFusion When CDATA Is Used]

Notice that the text values for Name and Age came through fine, but the CDATA text for the Description node was totally ignored. I am pretty sure this is a bug - everything that I have read has said that inline node text and CDATA node text should not be distinguished in any way.

As an experiment, I created a ColdFusion XML document that had a mixture of inline text and CDATA text under the same parent node:

  • <!---
  • This time, let's create an XML document that mixes
  • inline node text with CDATA node text.
  • --->
  • <cfxml variable="xmlData">
  •  
  • <girl>
  • <name>
  • Sarah Vivenzio
  • </name>
  • <age>
  • 27
  • </age>
  • <description>
  • She is insanely hot. I swear, you'd have to see it
  • <![CDATA[ to believe it, but you should just ]]>
  • take my word for it. Pretty pretty pretty good.
  • </description>
  • </girl>
  •  
  • </cfxml>
  •  
  • <!--- Output description text. --->
  • <cfset WriteOutput( xmlData.girl.description.XmlText ) />

Here, the girl Description node has intermingled text types. And, as documented, this creates a single text value when accessed directly via xmlData.girl.description.XmlText:

She is insanely hot. I swear, you'd have to see it to believe it, but you should just take my word for it. Pretty pretty pretty good.

However, when we use XPath to get the text node:

  • <!---
  • Now, search for all text nodes in the document that
  • are nested within children of the Girl node.
  • --->
  • <cfset arrTextNodes = XmlSearch(
  • xmlData,
  • "/girl/*/text()"
  • ) />
  •  
  • <!--- Dump out text node array. --->
  • <cfdump
  • var="#arrTextNodes#"
  • label="Text Nodes via XPath"
  • />

... we get the following CFDump output of our text nodes:

[Figure: Text Nodes Returned By XPath And XmlSearch() When Inline Text And CDATA Is Used In Same Node]

Here, something really weird happens - we only get the text node data up to the opening of the CDATA section. The rest of the text value, including the text that comes after the closing of the CDATA section, is completely ignored.

I could be wrong, but this seems like a very serious bug to me. This can create all sorts of complications for building data import solutions in which you get XML from any sort of third party. Has anyone else experienced this problem and is there a way to get around it?
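
One workaround that looks reasonable, going by the XmlText behavior shown above, is to select the element nodes themselves rather than their text() children, and then read XmlText off of each element. The following is only a minimal sketch and does nothing about the underlying XPath issue; the variable names arrNodes, arrText, and intNode are placeholders, and the shortened description text is a stand-in for illustration.

```cfml
<!--- Build a small XML document with one CDATA-backed node. --->
<cfxml variable="xmlData">

	<girl>
		<name>Sarah Vivenzio</name>
		<age>27</age>
		<description>
			<![CDATA[ One of the more attractive people I have met. ]]>
		</description>
	</girl>

</cfxml>

<!---
	Select the element nodes under Girl instead of their text()
	children. Assumption: XmlText on an element includes CDATA
	text, as the CFDump of the document suggests.
--->
<cfset arrNodes = XmlSearch( xmlData, "/girl/*" ) />

<!--- Collect the trimmed XmlText value of each element node. --->
<cfset arrText = ArrayNew( 1 ) />

<cfloop
	index="intNode"
	from="1"
	to="#ArrayLen( arrNodes )#"
	step="1">

	<cfset ArrayAppend( arrText, Trim( arrNodes[ intNode ].XmlText ) ) />

</cfloop>

<!--- Dump out the collected text values. --->
<cfdump
	var="#arrText#"
	label="Node Text via XmlText"
	/>
```

Since XmlText flattens inline text and CDATA text into a single value, this only helps when you want the full text of each element, not the individual text nodes.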




Reader Comments

Just a small typo in the following statement:

It is denoted using the following syntax: <!CDATA[ ...your character data here... ]]>

Should be: <![CDATA[ ...your character data here... ]]>

Notice the first [.


Yes - it looks like there was a problem reported for Xalan 2.5.1 which is used by CF8. Some people say this was fixed in Xalan 2.7.1. I tried upgrading Xalan to 2.7.1 but at the moment I have error 500 NullPointerException everywhere.


@Ben,

Is it possible that the CDATA block is being picked up but the <[ ... ]> brackets are not being stripped off by the text() function? In other words, in the XmlSearch() version, if you view page source, is the content there, but just not rendered to screen by the browser?


If you do this
<cfset arrTextNodes = XmlSearch(xmlData,"/girl/*" ) />

you will get the description as well. I think when you add text() at the end it picks up only the text values, where CDATA is not really text... You can get it via .XmlText or .XmlCdata, which essentially return the same thing, but I'm not really sure why it doesn't return the description.


Hmmm... seems like it is not fixed. If you replace xalan.jar and xml-apis.jar, and add serializer.jar from Xalan 2.7.1, CF will work fine but your issue is still there.


@Anuj,

That gives me the element nodes under girl, but not quite the CDATA text, unless I access it directly as an XmlText attribute of one of the nodes.

I was working on some more dynamic stuff where I was specifically looking for a text node that was in an XML document created on the fly. I am "working around" the issue by doing something like this:

<cfif Find( "CDATA", NodeString )>

<!--- Fix for CDATA. --->
<cfset arrChildren[ 1 ].XmlText = xmlDoc.root.XmlText />

</cfif>

That seems to take care of the problem; I am lucky that I am in such a controlled environment.


Hi Ben,
I just came across this same problem in the current version of CF.

Did you ever log a bug for it? I can't find it in the public tracker?

Also, I did some sample code - but while it picks up both lots of text outside of the CDATA portion - it omits the internals of CDATA, any ideas?

  • <!--- create XML --->
  • <cfxml variable="myXML">
  • <!--- construct the XML document --->
  • <sampleXML>
  • <sampleXMLNode>
  • This is some plain text 1.
  • <![CDATA[ This is Plain Text 2 ]]>
  • This is more plain text in the sampleXML node.
  • </sampleXMLNode>
  • </sampleXML>
  • </cfxml>
  •  
  • <cfset xmlNodes = XmlSearch(myXml, '/sampleXML/*/text()') >
  •  
  • <!--- Dump out all the nodes of type text. --->
  • <cfdump var="#xmlNodes#" label="1">
  •  
  • <cfif Find( "CDATA", myXML )>
  • <!--- Fix for CDATA. --->
  • <cfset xmlNodes[ 1 ].XmlValue =myXML.sampleXML.sampleXMLNode />
  • </cfif>
  •  
  • <!--- Dump out all the nodes of type text after treatment.. --->
  • <cfdump var="#xmlNodes#" label="2">

