XmlSearch() Ignores CDATA Sections In ColdFusion XPath
I ran into a really weird problem today. I was working on an event-based XML parser (along the lines of the SAX parser, but way dumbed down) and couldn't seem to get CDATA sections to be returned by my XPath queries. CDATA is an escape notation that ensures that a block of text is not parsed as if it were XML, but rather is treated as plain text. It is denoted using the following syntax:
<![CDATA[ ...your character data here... ]]>
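To make the notation concrete, here is a minimal sketch that builds the same text two ways: once with entity escapes and once inside a CDATA section. The variable names and the sample text are made up for illustration. The expectation is that both nodes report the same XmlText, since CDATA is only a different way of writing the same characters.

```cfml
<!--- Hypothetical sample: the same text, escaped two different ways. --->
<cfset xmlEscaped = XmlParse( "<note>Fish &amp; Chips &lt;b&gt;</note>" ) />
<cfset xmlCData = XmlParse( "<note><![CDATA[Fish & Chips <b>]]></note>" ) />

<!--- Compare() is case-sensitive and returns 0 when the strings match. --->
<cfset blnSameText = ( Compare( xmlEscaped.note.XmlText, xmlCData.note.XmlText ) EQ 0 ) />

<!--- Should report "Yes" if the parser treats both notations as plain text. --->
<cfset WriteOutput( YesNoFormat( blnSameText ) ) />
```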
CDATA text and standard node text values are supposed to be one and the same; they both result in node text and should not be distinguishable. And, if you CFDump out a ColdFusion XML document, you will see that this appears to be true. Let's create an XML document that has some CDATA text:
<!---
Create an XML document in which some of the text is
created with inline text and some is created with the
use of CDATA sections.
--->
<cfxml variable="xmlData">
<girl>
<name>
Sarah Vivenzio
</name>
<age>
27
</age>
<description>
<![CDATA[
She is totally hot! I mean way hot! Probably
one of the more attractive people that I have
ever had the pleasure of meeting.
]]>
</description>
</girl>
</cfxml>
<!--- Dump out the ColdFusion XML document. --->
<cfdump
var="#xmlData#"
label="XmlData With CDATA Section"
/>
As you can see, the Name and Age nodes have standard, inline text while the Description node has CDATA text. When we CFDump out this ColdFusion XML document, we get the following:
When you CFDump out a ColdFusion XML document that has CDATA text, there is no distinction - all node text appears in the node XmlText attributes. This is how it should be as CDATA is just a notation, not a distinct type of node.
Now, let's try to grab all the text nodes that are grandchildren of the root Girl node:
<!---
Now, search for all text nodes in the document that
are nested within children of the Girl node.
--->
<cfset arrTextNodes = XmlSearch(
xmlData,
"/girl/*/text()"
) />
<!--- Dump out text node array. --->
<cfdump
var="#arrTextNodes#"
label="Text Nodes via XPath"
/>
This should grab the text nodes for Name, Age, and Description; however, when we CFDump out our array of nodes, we get the following:
Notice that the text values for Name and Age came through fine, but the CDATA text for the Description node was totally ignored. I am pretty sure this is a bug - everything that I have read has said that inline node text and CDATA node text should not be distinguished in any way.
As an experiment, I created a ColdFusion XML document that had a mixture of inline text and CDATA text under the same parent node:
<!---
This time, let's create an XML document that mixes
inline node text with CDATA node text.
--->
<cfxml variable="xmlData">
<girl>
<name>
Sarah Vivenzio
</name>
<age>
27
</age>
<description>
She is insanely hot. I swear, you'd have to see it
<![CDATA[ to believe it, but you should just ]]>
take my word for it. Pretty pretty pretty good.
</description>
</girl>
</cfxml>
<!--- Output description text. --->
<cfset WriteOutput( xmlData.girl.description.XmlText ) />
Here, the girl Description node has intermingled text types. And, as documented, this creates a single text value when accessed directly via xmlData.girl.description.XmlText:
She is insanely hot. I swear, you'd have to see it to believe it, but you should just take my word for it. Pretty pretty pretty good.
However, when we use XPath to get the text node:
<!---
Now, search for all text nodes in the document that
are nested within children of the Girl node.
--->
<cfset arrTextNodes = XmlSearch(
xmlData,
"/girl/*/text()"
) />
<!--- Dump out text node array. --->
<cfdump
var="#arrTextNodes#"
label="Text Nodes via XPath"
/>
... we get the following CFDump output of our text nodes:
Here, something really weird happens - we only get the text node data up to the opening of the CDATA section. The rest of the text value, including the text that comes after the closing of the CDATA section, is completely ignored.
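One way to dig into this is to look at the child nodes of Description directly. This is a diagnostic sketch only: it assumes the XmlNodes property of a ColdFusion XML element (with XmlType and XmlValue keys on each entry) and reuses the xmlData document from above. If the inline text and the CDATA section turn out to be stored as separate sibling nodes, that would at least be consistent with XPath handing back only the first one, though that is a guess and not a confirmed cause.

```cfml
<!--- Grab every child node of Description, text and CDATA included. --->
<cfset arrNodes = xmlData.girl.description.XmlNodes />

<!--- List each node's type and value to see how the text was split up. --->
<cfoutput>
    <cfloop index="intNode" from="1" to="#ArrayLen( arrNodes )#">
        #intNode#: #arrNodes[ intNode ].XmlType# -
        [#HtmlEditFormat( arrNodes[ intNode ].XmlValue )#]<br />
    </cfloop>
</cfoutput>
```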
I could be wrong, but this seems like a very serious bug to me. This can create all sorts of complications for building data import solutions in which you get XML from any sort of third party. Has anyone else experienced this problem and is there a way to get around it?
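In the meantime, one possible workaround (a sketch, not a fix for the underlying problem) is to have XPath return the element nodes and then read each element's XmlText, since XmlText does appear to merge inline and CDATA text. The arrElements and arrText names are just for illustration, and this only helps when you know which elements hold the text you are after.

```cfml
<!--- Select the child elements of Girl instead of their text() nodes. --->
<cfset arrElements = XmlSearch( xmlData, "/girl/*" ) />

<!--- Collect the merged text of each element via XmlText. --->
<cfset arrText = ArrayNew( 1 ) />
<cfloop index="intElement" from="1" to="#ArrayLen( arrElements )#">
    <cfset ArrayAppend( arrText, Trim( arrElements[ intElement ].XmlText ) ) />
</cfloop>

<!--- Dump out the collected text values. --->
<cfdump var="#arrText#" label="Text Values via XmlText" />
```

The trade-off is that you lose the ability to target text nodes generically; the XPath has to name (or wildcard) the parent elements instead.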
Reader Comments
Just a small typo in the following statement:
It is denoted using the following syntax: <!CDATA[ ...your character data here... ]]>
Should be: <![CDATA[ ...your character data here... ]]>
Notice the first [.
@Gatzby,
Ooops, good catch. It has been fixed.
Yes - it looks like there was a problem reported for Xalan 2.5.1 which is used by CF8. Some people say this was fixed in Xalan 2.7.1. I tried upgrading Xalan to 2.7.1 but at the moment I have error 500 NullPointerException everywhere.
@Ben,
Is it possible that the CDATA block is being picked up but the <[ ... ]> brackets are not being stripped off by the text() function? In other words, in the XmlSearch() version, if you view page source, is the content there, but just not rendered to screen by the browser?
If you do this
<cfset arrTextNodes = XmlSearch(xmlData,"/girl/*" ) />
you will get the description as well. I think when you add text() at the end, it picks up only the text values, whereas CDATA is not really text... You can get it via .XmlText or .XmlCdata, which essentially return the same thing, but I'm not really sure why it doesn't return the description.
@jfish,
Good thinking. I tried that, but it is not in the source either.
Hmmm... seems like it is not fixed. If you replace xalan.jar and xml-apis.jar, and add serializer.jar from Xalan 2.7.1, CF will work fine, but your issue is still there.
@Anuj,
That gives me the element nodes under girl, but not quite the CDATA text, unless I access it directly as an XmlText attribute of one of the nodes.
I was working on some more dynamic stuff where I was specifically looking for a text node that was in an XML document created on the fly. I am "working around" the issue by doing something like this:
<cfif Find( "CDATA", NodeString )>
<!--- Fix for CDATA. --->
<cfset arrChildren[ 1 ].XmlText = xmlDoc.root.XmlText />
</cfif>
That seems to take care of the problem; I am lucky that I am in such a controlled environment.
@Radekg,
Thanks for looking into it.
Hi Ben,
I just came across this same problem in the current version of CF.
Did you ever log a bug for it? I can't find it in the public tracker?
Also, I put together some sample code. It picks up both pieces of text outside of the CDATA portion, but it omits the contents of the CDATA section itself. Any ideas?