XmlSearch() Ignores CDATA Sections In ColdFusion XPath

Posted May 21, 2008 at 9:02 AM by Ben Nadel

Tags: ColdFusion

I ran into a really weird problem today. I was working on an event-based XML parser (along the lines of the SAX parser, but way dumbed down) and couldn't seem to get CDATA sections to be returned in my XPath results. CDATA is an escape notation that ensures that a block of text is not parsed as if it were XML, but is instead treated as plain text. It is denoted using the following syntax:

<![CDATA[ ...your character data here... ]]>

CDATA text and standard node text values are supposed to be one and the same; they both result in node text and should not be distinguishable. And, if you try to CFDump out a ColdFusion XML document, you will see that that appears to be true. Let's create an XML document that has some CDATA text:

    <!---
        Create an XML document in which some of the text is
        created with inline text and some is created with the
        use of CDATA sections.
    --->
    <cfxml variable="xmlData">

        <girl>
            <name>
                Sarah Vivenzio
            </name>
            <age>
                27
            </age>
            <description>
                <![CDATA[
                    She is totally hot! I mean way hot! Probably
                    one of the more attractive people that I have
                    ever had the pleasure of meeting.
                ]]>
            </description>
        </girl>

    </cfxml>


    <!--- Dump out the ColdFusion XML document. --->
    <cfdump
        var="#xmlData#"
        label="XmlData With CDATA Section"
        />

As you can see, the Name and Age nodes have standard, inline text while the Description node has CDATA text. When we CFDump out this ColdFusion XML document, we get the following:

[Image: ColdFusion XML Document Containing Both Inline Text And CDATA Text]

When you CFDump out a ColdFusion XML document that has CDATA text, there is no distinction - all node text appears in the node XmlText attributes. This is how it should be as CDATA is just a notation, not a distinct type of node.
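To make that equivalence concrete, here is a minimal sketch. The two sample documents and the variable names (xmlEscaped, xmlCData, isSameText) are made up for illustration and are not part of the demo above. One document escapes its special characters with entities and the other wraps the same characters in a CDATA section; if CDATA really is just a notation, both XmlText values should come out identical:

```cfml
<!---
    Hypothetical sample documents: the same text written two ways,
    once with entity escaping and once with a CDATA section.
--->
<cfset xmlEscaped = XmlParse(
    "<note><body>5 &lt; 10 &amp; 10 &gt; 5</body></note>"
    ) />

<cfset xmlCData = XmlParse(
    "<note><body><![CDATA[5 < 10 & 10 > 5]]></body></note>"
    ) />

<!--- Compare() is case-sensitive and returns 0 for identical strings. --->
<cfset isSameText = (
    Compare(
        xmlEscaped.note.body.XmlText,
        xmlCData.note.body.XmlText
        ) EQ 0
    ) />

<cfoutput>
    Escaped: #HtmlEditFormat( xmlEscaped.note.body.XmlText )#<br />
    CDATA: #HtmlEditFormat( xmlCData.note.body.XmlText )#<br />
    Same text: #isSameText#
</cfoutput>
```

HtmlEditFormat() is only there so that the browser renders the angle brackets instead of swallowing them.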

Now, let's try to grab all the text nodes that are grandchildren of the root Girl node:

    <!---
        Now, search for all text nodes in the document that
        are nested within children of the Girl node.
    --->
    <cfset arrTextNodes = XmlSearch(
        xmlData,
        "/girl/*/text()"
        ) />

    <!--- Dump out text node array. --->
    <cfdump
        var="#arrTextNodes#"
        label="Text Nodes via XPath"
        />

This should grab the text nodes for Name, Age, and Description; however, when we CFDump out our array of nodes, we get the following:

[Image: Text Nodes Returned By XPath And XmlSearch() In ColdFusion When CDATA Is Used]

Notice that the text values for Name and Age came through fine, but the CDATA text for the Description node was totally ignored. I am pretty sure this is a bug - everything that I have read has said that inline node text and CDATA node text should not be distinguished in any way.

As an experiment, I created a ColdFusion XML document that had a mixture of inline text and CDATA text under the same parent node:

    <!---
        This time, let's create an XML document that mixes
        inline node text with CDATA node text.
    --->
    <cfxml variable="xmlData">

        <girl>
            <name>
                Sarah Vivenzio
            </name>
            <age>
                27
            </age>
            <description>
                She is insanely hot. I swear, you'd have to see it
                <![CDATA[ to believe it, but you should just ]]>
                take my word for it. Pretty pretty pretty good.
            </description>
        </girl>

    </cfxml>

    <!--- Output description text. --->
    <cfset WriteOutput( xmlData.girl.description.XmlText ) />

Here, the girl Description node has intermingled text types. And, as documented, this creates a single text value when accessed directly via xmlData.girl.description.XmlText:

She is insanely hot. I swear, you'd have to see it to believe it, but you should just take my word for it. Pretty pretty pretty good.

However, when we use XPath to get the text node:

    <!---
        Now, search for all text nodes in the document that
        are nested within children of the Girl node.
    --->
    <cfset arrTextNodes = XmlSearch(
        xmlData,
        "/girl/*/text()"
        ) />

    <!--- Dump out text node array. --->
    <cfdump
        var="#arrTextNodes#"
        label="Text Nodes via XPath"
        />

... we get the following CFDump output of our text nodes:

[Image: Text Nodes Returned By XPath And XmlSearch() When Inline Text And CDATA Is Used In Same Node]

Here, something really weird happens - we only get the text node data up to the opening of the CDATA section. The rest of the text value, including the text that comes after the closing CDATA delimiter, is completely ignored.

I could be wrong, but this seems like a very serious bug to me. This can create all sorts of complications for building data import solutions in which you get XML from any sort of third party. Has anyone else experienced this problem and is there a way to get around it?
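In the meantime, one possible stopgap is to skip text() entirely: select the parent element nodes and read each one's XmlText, which (as the CFDump above showed) does include the CDATA text. This is only a sketch, not a fix for the underlying XPath behavior. The trimmed-down document and the names arrElements, arrTextValues, and i are made up for illustration:

```cfml
<!--- Hypothetical, trimmed-down version of the document above. --->
<cfxml variable="xmlData">

    <girl>
        <name>Sarah Vivenzio</name>
        <description>
            You'd have to see it
            <![CDATA[ to believe it, but you should just ]]>
            take my word for it.
        </description>
    </girl>

</cfxml>

<!--- Select the element nodes rather than their text() nodes. --->
<cfset arrElements = XmlSearch( xmlData, "/girl/*" ) />

<!--- Collect the combined (inline + CDATA) text of each element. --->
<cfset arrTextValues = ArrayNew( 1 ) />

<cfloop index="i" from="1" to="#ArrayLen( arrElements )#">
    <cfset ArrayAppend(
        arrTextValues,
        Trim( arrElements[ i ].XmlText )
        ) />
</cfloop>

<!--- Dump out the collected text values. --->
<cfdump
    var="#arrTextValues#"
    label="Text Values via XmlText"
    />
```

The trade-off is that this hands back one combined string per element rather than individual text nodes, so it only helps when you don't need to know where each chunk of text sits within its parent.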




Reader Comments

May 21, 2008 at 9:13 AM
1 Comments

Just a small typo in the following statement:

It is denoted using the following syntax: <!CDATA[ ...your character data here... ]]>

Should be: <![CDATA[ ...your character data here... ]]>

Notice the first [.


May 21, 2008 at 9:16 AM
10,640 Comments

@Gatzby,

Ooops, good catch. It has been fixed.


May 21, 2008 at 10:05 AM
14 Comments

Yes - it looks like there was a problem reported for Xalan 2.5.1, which is used by CF8. Some people say this was fixed in Xalan 2.7.1. I tried upgrading Xalan to 2.7.1, but at the moment I am getting error 500 NullPointerException everywhere.


May 21, 2008 at 10:07 AM
131 Comments

@Ben,

Is it possible that the CDATA block is being picked up but the <[ ... ]> brackets are not being stripped off by the text() function? In other words, in the XmlSearch() version, if you view page source, is the content there, but just not rendered to screen by the browser?


May 21, 2008 at 10:09 AM
12 Comments

If you do this
<cfset arrTextNodes = XmlSearch(xmlData,"/girl/*" ) />

you will get the description as well. I think when you add text() at the end, it picks up only the text values, whereas CDATA is not really text... You can get it via .XmlText or .XmlCdata, which essentially return the same thing, but I'm not really sure why it doesn't return the description.


May 21, 2008 at 10:11 AM
10,640 Comments

@jfish,

Good thinking. I tried that, but it is not in the source either.


May 21, 2008 at 10:21 AM
14 Comments

Hmmm... seems like it is not fixed. If you replace xalan.jar and xml-apis.jar, and add serializer.jar from Xalan 2.7.1, CF will work fine, but your issue is still there.


May 21, 2008 at 10:21 AM
10,640 Comments

@Anuj,

That gives me the element nodes under girl, but not quite the CDATA text, unless I access it directly as an XmlText attribute of one of the nodes.

I was working on some more dynamic stuff where I was specifically looking for a text node that was in an XML document created on the fly. I am "working around" the issue by doing something like this:

<cfif Find( "CDATA", NodeString )>

<!--- Fix for CDATA. --->
<cfset arrChildren[ 1 ].XmlText = xmlDoc.root.XmlText />

</cfif>

That seems to take care of the problem; I am lucky that I am in such a controlled environment.


May 21, 2008 at 10:22 AM
10,640 Comments

@Radekg,

Thanks for looking into it.

