Nylon Technology Presentation: Introduction To XPath And XmlSearch() In ColdFusion

Posted July 13, 2007 at 7:12 AM

Tags: ColdFusion

As some of you might know, I give the occasional presentation here at our Nylon Technology staff meetings. After attending an XML session at CFUNITED, I thought it would be cool to give a presentation about XPath, as it is something that I am starting to use more and more, both in ColdFusion and in Javascript. I am by no means a master of XPath and, in fact, I was learning as I wrote this. It's really cool stuff, though, and with a fairly simple syntax, you can explore XML documents in a very powerful way.

So, sorry if there is any misinformation in here :)


The more recent releases of ColdFusion have really made great strides in XML document modelling and XML manipulation. With these innovations, the explosion of XML-based web services (SOAP, XML-RPC, etc), the use of styled XML web sites (VisualJQuery.com, World of Warcraft Armory), and Javascript libraries that can traverse the DOM using XPath (jQuery), it is becoming more and more valuable to really know how to leverage all the XML features that ColdFusion puts at our disposal.

ColdFusion's XmlSearch() function allows us to easily search XML documents using XPath:

    <cfset arrNodes = XmlSearch( XML_DOCUMENT, XPATH_QUERY ) />

XPath is a syntax for defining parts of an XML document that might consist of one node or multiple node sets. XmlSearch() always returns an array of nodes. If the XPath argument does not result in any matching nodes, XmlSearch() will return an empty array. So long as your XPath syntax is valid, XmlSearch() will never throw an error.
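
A quick way to see the empty-array behavior for yourself is to search for a node name that does not exist. This is only a minimal sketch: xmlDemo is a throwaway document built with XmlParse() just for this check, and "book" is a made-up node name that is not part of the tutorial data.

```cfml
<!--- Throwaway document, used only for this check. --->
<cfset xmlDemo = XmlParse(
    "<movies><movie><name>Mulholland Dr.</name></movie></movies>"
    ) />

<!--- There is no "book" node, so nothing matches. --->
<cfset arrNone = XmlSearch( xmlDemo, "movies/book" ) />

<!--- Outputs 0: an empty array comes back rather than an error. --->
<cfoutput>#ArrayLen( arrNone )#</cfoutput>
```

Because the result is always an array, you can safely check ArrayLen() before reaching for the first index.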

For this tutorial, let's build a ColdFusion XML document that will be used in all of our examples:

    <!---
        Let's create an XML document. For simplicity's sake,
        we are NOT going to use any name spaces because that
        just complicates our lives. We are going to create
        an xml tree of Movie data that will be used in the
        rest of our examples.
    --->
    <cfxml variable="xmlData">

        <?xml version="1.0" encoding="utf-8" ?>
        <movies>
            <movie
                imdbtitle="tt0399146"
                dateadded="07/12/2007">

                <name>A History of Violence</name>
                <releasedate>09/30/2005</releasedate>
                <genres>
                    <genre>Action</genre>
                    <genre>Crime</genre>
                    <genre>Drama</genre>
                    <genre>Thriller</genre>
                </genres>

            </movie>
            <movie
                imdbtitle="tt0265349"
                dateadded="07/06/2007">

                <name>The Mothman Prophecies</name>
                <releasedate>01/25/2002</releasedate>
                <genres>
                    <genre>Drama</genre>
                    <genre>Horror</genre>
                    <genre>Mystery</genre>
                    <genre>Thriller</genre>
                </genres>

            </movie>
            <movie
                imdbtitle="tt0166924"
                dateadded="07/01/2007">

                <name>Mulholland Dr.</name>
                <releasedate>11/26/2001</releasedate>
                <genres>
                    <genre>Drama</genre>
                    <genre>Mystery</genre>
                    <genre>Thriller</genre>
                </genres>

            </movie>
        </movies>

    </cfxml>

This XML document contains movie data including IMDB lookup IDs, titles, release dates, and genres. I am trying to keep it simple, but at the same time use a good mix of tags, attributes, and node nesting.

While we might traditionally think of a Node as a tag within the XML document object (or within the XHTML document object model), technically, just about everything contained in an XML document is some form of node. In fact, in XPath, there are seven kinds of nodes:

  • Element (a tag)
  • Attribute (name-value pair inside of a tag)
  • Text
  • Namespace
  • Processing-instruction (such as an xml-stylesheet instruction)
  • Comment
  • Document (the root node, which contains the root element)

The Document element is always the first Element node within the XML document and is the Ancestor of all other nodes in the XML document (with the exception of the processing instruction nodes). In our XML document, the movies node is our Document element.

Now that we have our XML document set up and a basic understanding of the XML node structures, let's take a look at how XPath can help us search for nodes or node sets.

If your XPath contains just the name of the root element, XmlSearch() will select the root element which, in turn, contains all of its descendants. Running this query:

    <!--- Select the root node and all its children. --->
    <cfset arrNodes = XmlSearch(
        xmlData,
        "movies"
        ) />

    <!--- Dump out resultant nodes. --->
    <cfdump
        var="#arrNodes#"
        label="Named Node Selection"
        />

... will result in the following CFDump output:


 
 
 

 
[Image: CFDump output labeled "Named Node Selection"]

As you can see, the resultant array contains one element, the root XML node, which, in turn, contains the rest of the XML document. This is the least useful of any type of XPath search. In fact, the same root element can be accessed directly through the XML object via xmlData.movies.
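
To make that equivalence concrete, here is a small sketch that compares the XmlSearch() result with direct access. It assumes a throwaway xmlDemo document (not the tutorial's xmlData) so that it can run on its own.

```cfml
<!--- Throwaway document, used only for this comparison. --->
<cfset xmlDemo = XmlParse(
    "<movies><movie><name>Mulholland Dr.</name></movie></movies>"
    ) />

<!--- Select the root element by name. --->
<cfset arrRoot = XmlSearch( xmlDemo, "movies" ) />

<cfoutput>
    <!--- One node in the array, and all three names are "movies". --->
    #ArrayLen( arrRoot )#<br />
    #arrRoot[ 1 ].XmlName#<br />
    #xmlDemo.movies.XmlName#<br />
    #xmlDemo.XmlRoot.XmlName#
</cfoutput>
```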

Selecting a named node only works for the root element; you cannot select child nodes by using just the node name. If you want to select child elements of a node within the XML document, you can provide a file-system-like path to that node, starting with the root element. For example, if we wanted to select all movie nodes, our XPath would look like this:

movies/movie

This can also be written with a leading slash:

/movies/movie

When starting with a fresh XML document, the leading slash makes no difference. This will come into play when we are searching a sub-section of an XML document. In that case, the leading slash is always an absolute path from the root element, just as a leading slash in a URL is always an absolute path to the web root (but, more on that later).

Running the following XPath on the root element:

    <!--- Select all the movie nodes. --->
    <cfset arrNodes = XmlSearch(
        xmlData,
        "movies/movie"
        ) />

    <!--- Dump out resultant nodes. --->
    <cfdump
        var="#arrNodes#"
        label="XPath: movies/movie"
        />

... will result in the following CFDump output:


 
 
 

 
[Image: CFDump output labeled "XPath: movies/movie"]

I have collapsed the XmlChildren so that you could easily see that each of the three movie nodes is returned as a separate index of the resultant nodes array. However, just so there is no confusion, the XmlChildren, at this point, do contain all the child node information for each of the movie nodes.

Also, our XPath can end in a trailing slash:

movies/movie/

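This will result in the same exact node array; it is still selecting all movie nodes that are direct children of the movies node. To me, the trailing slash seems like more of a personal preference than anything else. Keep in mind, though, that the XPath 1.0 grammar itself does not allow a path to end in a slash, so stricter XPath engines may reject it.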

The XPath can be more than one level deep. Just like a file path, the XPath can search many levels of node nesting using the slash. The following XPath would select all movie genre tags:

movies/movie/genres/genre/

When we use that as our XPath:

    <!--- Select all the genre tags. --->
    <cfset arrNodes = XmlSearch(
        xmlData,
        "movies/movie/genres/genre/"
        ) />

    <!--- Dump out resultant nodes. --->
    <cfdump
        var="#arrNodes#"
        label="XPath: movies/movie/genres/genre/"
        />

... we get the following CFDump:


 
 
 

 
[Image: CFDump output labeled "XPath: movies/movie/genres/genre/"]

As you can see, each genre tag is returned in its own index of the resultant node array. And, while we have not touched on this yet, it is important to know that nodes are added to the return array in the order in which they are encountered in the XML document. XPath / XmlSearch() searches the XML document in a depth-first, top-down approach such that XML nodes will be searched in the order that they appear in the original XML document.

Using long XPaths to get to deeply nested nodes can get burdensome. To help deal with this, you can use the // XPath construct. The // has two different behaviors depending on where it is used in the XPath. If you use it at the beginning of the path, it will find the trailing path no matter where it exists in the XML document. If you use // in the middle of an XPath, it does not require the trailing path to be a direct child of the leading path, but rather any sort of descendant.

So for example, to get all genre nodes in our XML document (no matter where they are nested), our XPath could simply be:

//genre/

Therefore, running:

    <!--- Select all the genre tags. --->
    <cfset arrNodes = XmlSearch(
        xmlData,
        "//genre/"
        ) />

... will result in the exact same node array shown above.

The thing you have to be careful of here is that the path to the resultant nodes may not always be the same. For instance, if our XML document had book and author nodes, both of which contained child "name" nodes, the XPath //name would return all name nodes, including those that were children of a book node and those that were children of an author node.

When used in the middle of the XPath, // requires an ancestor-descendant relationship, but this does not need to be a direct parent-child relationship. For instance, the following XPath requires that our genre nodes be descendants of the movie tag, but not necessarily direct children:

movies/movie//genre/

In English, this is searching for all genre nodes that are some sort of descendant of all movie nodes that are the direct child of the movies root node. Therefore, running this code:

    <!--- Select all the genre tags. --->
    <cfset arrNodes = XmlSearch(
        xmlData,
        "movies/movie//genre/"
        ) />

... will also result in the exact same node array shown above.

The // construct can be used in both capacities within the same XPath, so, for example, the following XPath:

//movie//genre/

... would find all genre nodes that are some sort of descendant of all the movie nodes which can be anywhere within the XML document.
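
The differences between these forms only show up when the document has genre nodes in more than one place. The sketch below assumes a throwaway document with a made-up "featured" element (not part of the tutorial data) that holds a genre outside of any movie. Trailing slashes are left off of the paths here.

```cfml
<!--- Throwaway document; the "featured" element is hypothetical. --->
<cfset xmlDemo = XmlParse(
    "<movies>" &
    "<movie><genres><genre>Drama</genre></genres></movie>" &
    "<featured><genre>Horror</genre></featured>" &
    "</movies>"
    ) />

<!--- Any genre, anywhere: Drama and Horror. --->
<cfset arrAnywhere = XmlSearch( xmlDemo, "//genre" ) />

<!--- Only genres that descend from a movie: Drama. --->
<cfset arrInMovie = XmlSearch( xmlDemo, "//movie//genre" ) />

<!--- Only genres that are DIRECT children of a movie: none. --->
<cfset arrDirect = XmlSearch( xmlDemo, "movies/movie/genre" ) />

<!--- Outputs 2, 1, 0. --->
<cfoutput>
    #ArrayLen( arrAnywhere )#,
    #ArrayLen( arrInMovie )#,
    #ArrayLen( arrDirect )#
</cfoutput>
```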

Up until now, we have been doing all of our searches using the root XML document. XmlSearch() can take not only an XML document, but any XML node. For the next few examples, we are going to be using XmlSearch() with the first movie node in the XML document. In order to get that movie node, we are doing this:

    <!--- Get all of the movie tags. --->
    <cfset arrMovieNodes = XmlSearch(
        xmlData,
        "movies/movie/"
        ) />

    <!--- Get the first returned movie node. --->
    <cfset xmlMovie = arrMovieNodes[ 1 ] />

This is getting all movie nodes and then creating a short hand variable, xmlMovie, to point to the first XML node that was returned in the results array. Since XML nodes are passed around by reference, any pointer to that XML node is actually pointing to the XML node within the context of the original document.
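
If the by-reference behavior described above holds in your version of ColdFusion, a change made through the short hand variable should be visible in the original document. This is a rough sketch of how you might check that, using a throwaway xmlDemo document and a made-up replacement title.

```cfml
<!--- Throwaway document, used only for this check. --->
<cfset xmlDemo = XmlParse(
    "<movies><movie><name>Mulholland Dr.</name></movie></movies>"
    ) />

<!--- Grab a pointer to the first movie node. --->
<cfset arrMovies = XmlSearch( xmlDemo, "movies/movie" ) />
<cfset xmlFirst = arrMovies[ 1 ] />

<!--- Change the name text through the pointer. --->
<cfset xmlFirst.name.XmlText = "Mulholland Drive" />

<!---
    Read the name back from the ORIGINAL document. If nodes
    are passed by reference, this shows the updated text.
--->
<cfoutput>#xmlDemo.movies.movie.name.XmlText#</cfoutput>
```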

Now that we have our first movie node, let's get all the genre nodes for that movie:

    <!--- Get all genre nodes for this movie. --->
    <cfset arrNodes = XmlSearch(
        xmlMovie,
        "genres/genre/"
        ) />

    <!--- Dump out the resultant nodes. --->
    <cfdump
        var="#arrNodes#"
        label="XPath: genres/genre/"
        />

As you can see, we are passing in our movie XML node pointer to XmlSearch() (as opposed to the entire XML document). Additionally, our XPath is now relative to the current node, NOT to the root node. Running the above code, we get the following CFDump output:


 
 
 

 
[Image: CFDump output labeled "XPath: genres/genre/"]

I am collapsing the last 3 genre nodes so you can see the full CFDump, but it is clear that we are only getting the 4 genre nodes that are descendants of the first movie node (the one we passed into XmlSearch()).

Now that we are searching relative to a sub-node of the XML document, understanding the leading / and // constructs becomes more important. A leading / starts a path that is always relative to the root node. Therefore, even though we are searching a sub-node, the XPath:

/movies/movie/

... will still return every movie node in our XML document even though they are not descendants of the first movie node. Likewise, the XPath:

//genre/

... will still return every genre node in our XML document, not just the 4 descendants of our first movie node.

XPath can also traverse node relationships using the standard path constructs "./" and "../". ./ just refers to the current node, not that special. ../ refers to the parent node of the current node. Therefore, if we wanted to select all the movie nodes starting from the first movie node, we could use the XPath:

../movie/

This would go up one node to the parent node (movies) and then select all movie nodes that are its child. Therefore, running this code:

    <!---
        Get all the movie nodes that are siblings
        of this node (including the node itself).
    --->
    <cfset arrNodes = XmlSearch(
        xmlMovie,
        "../movie/"
        ) />

    <!--- Dump out the resultant nodes. --->
    <cfdump
        var="#arrNodes#"
        label="XPath: ../movie/"
        />

... will result in the following node array:


 
 
 

 
[Image: CFDump output labeled "XPath: ../movie/"]

While I have collapsed some of the children, you can see that all three movie nodes were returned from a search that was relative only to the first movie node we passed in.
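
Putting the leading //, the ./ and the ../ constructs side by side may help. The sketch below uses a throwaway two-movie document and assumes standard XPath semantics, under which .//genre stays inside the current node while //genre starts over from the root. I have not verified this across ColdFusion versions, so treat the counts in the comments as what I would expect rather than a guarantee.

```cfml
<!--- Throwaway document, used only for this comparison. --->
<cfset xmlDemo = XmlParse(
    "<movies>" &
    "<movie><genres><genre>Action</genre><genre>Crime</genre></genres></movie>" &
    "<movie><genres><genre>Drama</genre></genres></movie>" &
    "</movies>"
    ) />

<!--- Get a pointer to the first movie node. --->
<cfset arrMovies = XmlSearch( xmlDemo, "movies/movie" ) />
<cfset xmlFirst = arrMovies[ 1 ] />

<!--- Leading //: searches the whole document (expected: 3). --->
<cfset arrEverywhere = XmlSearch( xmlFirst, "//genre" ) />

<!--- .// : descendants of the current node only (expected: 2). --->
<cfset arrOwn = XmlSearch( xmlFirst, ".//genre" ) />

<!--- .. : the parent of the current node (expected: movies). --->
<cfset arrParent = XmlSearch( xmlFirst, ".." ) />

<cfoutput>
    #ArrayLen( arrEverywhere )#,
    #ArrayLen( arrOwn )#,
    #arrParent[ 1 ].XmlName#
</cfoutput>
```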

If you ever want to return all nodes at a given level, regardless of their name, you can use the wild card, *. If we wanted to get all child nodes of the movie element, including name, releasedate, and genres, we could use the XPath:

//movie/*/

This will return 9 nodes (3 child nodes for each of the 3 movie elements).
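
As a small sketch of the wild card in action, the code below runs it against a throwaway document with only two movies (so it finds 6 child nodes rather than 9) and lists the name of each node it found.

```cfml
<!--- Throwaway two-movie document, used only for this example. --->
<cfset xmlDemo = XmlParse(
    "<movies>" &
    "<movie><name>A History of Violence</name><releasedate>09/30/2005</releasedate><genres /></movie>" &
    "<movie><name>Mulholland Dr.</name><releasedate>11/26/2001</releasedate><genres /></movie>" &
    "</movies>"
    ) />

<!--- Select every direct child element of every movie. --->
<cfset arrChildren = XmlSearch( xmlDemo, "//movie/*" ) />

<!--- Lists name, releasedate, genres for each of the two movies. --->
<cfoutput>
    <cfloop index="i" from="1" to="#ArrayLen( arrChildren )#">
        #arrChildren[ i ].XmlName#<br />
    </cfloop>
</cfoutput>
```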

Up till now, we have been searching for element (tag) nodes. However, remember that just about everything in the XML document is a node of some type and therefore we can search for it. By using the @ symbol, we can search for attribute nodes in just about the same way. Each of our movie nodes has an attribute, dateadded. If we wanted to return all those attribute nodes, we could use the XPath:

movies/movie/@dateadded/

This would select all "dateadded" attribute nodes that are a child of all movie nodes that are a child of the movies root element node. Running the following code:

    <!---
        Get all dateadded attribute nodes
        that are children of the movie nodes.
    --->
    <cfset arrNodes = XmlSearch(
        xmlData,
        "movies/movie/@dateadded/"
        ) />

    <!--- Dump out the resultant nodes. --->
    <cfdump
        var="#arrNodes#"
        label="XPath: movies/movie/@dateadded/"
        />

... we get the following CFDump output:


 
 
 

 
[Image: CFDump output labeled "XPath: movies/movie/@dateadded/"]

As you can see, these node structures are different than the element node structures that we got before, but they get returned in a results array just the same.
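
To actually use the attribute nodes, you read their name and value rather than XmlText or XmlChildren. The sketch below assumes that attribute nodes expose XmlName and XmlValue keys, which is how I understand the returned structure; if your CFDump shows different keys, use those instead. The xmlDemo document is a throwaway copy of just the movie attributes.

```cfml
<!--- Throwaway document with just the movie attributes. --->
<cfset xmlDemo = XmlParse(
    "<movies>" &
    "<movie imdbtitle='tt0399146' dateadded='07/12/2007' />" &
    "<movie imdbtitle='tt0265349' dateadded='07/06/2007' />" &
    "</movies>"
    ) />

<!--- Select the dateadded attribute node of each movie. --->
<cfset arrDates = XmlSearch( xmlDemo, "movies/movie/@dateadded" ) />

<!--- Output each attribute as name = value. --->
<cfoutput>
    <cfloop index="i" from="1" to="#ArrayLen( arrDates )#">
        #arrDates[ i ].XmlName# = #arrDates[ i ].XmlValue#<br />
    </cfloop>
</cfoutput>
```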

Just as with the element nodes, the wild card also works with attributes. If we wanted to select all attributes of the movie nodes, we could use the following XPath:

movies/movie/@*/

The @* is the wild card as it applies to attribute nodes. Therefore, running the code:

    <!---
        Get all the attribute nodes that are children
        of the movie nodes.
    --->
    <cfset arrNodes = XmlSearch(
        xmlData,
        "movies/movie/@*/"
        ) />

    <!--- Dump out the resultant nodes. --->
    <cfdump
        var="#arrNodes#"
        label="XPath: movies/movie/@*/"
        />

... we get the following CFDump output:


 
 
 

 
[Image: CFDump output labeled "XPath: movies/movie/@*/"]

Now that we understand the basic XPath structures, we can start to make things more dynamic and powerful. Instead of returning all nodes at a given path, we can start to narrow down our search results based on tag and attribute properties. To do so, we can use Predicates. Predicates are always enclosed in square brackets and come directly after the node that they are modifying.

Since we should have a solid understanding of both the results array and the XML document structure, I am going to show less code and CFDump outputs and concentrate mostly on the XPath syntax itself.

We can get a node based on its position within the node set. For example, if we wanted to get the first movie node, we could use the XPath:

movies/movie[ 1 ]/

Here, the 1 inside of the brackets is the index of the node. To get the second movie node, we could have used 2, and 3 for the third node, and so on.

If we wanted to get the last movie node, we could use the built-in last() function and the XPath:

movies/movie[ last() ]/

If we wanted to get the second to last movie node, we could use some math in conjunction with the last() function:

movies/movie[ last() - 1 ]/

If we wanted to get all movies before or after a given position, we could use the built-in position() function. The position() function is a contextual function that returns the index of the node currently being examined. The following XPath will get all movie nodes that are not first:

movies/movie[ position() > 1 ]/

These built-in functions are very cool and what's cooler is that they can be used in conjunction with each other. If we wanted to get all movie nodes except for the last one, we could get all nodes whose position is not the last index. This XPath would look like this:

movies/movie[ position() != last() ]/
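
Here is a sketch that runs each of these positional predicates against a throwaway three-movie document (names only), with the trailing slashes left off. The comments list which movies I would expect each search to return.

```cfml
<!--- Throwaway document: three movies, names only. --->
<cfset xmlDemo = XmlParse(
    "<movies>" &
    "<movie><name>A History of Violence</name></movie>" &
    "<movie><name>The Mothman Prophecies</name></movie>" &
    "<movie><name>Mulholland Dr.</name></movie>" &
    "</movies>"
    ) />

<!--- First movie: A History of Violence. --->
<cfset arrFirst = XmlSearch( xmlDemo, "movies/movie[ 1 ]" ) />

<!--- Last movie: Mulholland Dr. --->
<cfset arrLast = XmlSearch( xmlDemo, "movies/movie[ last() ]" ) />

<!--- Second to last movie: The Mothman Prophecies. --->
<cfset arrPrev = XmlSearch( xmlDemo, "movies/movie[ last() - 1 ]" ) />

<!--- All but the first: two movies. --->
<cfset arrRest = XmlSearch( xmlDemo, "movies/movie[ position() > 1 ]" ) />

<!--- All but the last: two movies. --->
<cfset arrInit = XmlSearch( xmlDemo, "movies/movie[ position() != last() ]" ) />

<cfoutput>
    #arrFirst[ 1 ].name.XmlText#<br />
    #arrLast[ 1 ].name.XmlText#<br />
    #arrPrev[ 1 ].name.XmlText#<br />
    #ArrayLen( arrRest )#, #ArrayLen( arrInit )#
</cfoutput>
```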

Now, we have to be careful here. The results of the movie nodes can be a bit misleading since there is only one set of movie nodes in our XML document. The positions above are relative to the node position as it falls within its sibling set of the same node name. The positions are NOT relative to the results array returned by XmlSearch().

This can be more clearly explained when we look at the genre tags. Each movie has its own set of genre tags so things get a bit more complicated. You might expect the following XPath:

//genre[ 1 ]/

... to return an array with only one node - the first genre node encountered. This, however, is not correct. The [1] here refers to the node's position amongst its siblings. Since each movie has its own genre nodes, the above XPath will result in an array that has three nodes. Each of the three nodes will be the first genre node found within each movie node. It might help to think of the predicates as being the HAVING clause of a GROUP BY in SQL. In that case, the HAVING condition modifies the group, not the overall result set.
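
The sketch below shows the per-sibling-set behavior on a throwaway two-movie document. It also tries wrapping the path in parentheses, (//genre)[ 1 ], which in standard XPath 1.0 applies the position to the whole node set instead. I am assuming XmlSearch() accepts that form; it is not something the examples above rely on.

```cfml
<!--- Throwaway document: two movies, two genres each. --->
<cfset xmlDemo = XmlParse(
    "<movies>" &
    "<movie><genres><genre>Action</genre><genre>Crime</genre></genres></movie>" &
    "<movie><genres><genre>Drama</genre><genre>Horror</genre></genres></movie>" &
    "</movies>"
    ) />

<!--- First genre WITHIN EACH genres node: Action and Drama. --->
<cfset arrFirstEach = XmlSearch( xmlDemo, "//genre[ 1 ]" ) />

<!--- First genre in the whole document (assumed syntax): Action. --->
<cfset arrFirstOverall = XmlSearch( xmlDemo, "(//genre)[ 1 ]" ) />

<cfoutput>
    #ArrayLen( arrFirstEach )#,
    #ArrayLen( arrFirstOverall )#
</cfoutput>
```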

We can also select nodes based on attribute properties. Just like the node positional filters, attribute predicates also go in the square brackets. And, just like selecting attribute nodes, attribute predicates are denoted by the @ symbol. With the attribute filtering, we can check for attribute existence as well as attribute values.

If we wanted to select all movie nodes that have any attribute, our XPath would be:

//movie[ @* ]/

In this case, we don't care what the attributes are or how many are in the node, we just want to get nodes that have at least one attribute.

If we wanted to select all movie nodes that have the imdbtitle attribute, our XPath would be:

//movie[ @imdbtitle ]/

Again, here we are not caring about the value of the attribute, we only care that the attribute exists within all matching movie tags. This would return all three movie nodes since all three movie nodes have the imdbtitle attribute.

If we wanted to select all movie nodes that have the imdbtitle attribute with a given value, our XPath would be:

//movie[ @imdbtitle = 'tt0265349' ]/

This will return just the single movie node for The Mothman Prophecies.
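
To see the three attribute predicates behave differently, the document needs a movie that is missing attributes. The sketch below assumes a throwaway document in which the third movie has no attributes at all; that is a made-up variation and not how the tutorial data looks.

```cfml
<!--- Throwaway document; the third movie has no attributes. --->
<cfset xmlDemo = XmlParse(
    "<movies>" &
    "<movie imdbtitle='tt0399146'><name>A History of Violence</name></movie>" &
    "<movie imdbtitle='tt0265349'><name>The Mothman Prophecies</name></movie>" &
    "<movie><name>Mulholland Dr.</name></movie>" &
    "</movies>"
    ) />

<!--- Movies with at least one attribute: the first two. --->
<cfset arrAnyAttr = XmlSearch( xmlDemo, "//movie[ @* ]" ) />

<!--- Movies with an imdbtitle attribute: the first two. --->
<cfset arrHasId = XmlSearch( xmlDemo, "//movie[ @imdbtitle ]" ) />

<!--- Movie with a specific imdbtitle value: The Mothman Prophecies. --->
<cfset arrById = XmlSearch( xmlDemo, "//movie[ @imdbtitle = 'tt0265349' ]" ) />

<cfoutput>
    #ArrayLen( arrAnyAttr )#,
    #ArrayLen( arrHasId )#,
    #arrById[ 1 ].name.XmlText#
</cfoutput>
```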

All of the above predicates have been filtering nodes based on properties of the current node and its attributes. Predicates, however, can also refer to the current node's descendants and their properties. In fact, you can even have nested predicates.

If we wanted to get all movie nodes that have at least one genre node, our XPath would be:

//movie[ genres/genre ]/

Here, we are saying that we want all movie nodes, no matter where they are in the XML document, but only so long as they have a direct child genres node that, itself, has a direct child genre node. This, of course, will return all three movies since they all have nested genre tags.

You will notice that the predicate path does NOT end in a slash. Earlier, I said that to select a node using XPath, the trailing slash was optional. This is not the case for predicate paths; a predicate path cannot end in a trailing slash - otherwise, ColdFusion would be expecting another navigational element (ie. a nested node name) on which to validate.

We just checked general descendant existence, but what if we wanted to get only the movie nodes that are in the Action genre? This is where some cool, nested predicates come into play:

//movie[ genres/genre[ text() = 'Action' ] ]/

Here, we are saying that we only want the movie nodes that have a descendant genre tag whose Text node is "Action". This, of course, will return an array with only one movie node - A History of Violence.

We could even create compound tests. If we wanted to get all movies that are classified as EITHER Action or Drama, our XPath would be:

//movie[ genres/genre[ ( text() = 'Action' ) or ( text() = 'Drama' ) ] ]/
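
Here is a sketch of both nested predicates in action. It uses a throwaway document with a trimmed-down subset of each movie's genres, chosen so that the two searches return different results (in the full tutorial data, every movie is a Drama, so the compound test would return all three).

```cfml
<!--- Throwaway document with a subset of each movie's genres. --->
<cfset xmlDemo = XmlParse(
    "<movies>" &
    "<movie><name>A History of Violence</name><genres><genre>Action</genre><genre>Drama</genre></genres></movie>" &
    "<movie><name>The Mothman Prophecies</name><genres><genre>Drama</genre><genre>Horror</genre></genres></movie>" &
    "<movie><name>Mulholland Dr.</name><genres><genre>Mystery</genre><genre>Thriller</genre></genres></movie>" &
    "</movies>"
    ) />

<!--- Movies in the Action genre: A History of Violence. --->
<cfset arrAction = XmlSearch(
    xmlDemo,
    "//movie[ genres/genre[ text() = 'Action' ] ]"
    ) />

<!--- Movies in EITHER Action or Drama: the first two movies. --->
<cfset arrEither = XmlSearch(
    xmlDemo,
    "//movie[ genres/genre[ ( text() = 'Action' ) or ( text() = 'Drama' ) ] ]"
    ) />

<cfoutput>
    #arrAction[ 1 ].name.XmlText#<br />
    #ArrayLen( arrEither )#
</cfoutput>
```

Note that a movie matching both genres still shows up only once, since the predicate filters movie nodes rather than collecting genre matches.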

Are you beginning to see how powerful XPath can be? There is much more that can be done with XPath, especially when it is applied to things like XSLT. But for the purposes of XPath for use within ColdFusion's XmlSearch() function, I think this is a pretty good beginning.


Reader Comments

Great stuff. I've done quite a bit of XML manipulation in CF. However, I've run into performance issues with large (20mb+-) files. It could be the server CF is on, but I have started using our database (Oracle) to do the parsing of these files.

Posted by Frank Wheatley on Jul 13, 2007 at 10:50 AM


@Frank,

Dang! A 20MB XML file :) That's huge. I can certainly imagine that taking a while to parse.

Posted by Ben Nadel on Jul 13, 2007 at 11:08 AM


thanks! this post helped me get ramped up on ColdFusion & XML in a hurry.

Posted by hibiscusroto (Chris Vigliotti) on Aug 14, 2007 at 10:16 AM


My pleasure. This stuff is pretty cool.

Posted by Ben Nadel on Aug 14, 2007 at 4:57 PM


Have you tested XPath searching of retrieved nodes in 8.1? e.g., in your example, you cfset xmlMovie = arrMovieNodes[ 1 ] , then proceed to cfset arrNodes = XmlSearch(xmlMovie, "genres/genre/"). This does not seem to work any longer in 8.1.

In my example, I am using http://sportsfeeds.bodoglife.com/basic/AFL.xml to get the Line nodes for each Competitor. cflooping through Event nodes and Competitor nodes, with the current one marked as "competitor", XmlSearch(competitor, "//Line") references the root node of the entire document - not the current competitor node as would be expected - hence retrieving a large array of Line nodes, every Line node in the document, instead of just the 1 or 2 beneath that competitor.

Posted by Nick Walters on May 19, 2008 at 3:04 PM

