I am in the middle of helping someone use ColdFusion's CFHttp tag to submit form data. To do this, we needed to grab the form fields from a target URL, parse them out, and then submit our own form values. As a result, it became important to parse individual HTML tags into ColdFusion structures so that we could read the attributes and then use them in our ColdFusion CFHttpParam tags.
When it comes to parsing an HTML tag in ColdFusion, there are two things we are looking for: the tag name and the tag's attributes.
Both of these things follow textual patterns that can be parsed using regular expressions. The tag name is quite easy: it's simply the first "word" that comes after the open bracket of the tag. The attributes, on the other hand, are a bit more challenging. While XHTML requires quoted attribute values, a look at people's source code shows that not everyone uses quotes. To complicate things further, people will even mix quoted and non-quoted attributes in the same tag. Therefore, we need to be able to handle both situations.
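To make the tag-name half concrete, here is a minimal sketch in Python (not ColdFusion, and not the UDF from this post). The pattern and the `tag_name` helper are my own assumptions about what "the first word after the open bracket" means:

```python
import re

# Assumed pattern: optional whitespace after "<", then the first run of
# word characters (plus ":" and "-") is treated as the tag name.
TAG_NAME = re.compile(r"^\s*<\s*([\w:-]+)")


def tag_name(html):
    """Return the tag name of a single opening tag, or "" if none is found."""
    match = TAG_NAME.match(html)
    return match.group(1) if match else ""


print(tag_name('<input type="text" name=email disabled />'))  # input
print(tag_name("<H1 class=title>"))  # H1
```

The name is returned with its original casing; normalizing it is left to the caller.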
The following ColdFusion user defined function (UDF), ParseHTMLTag(), uses some nifty regular expressions (Steve, I did my best :)) to allow for both attribute types. It parses the given HTML tag and returns a structure with the keys:
HTML - The raw HTML that was passed in (echoed back).
Name - The tag name (ex. Input, H1, Textarea).
Attributes - A structure containing the name-value attribute pairs. Each name-value pair is stored in the structure by its attribute name.
Here is the ColdFusion UDF, ParseHTMLTag():
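As a rough illustration of the approach, here is a sketch in Python rather than ColdFusion. It is a stand-in under stated assumptions, not the ParseHTMLTag() UDF itself: the function name `parse_html_tag`, the lowercase keys, and both regular expressions are hypothetical choices of mine that mirror the HTML / Name / Attributes structure described above.

```python
import re

# Assumed patterns, not the original UDF's: the tag name is the first word
# after "<"; each attribute is a name optionally followed by "=" and a
# double-quoted or unquoted value.
NAME_RE = re.compile(r"<\s*([\w:-]+)")
ATTR_RE = re.compile(
    r"""\s+([\w:-]+)          # attribute name
        (?:\s*=\s*            # optional equals sign and value
            (?:"([^"]*)"      # double-quoted value
              |([^\s"'>]+)    # unquoted value
            )
        )?""",
    re.VERBOSE,
)


def parse_html_tag(html):
    """Parse one opening tag into a dict with html, name, and attributes."""
    tag = {"html": html, "name": "", "attributes": {}}
    name_match = NAME_RE.search(html)
    if not name_match:
        return tag
    tag["name"] = name_match.group(1)
    # Scan for attributes only after the tag name.
    for match in ATTR_RE.finditer(html, name_match.end()):
        name, quoted, unquoted = match.groups()
        # Attributes with no value (e.g. disabled) default to "".
        value = quoted if quoted is not None else (unquoted or "")
        tag["attributes"][name] = value
    return tag


sample = '''<input
    type="text"
    name=email
    value="Ben Nadel"
    disabled
    />'''
result = parse_html_tag(sample)
print(result["name"])
print(result["attributes"])
```

This sketch only handles double-quoted and unquoted values, and an unquoted value written directly against a closing `/>` would pick up the slash, so treat it as a starting point rather than a complete parser.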
To test it, I am going to build an HTML Input tag that has line returns, quoted attributes, non-quoted attributes, and attributes that have no value:
Running the above, we get the following CFDump output:
[Image: CFDump of the structure returned by ParseHTMLTag()]
As you can see, the attribute that had no value (disabled) was stored with an empty string. Since every struct key needs a value, this is the best default we can give it. Additionally, since ColdFusion treats NULL values as empty strings, this is consistent with that idea (a non-existent attribute value can be thought of as null).
This ColdFusion UDF is just one part of a larger form-parsing algorithm that I am working on. Hopefully, I will post that up when it is done.
This will come in handy. Thanks dude!
Posted by Boyan on Jun 18, 2007 at 11:35 AM
@Boyan,
Hopefully, I can get the next part up at lunch - the algorithm that actually uses this as a sub-function. But, I might need to flesh that one out a bit more.
Posted by Ben Nadel on Jun 18, 2007 at 11:39 AM
@ben - It might be overkill for what you are doing, but have you thought about using jTidy or something of the like to parse the HTML instead of using regex?
It makes it super easy, once your content is converted to its DOM, to pick off target elements and their attributes, and it's fairly performant. I've been using it on my site as the primary parsing strategy (with a regular expression failover when jTidy can't convert the page to its object model) for finding links pointing to mp3s.
This post has some info on using it.
http://jeffcoughlin.com/?pg=9&fn=3&id=1
Posted by Justin on Jun 18, 2007 at 1:12 PM
@Justin,
I have not heard of jTidy before. I will definitely check it out. One of the nice things about encapsulating this functionality is that I can swap out sub-components and it should be all good.
Thanks for bringing jTidy to my attention.
Posted by Ben Nadel on Jun 18, 2007 at 1:18 PM
"(Steve, I did my best :))"
Very funny. You are, of course, rather masterful with regexes yourself. ;-)
One thing you might also want to account for is single-quoted attribute values. Unlike unquoted values (which are valid in HTML4, but not XHTML), I believe single-quoted values are valid even in XHTML.
Posted by Steve on Jun 18, 2007 at 2:42 PM
@Steve,
Oh crap, I totally forgot about single quotes! D'oh!
As far as all the regular expression stuff to grab the attributes, I think I learned most of that from looking at your regular expression (esp. for nested patterns). Obviously, not exactly the same, but inspiring nonetheless.
Posted by Ben Nadel on Jun 18, 2007 at 2:51 PM
"Oh crap, I totally forgot about single quotes! D'oh!"
...and don't forget about nested quotes, with single quotes inside double quotes or double quotes inside single quotes, such as
value="david 'the big bad' dad"
or
value='devin "little man" his son'
Will your regex and UDF get these right?
Does anyone know where XHTML comes down on this sort of thing? Are you supposed to escape certain characters or something like that?
Posted by macbuoy on Jun 18, 2007 at 5:42 PM
@Macbuoy,
The nesting of quotes should be ok because whenever the pattern comes to a double quote, it matches [^"], which keeps consuming characters until it reaches the next double quote (allowing single quotes to be part of that data)... and of course, vice versa for double quotes within single quotes.
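A quick sketch of that idea, in Python rather than ColdFusion. The `VALUE_RE` pattern and the `quoted_value` helper are assumptions for illustration, not the exact expression used in the UDF:

```python
import re

# Assumed value pattern: a double-quoted run of non-double-quote characters,
# or a single-quoted run of non-single-quote characters.
VALUE_RE = re.compile(r"""=\s*(?:"([^"]*)"|'([^']*)')""")


def quoted_value(attribute):
    """Return the quoted value from a name="value" or name='value' pair."""
    match = VALUE_RE.search(attribute)
    if not match:
        return None
    double, single = match.groups()
    return double if double is not None else single


print(quoted_value('''value="david 'the big bad' dad"'''))
print(quoted_value("""value='devin "little man" his son'"""))
```

Each alternative only stops at its own kind of quote, so the other kind passes through as ordinary data.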
As for the XHTML stuff, I am not sure. I use double quotes for all my attributes, that much I can tell you.
Posted by Ben Nadel on Jun 18, 2007 at 5:53 PM