Earlier today, I talked about helping someone grab HTML form data and then resubmit it along with the existing form data. As an introduction to that, I showed how to parse an HTML tag into a ColdFusion structure. Now, we are going to build on that: grab the forms out of a page, parse the inputs, and resubmit the data with a combination of existing form fields and our own form field data. This demo does not cover all aspects of form scraping, nor does it cover maintaining sessions across CFHttp calls, but it should be sufficient to give some direction.
Just a reminder from the previous post: we are going to use the ColdFusion user-defined function, ParseHTMLTag(), to take HTML tag data and create a ColdFusion structure:
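For readers who do not have the previous post handy, here is a minimal sketch of what a tag parser of this kind might look like. It is not the original ParseHTMLTag() code: the function name ParseHTMLTagSketch() and the result keys (HTML, Name, Attributes) are assumptions made for illustration, guided only by the way the parsed data is used later in this post.

```cfm
<cffunction name="ParseHTMLTagSketch" access="public" returntype="struct" output="false"
	hint="Illustrative stand-in for ParseHTMLTag(); not the original implementation.">

	<cfargument name="HTML" type="string" required="true" />

	<cfset var objTag = StructNew() />
	<cfset var objName = "" />
	<cfset var objPattern = "" />
	<cfset var objMatcher = "" />
	<cfset var strName = "" />
	<cfset var strValue = "" />

	<!--- Assumed result shape: raw markup, tag name, and an attribute struct. --->
	<cfset objTag.HTML = ARGUMENTS.HTML />
	<cfset objTag.Name = "" />
	<cfset objTag.Attributes = StructNew() />

	<!--- The tag name is the first run of word characters after the "<". --->
	<cfset objName = REFind( "^\s*<(\w+)", ARGUMENTS.HTML, 1, true ) />
	<cfif objName.Pos[ 1 ]>
		<cfset objTag.Name = Mid( ARGUMENTS.HTML, objName.Pos[ 2 ], objName.Len[ 2 ] ) />
	</cfif>

	<!---
		Group 1 is the attribute name. Group 2 is the optional "=value" part
		(double-quoted, single-quoted, or bare). Group 2 always participates
		in the match, so it is an empty string for value-less attributes.
	--->
	<cfset objPattern = CreateObject( "java", "java.util.regex.Pattern" ).Compile(
		"\s([\w:-]+)((?:\s*=\s*(?:""[^""]*""|'[^']*'|[^\s>]*))?)"
		) />
	<cfset objMatcher = objPattern.Matcher( ARGUMENTS.HTML ) />

	<cfloop condition="objMatcher.Find()">
		<cfset strName = objMatcher.Group( JavaCast( "int", 1 ) ) />

		<!--- Drop the leading "=" and surrounding whitespace. --->
		<cfset strValue = REReplace( objMatcher.Group( JavaCast( "int", 2 ) ), "^\s*=\s*", "" ) />

		<!--- If the value is wrapped in matching quotes, strip them. --->
		<cfif REFind( "^(""[^""]*""|'[^']*')$", strValue )>
			<cfset strValue = REReplace( strValue, "^.|.$", "", "all" ) />
		</cfif>

		<cfset objTag.Attributes[ strName ] = strValue />
	</cfloop>

	<cfreturn objTag />
</cffunction>

<!--- Usage: parse a single input tag and dump the resulting structure. --->
<cfset objTag = ParseHTMLTagSketch( '<input type="text" name="q" value="" />' ) />
<cfdump var="#objTag#" />
```

The attribute pattern is compiled through java.util.regex.Pattern so that the matches can be walked one at a time with Find(). The double quotes are doubled only because they sit inside a ColdFusion string.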
Now, we are going to use that function from within a new ColdFusion user-defined function, GetPageForms(). This function will iterate over the forms in a target page (given the URL or its actual HTML content), parse each form into a ColdFusion structure, and then return the form objects in an array:
There's a lot going on in that function. Basically, it creates patterns for both the form tags and the nested form field tags and then uses ParseHTMLTag() to parse each into a usable ColdFusion structure. The algorithm will parse Select and Textarea tags, but these are not as easy to use. For our purposes, in order to keep this demo as simple as possible, we are just going to grab the opening tag of the Select and Textarea inputs. As it turns out, this won't cause too many problems as we are going to demo this on a form that only has input fields and buttons.
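The general shape of that algorithm can be sketched roughly as follows. This is a simplified illustration, not the original GetPageForms() code: the name GetPageFormsSketch(), the Fields key, and the exact patterns are assumptions, and this version accepts only HTML content (no URL). It also uses a naive `[^>]*` to find the end of a tag, which would break on a ">" inside a quoted attribute value.

```cfm
<cffunction name="GetPageFormsSketch" access="public" returntype="array" output="false"
	hint="Illustrative sketch: returns one struct per form, each with a Fields array.">

	<cfargument name="HTML" type="string" required="true" />

	<cfset var arrForms = ArrayNew( 1 ) />
	<cfset var objForm = "" />
	<cfset var objFormPattern = "" />
	<cfset var objFieldPattern = "" />
	<cfset var objFormMatcher = "" />
	<cfset var objFieldMatcher = "" />

	<!--- (?is): case-insensitive, and "." also matches line breaks. --->
	<cfset objFormPattern = CreateObject( "java", "java.util.regex.Pattern" ).Compile(
		"(?is)(<form\b[^>]*>)(.*?)</form\s*>"
		) />

	<!--- Only the opening tag of each field is captured, Select and Textarea included. --->
	<cfset objFieldPattern = CreateObject( "java", "java.util.regex.Pattern" ).Compile(
		"(?i)<(?:input|button|select|textarea)\b[^>]*>"
		) />

	<cfset objFormMatcher = objFormPattern.Matcher( ARGUMENTS.HTML ) />

	<cfloop condition="objFormMatcher.Find()">

		<!--- Group 1 is the opening form tag; group 2 is everything inside the form. --->
		<cfset objForm = ParseHTMLTag( objFormMatcher.Group( JavaCast( "int", 1 ) ) ) />
		<cfset objForm.Fields = ArrayNew( 1 ) />

		<cfset objFieldMatcher = objFieldPattern.Matcher(
			objFormMatcher.Group( JavaCast( "int", 2 ) )
			) />

		<cfloop condition="objFieldMatcher.Find()">
			<cfset ArrayAppend( objForm.Fields, ParseHTMLTag( objFieldMatcher.Group() ) ) />
		</cfloop>

		<cfset ArrayAppend( arrForms, objForm ) />

	</cfloop>

	<cfreturn arrForms />
</cffunction>
```

Matching the form body lazily (`.*?`) keeps each match from running past the first closing form tag, so the fields of one form are not attributed to another.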
And, for that demo, we are going to submit a keyword search on the Flickr.com homepage. To start off, let's just grab the forms off of Flickr.com using our GetPageForms() method. This method can take either a URL or actual HTML page content. Since we need to get the page content anyway, we might as well just send in the Flickr.com URL:
When we run that, GetPageForms() performs a CFHttp request to get the Flickr.com page data. It then parses the resulting page content and returns an array of the Form objects on that page. Running that, we get the following CFDump output:
[Image: CFDump output of the form array returned by GetPageForms()]
As you can see, the Flickr.com homepage search form is quite simple; it has the search button, the search criteria, and a small form tag. Now that we have that, we are going to mimic the form submission using ColdFusion's CFHttp tag. We have to be careful when doing this; the field we really care about is the "Q" field, for the search criteria. We don't want to end up submitting this twice, so when we mimic our form fields using CFHttpParam, we have to be careful to customize that one, rather than just echoing it back.
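A rough sketch of that kind of submission is shown here. To keep it self-contained, the parsed form is built by hand rather than taken from GetPageForms(); the Fields key, the field names other than Q, the example.com action URL, and the search term "kittens" are all hypothetical placeholders.

```cfm
<!--- Hypothetical stand-in for one parsed form; real data would come from GetPageForms(). --->
<cfset objForm = StructNew() />
<cfset objForm.Attributes = StructNew() />
<cfset objForm.Attributes.Action = "http://www.example.com/search" />
<cfset objForm.Fields = ArrayNew( 1 ) />

<!--- The search criteria field. --->
<cfset objField = StructNew() />
<cfset objField.Attributes = StructNew() />
<cfset objField.Attributes.Name = "q" />
<cfset objField.Attributes.Value = "" />
<cfset ArrayAppend( objForm.Fields, objField ) />

<!--- Some other field that should simply be echoed back. --->
<cfset objField = StructNew() />
<cfset objField.Attributes = StructNew() />
<cfset objField.Attributes.Name = "w" />
<cfset objField.Attributes.Value = "all" />
<cfset ArrayAppend( objForm.Fields, objField ) />

<cfhttp url="#objForm.Attributes.Action#" method="post" result="objResponse">

	<cfloop index="intField" from="1" to="#ArrayLen( objForm.Fields )#">

		<cfset objField = objForm.Fields[ intField ] />

		<!--- Fields without a name are never submitted by a browser, so skip them. --->
		<cfif StructKeyExists( objField.Attributes, "Name" )>

			<cfif (objField.Attributes.Name EQ "q")>

				<!--- Send our own search criteria instead of echoing the parsed value. --->
				<cfhttpparam type="formfield" name="q" value="kittens" />

			<cfelseif StructKeyExists( objField.Attributes, "Value" )>

				<!--- Echo every other field back exactly as it was parsed. --->
				<cfhttpparam
					type="formfield"
					name="#objField.Attributes.Name#"
					value="#objField.Attributes.Value#"
					/>

			</cfif>

		</cfif>

	</cfloop>

</cfhttp>

<!--- Render the returned page content directly. --->
<cfoutput>#objResponse.FileContent#</cfoutput>
```

Because the Q field gets its own branch and is never echoed, it is sent exactly once.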
As you can see, we are merely looping over the form fields returned from GetPageForms() and echoing them back in our form submission. It is slightly complicated because we have to treat the Q field specially. However, we could have simplified the process by altering the form structure data before looping over it (updating the objField.Attributes.Value attribute for the Q field before iterating over the fields); then, we could have treated all the form fields uniformly.
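That alternative might look something like the sketch below. As before, the field array is built by hand so the snippet stands alone, and the extra field name, the example.com URL, and the "kittens" search term are hypothetical.

```cfm
<!--- Hypothetical parsed fields; real data would come from GetPageForms(). --->
<cfset arrFields = ArrayNew( 1 ) />

<cfset objField = StructNew() />
<cfset objField.Attributes = StructNew() />
<cfset objField.Attributes.Name = "q" />
<cfset objField.Attributes.Value = "" />
<cfset ArrayAppend( arrFields, objField ) />

<cfset objField = StructNew() />
<cfset objField.Attributes = StructNew() />
<cfset objField.Attributes.Name = "w" />
<cfset objField.Attributes.Value = "all" />
<cfset ArrayAppend( arrFields, objField ) />

<!--- First pass: overwrite the parsed value of the Q field with our own criteria. --->
<cfloop index="intField" from="1" to="#ArrayLen( arrFields )#">
	<cfif StructKeyExists( arrFields[ intField ].Attributes, "Name" ) AND
		(arrFields[ intField ].Attributes.Name EQ "q")>
		<cfset arrFields[ intField ].Attributes.Value = "kittens" />
	</cfif>
</cfloop>

<!--- Second pass: every named field with a value is now echoed back the same way. --->
<cfhttp url="http://www.example.com/search" method="post" result="objResponse">
	<cfloop index="intField" from="1" to="#ArrayLen( arrFields )#">
		<cfif StructKeyExists( arrFields[ intField ].Attributes, "Name" ) AND
			StructKeyExists( arrFields[ intField ].Attributes, "Value" )>
			<cfhttpparam
				type="formfield"
				name="#arrFields[ intField ].Attributes.Name#"
				value="#arrFields[ intField ].Attributes.Value#"
				/>
		</cfif>
	</cfloop>
</cfhttp>
```

The trade-off is an extra pass over the fields in exchange for a submission loop with no special cases.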
When we run the above code, we output the returned Flickr.com data directly into our page so that it will render properly:
[Image: Flickr.com search results rendered from the CFHttp response, with the echoed search criteria pointed out]
Worked like a charm. I have pointed out where our search criteria is echoed back in the Flickr.com form. And, again, things get more complicated if you want to really deal with Select and Textarea inputs. But, for a simple demo like this, I wanted to try and keep it as simple as possible.
Comments (8)
I know the color coding is getting messed up (from the HTML tags in the quoted arguments). I am working on fixing that. Thanks for your patience.
Posted by Ben Nadel on Jun 18, 2007 at 3:18 PM
Quick question on your regex.
Is that going to capture the tags that are using single quotes? From a quick read it looks like you're just testing for double quotes. But then again, sometimes when I read some of your regex it makes my poor little brain hurt. :\
Posted by Dustin on Jun 18, 2007 at 3:54 PM
@Dustin,
You are right. I did not check for single quotes. I totally forgot that people even use them :) I think you could update part of the regex:
(?:""[^""]*""|[^\s>]*)
To be:
(?:""[^""]*""|'[^']*'||[^\s>]*)
.... at least I think. This should handle both types of quotes.
Posted by Ben Nadel on Jun 18, 2007 at 3:58 PM
fun with regex IDE
(?:""[^""]*""|'[^']*'||[^\s>]*)
when I drop your single quote or double quote regex into Expresso and click to the analyzer it crashes! woohoo. . .probably just a typo in the regex. . .
Posted by macbuoy on Jun 18, 2007 at 5:50 PM
I accidentally put a double pipe in there :
||
It might be messing it up. The double pipe should just be single pipe:
|
Posted by Ben Nadel on Jun 18, 2007 at 5:54 PM
Oh, and also, I have double quotes ("") as an escaped quote within the ColdFusion code. If you run this in a RegEx engine, you don't need to escape the quotes:
"" becomes just "
Posted by Ben Nadel on Jun 18, 2007 at 5:55 PM
is there a particular reason why you keep redundant information in the array shown by cfdump ? I can't see the need for any of the 'HTML' fields.
Posted by Jax on Jun 19, 2007 at 3:04 AM
@Jax,
I had it in as a debugging mechanism as I was building the script. And then, I just left it in. But you are correct, it does not serve a real purpose. I suppose if you were messing with AJAXy type stuff, you could use it for some innerHTML work, but that was not my intent.
Posted by Ben Nadel on Jun 19, 2007 at 7:13 AM