Ben Nadel at cf.Objective() 2017 (Washington, D.C.) with: Valerie Poreaux ( @valerieporeaux )

Using ColdFusion To Capture Form Data And Then Submitting The Form


Earlier today, I talked about helping someone grab HTML form data and then resubmit it along with the existing form data. As an introduction to that, I talked about taking an HTML tag and parsing it into a ColdFusion structure. Now, we are going to build on that and actually grab the forms out of a page, parse the inputs, and resubmit the data with a combination of existing form fields and our own form field data. This demo does not cover all aspects of form scraping, nor does it cover maintaining sessions across CFHttp calls, but it should be sufficient to give some direction.

As a reminder from the previous post, we are going to use the ColdFusion user-defined function, ParseHTMLTag(), to take HTML tag data and create a ColdFusion structure:

<cffunction
	name="ParseHTMLTag"
	access="public"
	returntype="struct"
	output="false"
	hint="Parses the given HTML tag into a ColdFusion struct.">

	<!--- Define arguments. --->
	<cfargument
		name="HTML"
		type="string"
		required="true"
		hint="The raw HTML for the tag."
		/>

	<!--- Define the local scope. --->
	<cfset var LOCAL = StructNew() />

	<!--- Create a structure for the target tag data. --->
	<cfset LOCAL.Tag = StructNew() />

	<!--- Store the raw HTML into the tag. --->
	<cfset LOCAL.Tag.HTML = ARGUMENTS.HTML />

	<!--- Set a default name. --->
	<cfset LOCAL.Tag.Name = "" />

	<!---
		Create a structure for the attributes. Each
		attribute will be stored by its name.
	--->
	<cfset LOCAL.Tag.Attributes = StructNew() />


	<!---
		Create a pattern to find the tag name. While it
		might seem overkill to create a pattern just to
		find the name, I find it easier than dealing with
		token / list delimiters.
	--->
	<cfset LOCAL.NamePattern = CreateObject(
		"java",
		"java.util.regex.Pattern"
		).Compile(
			"^<(\w+)"
			)
		/>

	<!--- Get the matcher for this pattern. --->
	<cfset LOCAL.NameMatcher = LOCAL.NamePattern.Matcher(
		ARGUMENTS.HTML
		) />

	<!---
		Check to see if we found the tag. We know there
		can only be ONE tag name, so using an IF statement
		rather than a conditional loop will help save us
		processing time.
	--->
	<cfif LOCAL.NameMatcher.Find()>

		<!--- Store the tag name in all upper case. --->
		<cfset LOCAL.Tag.Name = UCase(
			LOCAL.NameMatcher.Group( 1 )
			) />

	</cfif>


	<!---
		Now that we have a tag name, let's find the
		attributes of the tag. Remember, attributes may
		or may not have quotes around their values. Also,
		some attributes (while not XHTML compliant) might
		not even have a value associated with it (ex.
		disabled, readonly).
	--->
	<cfset LOCAL.AttributePattern = CreateObject(
		"java",
		"java.util.regex.Pattern"
		).Compile(
			"\s+(\w+)(?:\s*=\s*(""[^""]*""|[^\s>]*))?"
			)
		/>

	<!--- Get the matcher for the attribute pattern. --->
	<cfset LOCAL.AttributeMatcher = LOCAL.AttributePattern.Matcher(
		ARGUMENTS.HTML
		) />


	<!---
		Keep looping over the attributes while we
		have more to match.
	--->
	<cfloop condition="LOCAL.AttributeMatcher.Find()">

		<!--- Grab the attribute name. --->
		<cfset LOCAL.Name = LOCAL.AttributeMatcher.Group( 1 ) />

		<!---
			Create an entry for the attribute in our attributes
			structure. By default, just set it to the empty string.
			For attributes that do not have a value, we are just
			going to have to store this empty string.
		--->
		<cfset LOCAL.Tag.Attributes[ LOCAL.Name ] = "" />

		<!---
			Get the attribute value. Save this into a scoped
			variable because this might return a NULL value
			(if the group in our name-value pattern failed
			to match).
		--->
		<cfset LOCAL.Value = LOCAL.AttributeMatcher.Group( 2 ) />

		<!---
			Check to see if we still have the value. If the
			group failed to match then the above would have
			returned NULL and destroyed our variable.
		--->
		<cfif StructKeyExists( LOCAL, "Value" )>

			<!---
				We found the attribute. Now, just remove any
				leading or trailing quotes. This way, our values
				will be consistent if the tag used quoted or
				non-quoted attributes.
			--->
			<cfset LOCAL.Value = LOCAL.Value.ReplaceAll(
				"^""|""$",
				""
				) />

			<!---
				Store the value into the attribute entry back
				into our attributes structure (overwriting the
				default empty string).
			--->
			<cfset LOCAL.Tag.Attributes[ LOCAL.Name ] = LOCAL.Value />

		</cfif>

	</cfloop>


	<!--- Return the tag. --->
	<cfreturn LOCAL.Tag />
</cffunction>

Now, we are going to use that function from within a new ColdFusion user defined function, GetPageForms(). This function will iterate over the Forms in a target page (given the URL or its actual HTML content) and will parse each form into a ColdFusion structure then return each form object in an array:

<cffunction
	name="GetPageForms"
	access="public"
	returntype="array"
	output="false"
	hint="Takes a URL or page content and parses the forms and form fields.">

	<!--- Define arguments. --->
	<cfargument
		name="HTML"
		type="string"
		required="true"
		hint="Page HTML or URL to page with the target HTML."
		/>

	<!--- Define the local scope. --->
	<cfset var LOCAL = StructNew() />


	<!---
		Check to see if we are dealing with page content or
		a target url. For our purposes, if the text is a valid
		URL then we are going to assume that this is NOT the
		page data.
	--->
	<cfif IsValid( "url", ARGUMENTS.HTML )>

		<!--- We are going to grab the URL file content. --->
		<cfhttp
			url="#ARGUMENTS.HTML#"
			method="get"
			useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4"
			resolveurl="true"
			result="LOCAL.HttpGet">

			<!---
				Pass in the referrer. This is just to help
				ensure that we are served the proper page data.
			--->
			<cfhttpparam
				type="CGI"
				name="referer"
				value="#GetDirectoryFromPath( ARGUMENTS.HTML )#"
				/>

		</cfhttp>


		<!---
			Store the returned file content back into our
			HTML argument so that we can treat it uniformly
			going forward.
		--->
		<cfset ARGUMENTS.HTML = LOCAL.HttpGet.FileContent />

	</cfif>


	<!---
		ASSERT: At this point, whether we were given page
		content or a URL, we now have page HTML in our
		HTML argument. Note that this HTML may or may not
		have come from a valid (200 OK) response.
	--->


	<!---
		Create our return array to hold the form data.
		Each form found on the page will be a different
		index in this array.
	--->
	<cfset LOCAL.Forms = ArrayNew( 1 ) />


	<!---
		Create a pattern to search for the forms. This
		will start with the open form tag, then grab all
		the content before the close form tag, and then the
		close form tag.
	--->
	<cfset LOCAL.FormPattern = CreateObject(
		"java",
		"java.util.regex.Pattern"
		).Compile(
			<!--- Open form tag. --->
			"(?i)(<form" &

			<!--- Form tag attributes. --->
			"(?:\s+\w+(?:\s*=\s*(?:""[^""]*""|[^\s>]*))?)*" &

			<!--- Close bracket of form tag. --->
			"[^>]*>)" &

			<!---
				Form content. Here, we are doing a non-greedy
				search for any character until we match the
				close form tag.
			--->
			"([\w\W]*?)" &

			<!--- Close form tag. --->
			"</form[^>]*>"
			)
		/>

	<!--- Get the matcher for our form pattern. --->
	<cfset LOCAL.FormMatcher = LOCAL.FormPattern.Matcher(
		ARGUMENTS.HTML
		) />


	<!---
		Keep looping over the form matcher while there
		are forms to parse in the target HTML.
	--->
	<cfloop condition="LOCAL.FormMatcher.Find()">

		<!---
			Create a structure to store this form instance. We
			are going to capture the form tag information, the
			raw form content and the form inputs.
		--->
		<cfset LOCAL.Form = StructNew() />

		<!--- Create an array to capture the inputs. --->
		<cfset LOCAL.Form.Fields = ArrayNew( 1 ) />

		<!--- Parse the form tag data. --->
		<cfset LOCAL.Form.Tag = ParseHTMLTag(
			LOCAL.FormMatcher.Group( 1 )
			) />

		<!--- Store the raw content. --->
		<cfset LOCAL.Form.HTML = LOCAL.FormMatcher.Group() />


		<!---
			Now, let's find the inputs. These are not just the
			INPUT tags, but also textareas and select fields.
			Create a pattern to find the field tags. Now, the
			selects and the textareas are not going to have
			such nice name and value attributes (like hidden
			form fields do), but to keep this simple, I am just
			going to grab the open tags for these form fields.
		--->
		<cfset LOCAL.FieldPattern = CreateObject(
			"java",
			"java.util.regex.Pattern"
			).Compile(
				<!--- The tag name. --->
				"(?i)<(input|select|textarea)" &

				<!--- The tag attributes. --->
				"(?:\s+\w+(?:\s*=\s*(?:""[^""]*""|[^\s>]*))?)*" &

				<!--- The close tag. --->
				"[^>]*>"
				)
			/>

		<!--- Get the pattern matcher for the form fields. --->
		<cfset LOCAL.FieldMatcher = LOCAL.FieldPattern.Matcher(
			LOCAL.Form.HTML
			) />

		<!---
			Keep looping over the field matcher while there
			are inputs left to parse in the target form.
		--->
		<cfloop condition="LOCAL.FieldMatcher.Find()">

			<!---
				Add this input to the array. As we add
				this field entry, parse the HTML tag into a
				ColdFusion structure.
			--->
			<cfset ArrayAppend(
				LOCAL.Form.Fields,
				ParseHTMLTag(
					LOCAL.FieldMatcher.Group( 0 )
					)
				) />

		</cfloop>


		<!---
			Now that we have captured all the information
			about this form that we can, add this form to
			the results array.
		--->
		<cfset ArrayAppend( LOCAL.Forms, LOCAL.Form ) />

	</cfloop>


	<!--- Return the form data. --->
	<cfreturn LOCAL.Forms />
</cffunction>

There's a lot going on in that function. Basically, it creates patterns for both the form tags and the nested form field tags and then uses ParseHTMLTag() to parse each into a usable ColdFusion structure. The algorithm will parse Select and Textarea tags, but these are not as easy to use. For our purposes, in order to keep this demo as simple as possible, we are just going to grab the opening tag of the Select and Textarea inputs. As it turns out, this won't cause too many problems as we are going to demo this on a form that only has input fields and buttons.

And, for that demo, we are going to submit a keyword search on the Flickr.com homepage. To start off, let's just grab the forms off of Flickr.com using our GetPageForms() method. This method can take either a URL or actual HTML page content. Since we need to get the page content anyway, we might as well just send in the Flickr.com url:

<!---
	Let's get the form off of the Flickr.com homepage.
	This should be the search form. We could do the CFHttp
	ourselves, but the GetPageForms() function will do this
	for us if we pass in a URL (instead of page content).
--->
<cfset arrForms = GetPageForms(
	"http://www.flickr.com/"
	) />


<!--- Dump out the Flickr.com form data. --->
<cfdump
	var="#arrForms#"
	label="Flickr.com Form Data"
	/>

When we run that, GetPageForms() performs a CFHttp request to get the Flickr.com page data, parses the resultant page content, and returns an array of the Form objects on that page. We get the following CFDump output:

Flickr.com Form Scraping With ColdFusion

As you can see, the Flickr.com homepage search form is quite simple; it has the search button, the search criteria, and a small form tag. Now that we have that, we are going to mimic the form submission using ColdFusion's CFHttp tag. We have to be careful when doing this; the field we really care about is the "Q" field, for the search criteria. We don't want to end up submitting this twice, so when we mimic our form fields using CFHttpParam, we have to be careful to customize that one, rather than just echoing it back.

<!---
	Get the form that we are referring to. In theory, we could
	have multiple forms here, but I know that it is the first
	one, so that is the one I am going to grab.
--->
<cfset objForm = arrForms[ 1 ] />


<!---
	Post the search request to Flickr.com. When we do this,
	we are going to post the fields that Flickr.com already
	has there, but instead of echoing back the existing
	search criteria, we are going to send our own value.
--->
<cfhttp
	url="#objForm.Tag.Attributes.Action#"
	method="#objForm.Tag.Attributes.Method#"
	useragent="#CGI.http_user_agent#"
	resolveurl="true"
	result="objGet">

	<!---
		Now, let's loop over the form fields that we got
		back in our form data.
	--->
	<cfloop
		index="intField"
		from="1"
		to="#ArrayLen( objForm.Fields )#"
		step="1">

		<!--- Get a short hand to the current field. --->
		<cfset objField = objForm.Fields[ intField ] />

		<!---
			Check to see if we are dealing with a form field
			of some kind (remember, we might be dealing with
			input, selects, or textareas). Not all of those
			will have name and value, but for our purposes
			and to keep this demo simple, I am just going to
			include the hidden and standard inputs.
		--->
		<cfif (
			StructKeyExists( objField.Attributes, "Name" ) AND
			StructKeyExists( objField.Attributes, "Value" )
			)>

			<!---
				We want to include the form fields, however,
				if this is the "Q" input, then we want to put
				in our own data, not the existing form data.
			--->
			<cfif (objField.Attributes.Name EQ "q")>

				<!---
					For the Q field, we are going to send in
					our own search criteria.
				--->
				<cfhttpparam
					type="FORMFIELD"
					name="q"
					value="Sexy Smile"
					/>

			<cfelse>

				<!---
					Just include the form field as it already
					existed in the returned form data.
				--->
				<cfhttpparam
					type="FORMFIELD"
					name="#objField.Attributes.Name#"
					value="#objField.Attributes.Value#"
					/>

			</cfif>

		</cfif>

	</cfloop>

</cfhttp>


<!---
	Now, assuming that everything worked properly, we should
	have the resultant page content in the FileContent
	variable of our CFHttp result. Let's output it, rather
	than CFDumping it so that we can actually see the
	displayed content.
--->
<cfoutput>#objGet.FileContent#</cfoutput>

As you can see, we are merely looping over the form fields returned from the GetPageForms() and echoing them back in our form submission. It is slightly complicated because we have to treat the Q field specially. However, we could have possibly simplified the process by actually altering the form structure data before we looped over it (update the objField.Attributes.Value attribute for the Q field before we iterated over the fields); then, we could have just treated all the form fields uniformly.
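Here is a minimal sketch of that alternative. The SetFormFieldValue() function is a hypothetical helper (it is not part of the code above), and it assumes the form structure has the same shape that GetPageForms() returns: a Fields array in which each item has an Attributes struct. Because ColdFusion passes structs by reference, the update is made in place:

```cfm
<cffunction
	name="SetFormFieldValue"
	access="public"
	returntype="void"
	output="false"
	hint="Hypothetical helper: sets the Value attribute of every parsed field with the given name.">

	<!--- Define arguments. --->
	<cfargument name="FormData" type="struct" required="true" />
	<cfargument name="Name" type="string" required="true" />
	<cfargument name="Value" type="string" required="true" />

	<!--- Define the local scope. --->
	<cfset var LOCAL = StructNew() />

	<!--- Loop over the parsed fields of the form. --->
	<cfloop
		index="LOCAL.Index"
		from="1"
		to="#ArrayLen( ARGUMENTS.FormData.Fields )#"
		step="1">

		<!--- Get a reference (not a copy) to the attributes. --->
		<cfset LOCAL.Attributes = ARGUMENTS.FormData.Fields[ LOCAL.Index ].Attributes />

		<!---
			If this is the target field, overwrite its value.
			EQ is not case sensitive, so "q" and "Q" both match.
		--->
		<cfif (
			StructKeyExists( LOCAL.Attributes, "Name" ) AND
			(LOCAL.Attributes.Name EQ ARGUMENTS.Name)
			)>

			<cfset LOCAL.Attributes.Value = ARGUMENTS.Value />

		</cfif>

	</cfloop>

	<cfreturn />
</cffunction>
```

With something like that in place, calling SetFormFieldValue( objForm, "q", "Sexy Smile" ) before the CFHttp tag should let the inner CFIF / CFELSE collapse into the single generic CFHttpParam. As a side effect, the field gets a Value key even if the original tag had none, so it will pass the Name / Value check in the loop.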

When we run the above code, we output the returned Flickr.com data directly into our page so that it will render properly:

Flickr.com Form Data Submission With ColdFusion

Worked like a charm. I have pointed out where our search criteria is echoed back in the Flickr.com form. And, again, things get more complicated if you want to really deal with Select and Textarea inputs. But, for a simple demo like this, I wanted to try and keep it as simple as possible.
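To give a sense of what handling a Textarea might involve, here is a rough sketch. GetTextareaFields() is a hypothetical function, not something from this post; it assumes the name attribute is double-quoted and that no attribute value contains a ">" character, and it returns the inner text as-is, without decoding HTML entities (which a browser would do before submitting). Select fields would need a similar pass over their Option tags.

```cfm
<cffunction
	name="GetTextareaFields"
	access="public"
	returntype="array"
	output="false"
	hint="Hypothetical helper: returns the name and inner text of each named textarea.">

	<!--- Define arguments. --->
	<cfargument name="HTML" type="string" required="true" />

	<!--- Define the local scope. --->
	<cfset var LOCAL = StructNew() />

	<!--- Create the return array. --->
	<cfset LOCAL.Fields = ArrayNew( 1 ) />

	<!---
		Group 1 is the raw attribute text of the open tag;
		group 2 is everything up to the close tag.
	--->
	<cfset LOCAL.FieldPattern = CreateObject(
		"java",
		"java.util.regex.Pattern"
		).Compile(
			"(?i)<textarea((?:\s[^>]*)?)>([\w\W]*?)</textarea\s*>"
			)
		/>

	<!--- ASSUMPTION: the name attribute uses double quotes. --->
	<cfset LOCAL.NamePattern = CreateObject(
		"java",
		"java.util.regex.Pattern"
		).Compile(
			"(?i)\sname\s*=\s*""([^""]*)"""
			)
		/>

	<cfset LOCAL.FieldMatcher = LOCAL.FieldPattern.Matcher(
		ARGUMENTS.HTML
		) />

	<!--- Keep looping while there are textareas to match. --->
	<cfloop condition="LOCAL.FieldMatcher.Find()">

		<cfset LOCAL.NameMatcher = LOCAL.NamePattern.Matcher(
			LOCAL.FieldMatcher.Group( 1 )
			) />

		<!--- Skip textareas that have no name attribute. --->
		<cfif LOCAL.NameMatcher.Find()>

			<cfset LOCAL.Field = StructNew() />
			<cfset LOCAL.Field.Name = LOCAL.NameMatcher.Group( 1 ) />
			<cfset LOCAL.Field.Value = LOCAL.FieldMatcher.Group( 2 ) />

			<cfset ArrayAppend( LOCAL.Fields, LOCAL.Field ) />

		</cfif>

	</cfloop>

	<!--- Return the name / value pairs. --->
	<cfreturn LOCAL.Fields />
</cffunction>
```

Each returned item could then be sent with its own CFHttpParam, the same way the Input fields are sent above.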

Want to use code from this post? Check out the license.

Reader Comments

15,640 Comments

I know the color coding is getting messed up (from the HTML tags in the quoted arguments). I am working on fixing that. Thanks for your patience.

42 Comments

Quick question on your regex.

Is that going to capture the tags that are using single quotes? From a quick read it looks like you're just testing for double quotes. But then again, sometimes when I read some of your regex it makes my poor little brain hurt. :\

15,640 Comments

@Dustin,

You are right. I did not check for single quotes. I totally forgot that people even use them :) I think you could update part of the regex:

(?:""[^""]*""|[^\s>]*)

To be:

(?:""[^""]*""|'[^']*'||[^\s>]*)

.... at least I think. This should handle both types of quotes.

32 Comments

fun with regex IDE

(?:""[^""]*""|'[^']*'||[^\s>]*)

when I drop your single quote or double quote regex into Expresso and click to the analyzer it crashes! woohoo. . .probably just a typo in the regex. . .

15,640 Comments

I accidentally put a double pipe in there :

||

It might be messing it up. The double pipe should just be single pipe:

|

15,640 Comments

Oh, and also, I have double quotes ("") as an escaped quote within the ColdFusion code. If you run this in a RegEx engine, you don't need to escape the quotes:

"" becomes just "

2 Comments

is there a particular reason why you keep redundant information in the array shown by cfdump ? I can't see the need for any of the 'HTML' fields.

15,640 Comments

@Jax,

I had it in as a debugging mechanism as I was building the script. And then, I just left it in. But you are correct, it does not serve a real purpose. I suppose if you were messing with AJAXy type stuff, you could use it for some innerHTML work, but that was not my intent.

2 Comments

Ben - thx for your blogs, this topic is plaguing me right now - and THIS post comes close to my needs - but it seems to have a stumbling point for me...

I want to parse 'whole' files, such as parsing an htm file and look for 'deprecated' tags or some such - I don't want to be 'limited' to finding a '<form' tag...

the java objects are foreign to me and I cant seem to modify your code to get what I need - might you consider doing a broader function for an example?

1 Comments

Great tutorial, but flicker prevents this from working now.

"We're sorry, Flickr doesn't allow embedding within frames."

Ben Nadel