Screen-Scraping Movie Showtimes Off Google.com With ColdFusion

By Ben Nadel

Published 2010-03-26 in ColdFusion — Comments (8)

Yesterday, I experimented with scraping movie showtimes off of the iPhone version of Fandango.com. Today, I wanted to try to do the same thing with the Google.com movie showtimes service. This provides an interesting contrast because the two sites call for very different approaches to the same problem. With Fandango.com, we get XHTML that is so compliant that we can actually parse it into XML and use XPath to query the document. Google, on the other hand, is so conscious about bandwidth usage that they make their HTML as dirty and as incomplete as possible so long as it still renders properly. As such, when we deal with Google's markup, we have to fall back to string parsing and pattern matching rather than DOM querying.

Because I was solving the same problem, I actually wanted to build the same API. So, for this demo, you'll see that the ColdFusion code is almost exactly the same as the code used in the Fandango.com demo:

  
	<!--- Create an instance of the Google movie component. --->
	<cfset google = createObject( "component", "Google" ).init() />

	<!---
		Get the theater information for the showtimes at the
		Regal Union Square Stadium 14 theater TODAY.

		ID: 10dd19bd6f57c7c8 - Regal Union Square Stadium 14
		ID: 14c321fe7754e274 - AMC Empire 25

		NOTE: I had to get the theater ID off the website itself.
	--->
	<cfset theaterInfo = google.getTheaterInfo( "10dd19bd6f57c7c8" ) />

	<!--- Output theater information. --->
	<cfoutput>

		<p>
			<strong>#theaterInfo.title#</strong><br />
		</p>

		<!---
			Loop over the movies to output the movie titles and
			the times they are showing.
		--->
		<cfloop
			index="movie"
			array="#theaterInfo.movies#">

			<p>
				<strong>#movie.title#</strong><br />
				#arrayToList( movie.showtimes, ", " )#
			</p>

		</cfloop>

	</cfoutput>


Pretty much, the only difference here is that I am instantiating a ColdFusion component called "Google" rather than one called "Fandango." Both of these CFCs expose the same public API: a single method, getTheaterInfo(), which returns the same structure in both cases. This is the nicest thing about creating an API: you can change the underlying engine without changing the code that relies on it.
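To make that swap concrete, here is a minimal sketch. It assumes that a Fandango.cfc with the same init() and getTheaterInfo() methods sits alongside Google.cfc; the providerName variable is hypothetical and is not part of either demo. Keep in mind that theater IDs are specific to each service, so the ID would have to change along with the component name.

```cfml
<!--- Hypothetical switch: the name of the component that does the scraping. --->
<cfset providerName = "Google" />

<!--- Both components are assumed to expose init() and getTheaterInfo(). --->
<cfset scraper = createObject( "component", providerName ).init() />

<!--- Theater IDs are provider-specific; this one is the Google ID used above. --->
<cfset theaterInfo = scraper.getTheaterInfo( "10dd19bd6f57c7c8" ) />

<!--- The calling code depends only on the shape of the returned structure. --->
<cfoutput>
	<p>#theaterInfo.title# (#arrayLen( theaterInfo.movies )# movies)</p>
</cfoutput>
```

Nothing after the createObject() call needs to know which site the data came from.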

When we run the above code, we get the following page output:

NOTE: Movie data removed at the request of data owner.

The ColdFusion component that powers this is somewhat less complex than the Fandango one because all the movies are listed on one page. In the Fandango version, I had to make several CFHTTP page requests to gather all of the showtime information; but on Google, it's all right there. Of course, this time, I have to rely on Regular Expression pattern matching rather than XPath; but it's not too much more complex.
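As a rough illustration of the pattern-matching approach, the following sketch runs the same kind of expressions against a small, made-up chunk of markup. The sampleHtml string is an assumption written to resemble unquoted-attribute HTML; it is not copied from Google's actual page.

```cfml
<!--- Hypothetical markup, loosely modeled on unquoted-attribute HTML. --->
<cfset sampleHtml = '<div class=movie><div class=name><a href="/movies?mid=1">Example Movie</a></div><div class=times><span>11:40am</span>&nbsp; <span>2:15pm</span></div></div>' />

<!--- Pull out the chunks by class name; reMatch() returns an array of matches. --->
<cfset nameDiv = reMatch( "<div class=name>.+?</div>", sampleHtml ) />
<cfset timesDiv = reMatch( "<div class=times>.+?</div>", sampleHtml ) />

<!--- Replace every tag with a space, then trim, leaving only the text. --->
<cfset title = trim( reReplace( nameDiv[ 1 ], "</?\w+[^>]*>", " ", "all" ) ) />

<!---
	Same idea for the showtimes, also treating &nbsp; as a separator.
	listToArray() skips empty items by default, so runs of spaces collapse.
--->
<cfset showtimes = listToArray(
	reReplace( timesDiv[ 1 ], "(&nbsp;|</?\w+[^>]*>)", " ", "all" ),
	" "
	) />

<cfoutput>
	<p>#title#: #arrayToList( showtimes, ", " )#</p>
</cfoutput>
```

With this sample input, the title comes out as "Example Movie" and the showtimes array holds the two time strings.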

Google.cfc

  
	<cfcomponent
		output="false"
		hint="I help screen scrape the Google Movie showtimes.">

		<cffunction
			name="init"
			access="public"
			returntype="any"
			output="false"
			hint="I return an initialized component.">

			<!--- Define arguments. --->
			<cfargument
				name="baseURL"
				type="string"
				required="false"
				default="http://google.com/movies"
				hint="I am the base URL for the HTTP requests."
				/>

			<!--- Store properties. --->
			<cfset this.baseURL = arguments.baseURL />

			<!--- Return this object reference. --->
			<cfreturn this />
		</cffunction>

		<cffunction
			name="getTheaterInfo"
			access="public"
			returntype="struct"
			output="false"
			hint="I parse the showtimes for the given theater ID.">

			<!--- Define arguments. --->
			<cfargument
				name="theaterID"
				type="string"
				required="true"
				hint="I am the theater ID used by Google."
				/>

			<!--- Define the local scope. --->
			<cfset var local = {} />

			<!--- Define the theater structure. --->
			<cfset local.theaterInfo = {
				id = arguments.theaterID,
				title = "",
				movies = []
				} />

			<!---
				Grab the HTML off of the Google web page. With Google,
				you typically have to send some sort of User Agent
				because it will block a lot of user agents that it
				considers "bots."
			--->
			<cfhttp
				result="local.googleGet"
				method="get"
				url="#this.baseURL#?tid=#arguments.theaterID#"
				useragent="Mozilla/BenNadel.com"
				/>

			<!---
				While the HTML of the Google page is horrendously
				incomplete, it is thankfully well Classed enough to
				make string parsing somewhat straightforward.
			--->

			<!--- Grab the theater title div. --->
			<cfset local.theaterDiv = reMatch(
				"<div class=theater>[\w\W]+?</span>",
				local.googleGet.fileContent
				) />

			<!---
				Get the theater title by stripping out all tags from
				the theater DIV. There is an H2 in there somewhere that
				has our theater name.
			--->
			<cfset local.theaterInfo.title = trim(
				reReplace(
					local.theaterDiv[ 1 ],
					"(&nbsp;|</?\w+[^>]*>)",
					" ",
					"all"
					)
				) />

			<!---
				Each movie is wrapped in a "movie" DIV that we can
				extract with some regular expression matching.
			--->
			<cfset local.movieDivs = reMatch(
				"<div class=movie>(?:\s|<(\w+)[^>]*>.+?</\1>)+",
				local.googleGet.fileContent
				) />

			<!---
				At this point, we have chunks of strings that contain
				the movie data. Now, we have to loop over each one and
				parse the details.
			--->
			<cfloop
				index="local.movieDiv"
				array="#local.movieDivs#">

				<!--- Parse out the movie name DIV. --->
				<cfset local.nameDiv = reMatch(
					"<div class=name>.+?</div>",
					local.movieDiv
					) />

				<!--- Parse out the showtimes DIV. --->
				<cfset local.showtimesDiv = reMatch(
					"<div class=times>.+?</div>",
					local.movieDiv
					) />

				<!---
					Create a movie struct from the parsed DIVs. For
					this, we are basically going to take the parsed
					DIVs and strip out all tags, leaving just the
					textual data.
				--->
				<cfset local.movie = {
					title = trim(
						reReplace(
							local.nameDiv[ 1 ],
							"</?\w+[^>]*>",
							" ",
							"all"
							)
						),
					showtimes = listToArray(
						reReplace(
							local.showtimesDiv[ 1 ],
							"(&nbsp;|</?\w+[^>]*>)",
							" ",
							"all"
							),
						" "
						)
					} />

				<!--- Append the movie to the ongoing collection. --->
				<cfset arrayAppend(
					local.theaterInfo.movies,
					local.movie
					) />

			</cfloop>

			<!--- Return the result. --->
			<cfreturn local.theaterInfo />
		</cffunction>

	</cfcomponent>


As you can see, this version of the showtimes screen scraper relies entirely on reMatch() rather than xmlSearch(). But, just because this version approaches the problem in a different way, it doesn't mean that it is any less susceptible to problems. In either case, we are still depending on the predictable structure of a 3rd party page that we do not control. If that structure changes without notice, whether we use XML parsing or string pattern matching, our code might very well break.
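One way to soften that kind of breakage is to check that a pattern actually matched before reaching for the first element. The sketch below is an assumption about how such a guard could look, not part of the component above; the sample html string and the "Google.MarkupChanged" exception type are both made up for illustration.

```cfml
<!--- Hypothetical response in which the expected "theater" DIV is missing. --->
<cfset html = "<div class=cinema><h2>Example Theater</h2></div>" />

<!--- reMatch() returns an empty array when nothing matches. --->
<cfset theaterDiv = reMatch( "<div class=theater>[\w\W]+?</span>", html ) />

<cfif arrayLen( theaterDiv )>

	<!--- The markup looks as expected, so strip the tags as usual. --->
	<cfset title = trim(
		reReplace( theaterDiv[ 1 ], "(&nbsp;|</?\w+[^>]*>)", " ", "all" )
		) />

<cfelse>

	<!--- Fail loudly with a custom (hypothetical) exception type. --->
	<cfthrow
		type="Google.MarkupChanged"
		message="Could not find the theater DIV."
		detail="The page structure may have changed."
		/>

</cfif>
```

Without the guard, indexing into the empty array would raise a less descriptive error; with it, the calling code can catch one specific exception type and react accordingly.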

In the long term, Google's markup, while significantly incomplete, seems to be easier to work with simply because it's all on one page and has better CSS class hooks (for pattern matching). If I am gonna play around more with screen scraping movie showtimes, I'll probably be using this service to do so.

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/1884

Reader Comments

scott Apr 12, 2010 at 10:10 AM

10 Comments

Ben,
I have been trying to figure out an approach to screen-scraping for the website we build to aggregate event information around the state. Trying to get everyone to update their information is always an issue.

I'm not too swift when it comes to the whole issue of screen-scraping, then moving this data to a database. I've had limited success and it is different for each site.

What I would like to do is pull down event information from numerous sites, dump the information into a database, then upload it to our site. Is there an approach to this that is feasible? Am I just thinking too much and not working hard enough?

thanks.

Ben Nadel Apr 13, 2010 at 8:09 AM

16,020 Comments

@Scott,

Screen-scraping is never the "right" solution; however, sometimes it is the "only" solution available. When I was working on Skin-Spider waaaay back in the day (it screen-scraped adult content), what I did was create a uniform CFC interface for the concept of screen scraping. Then, I created a separate CFC for each target website that upheld the "scraping interface", but internally was set up specially for that site (based on its HTML and what not).

It's not an easy approach, for sure; and, it is likely to break if / whenever they change the markup. But, if it's all you've got, abstracting it out into individual CFCs is really beneficial.

Also, if you are really serious about this, it can be a godsend to run the HTML through an "XHTML cleaner" first such that you can actually use xmlSearch(). That's what I was using TagSoup for a while back:

www.bennadel.com/blog/1723-Parsing-Invalid-HTML-Into-XML-Using-ColdFusion-Groovy-And-TagSoup.htm

This is a more complex solution since it used Groovy to load the JAR, which was then used to clean the HTML, which was then used by ColdFusion / xmlSearch. But, if you look, once you do that, you can treat the target HTML like it is XML, which makes scraping MUCH easier.

Ben Nadel Apr 13, 2010 at 8:14 AM

16,020 Comments

@Scott,

You might also look into YQL (Yahoo Query Language). They have some serious support for screen-scraping that I think does all of the XML/XHTML cleaning for you. I haven't looked into that much though.

scott Apr 13, 2010 at 12:08 PM

10 Comments

@Ben Nadel,

Ben, thanks for the info on screen-scraping. Not something I want to think about, but I may take a shot at a few of the sites to see how it all works. Thanks again.

Ben Nadel Apr 15, 2010 at 10:42 PM

16,020 Comments

@Scott,

No problem my man. If you hit any walls, drop a note here.

Royce Mar 24, 2011 at 12:53 PM

3 Comments

Hi there, I'm not sure what happened to your blog about the fandango site (I can't get to it any longer) so I'm posting this note on this blog entry.

Fandango seems to have changed their format drastically in the past few days. You may want to check it out and update and repost your blog about it.

regards,
Royce

Ben Nadel Mar 24, 2011 at 8:36 PM

16,020 Comments

@Royce,

Fandango sent me a "cease and desist" order. Apparently my blog post violated the part of their Terms of Service that prevented me from "facilitating the unauthorized use" of their data. Oh well :)

Royce Mar 24, 2011 at 11:24 PM

3 Comments

Oh well, I guess it was too good to last. In the spirit of the internet, you should display the C&D letter where your old blog post was.

I've been looking for a reason to move to The Movie DB (http://api.themoviedb.org/2.1/) API, so I guess this is it.

Buh bye Fandango, I guess my sending folks your way to buy tickets wasn't good enough for ya!
