
Screen-Scraping Movie Showtimes Off Google.com With ColdFusion

By Ben Nadel
Tags: ColdFusion

Yesterday, I experimented with scraping movie showtimes off of the iPhone version of Fandango.com. Today, I wanted to try to do the same thing with the Google.com movie showtimes service. This provides an interesting contrast because the two sites call for very different approaches to the same problem. With Fandango.com, we get XHTML that is so compliant that we can actually parse it into XML and use XPath to query the document. Google, on the other hand, is so conscious about bandwidth usage that they make their HTML as dirty and as incomplete as possible so long as it still renders properly. As such, when we deal with Google's markup, we have to fall back to string parsing and pattern matching rather than DOM querying.

Because I was solving the same problem, I actually wanted to build the same API. So, for this demo, you'll see that the ColdFusion code is almost exactly the same as the code used in the Fandango.com demo:

    <!--- Create an instance of the Google movie component. --->
    <cfset google = createObject( "component", "Google" ).init() />

    <!---
        Get the theater information for the showtimes at the
        Regal Union Square Stadium 14 theater TODAY.

        ID: 10dd19bd6f57c7c8 - Regal Union Square Stadium 14
        ID: 14c321fe7754e274 - AMC Empire 25

        NOTE: I had to get the theater ID off the website itself.
    --->
    <cfset theaterInfo = google.getTheaterInfo( "10dd19bd6f57c7c8" ) />


    <!--- Output theater information. --->
    <cfoutput>

        <p>
            <strong>#theaterInfo.title#</strong><br />
        </p>

        <!---
            Loop over the movies to output the movie titles and
            the times they are showing.
        --->
        <cfloop
            index="movie"
            array="#theaterInfo.movies#">

            <p>
                <strong>#movie.title#</strong><br />
                #arrayToList( movie.showtimes, ", " )#
            </p>

        </cfloop>

    </cfoutput>

Pretty much the only difference here is that I am instantiating a ColdFusion component called "Google" rather than one called "Fandango." Both of these CFCs expose the same public method, getTheaterInfo(), and it returns the same structure in both cases. This is the nicest thing about creating an API - you can change the underlying engine without changing the code that relies on it.
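To make that interchangeability concrete, here is a minimal sketch in which the calling code picks the scraper by name. The providerName variable is hypothetical (it is not part of the original demo), and the sketch assumes that both Google.cfc and Fandango.cfc live next to the calling template. Note that the theater ID is still service-specific, so the ID passed in would have to match whichever engine is selected.

```cfml
<!--- Hypothetical setting: change "Google" to "Fandango" to swap engines. --->
<cfset providerName = "Google" />

<!--- Both components are assumed to expose init() and getTheaterInfo(). --->
<cfset scraper = createObject( "component", providerName ).init() />

<!---
	The theater ID below is the Google one from the demo; a different
	engine would need its own ID for the same theater.
--->
<cfset theaterInfo = scraper.getTheaterInfo( "10dd19bd6f57c7c8" ) />

<!--- From here on, the output code does not care which engine ran. --->
<cfoutput>#theaterInfo.title#</cfoutput>
```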

When we run the above code, we get the following page output:

NOTE: Movie data removed at the request of data owner.

The ColdFusion component that powers this is somewhat less complex than the Fandango one because all the movies are listed on one page. In the Fandango version, I had to make several CFHTTP page requests to gather all of the showtime information; but on Google, it's all right there. Of course, this time, I have to rely on regular expression pattern matching rather than XPath, but it's not much more complex.
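As a rough illustration of the kind of pattern matching involved, the sketch below strips the tags from a small fragment of markup. The fragment is made up for this example (it only imitates the unquoted-attribute style described above and is not copied from Google's page), and the variable names are hypothetical.

```cfml
<!--- Hypothetical markup fragment; not taken from the real page. --->
<cfset sampleDiv = "<div class=name><a href=/movies?mid=123>Example Movie</a></div>" />

<!---
	Replace every opening or closing tag with a space, then trim
	the result so that only the text content is left.
--->
<cfset title = trim(
	reReplace( sampleDiv, "</?\w+[^>]*>", " ", "all" )
) />

<cfoutput>#title#</cfoutput>
```

Replacing tags with a space rather than an empty string keeps adjacent pieces of text from running together, which matters when the stripped result is later split on spaces.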

Google.cfc

    <cfcomponent
        output="false"
        hint="I help screen scrape the Google Movie showtimes.">


        <cffunction
            name="init"
            access="public"
            returntype="any"
            output="false"
            hint="I return an initialized component.">

            <!--- Define arguments. --->
            <cfargument
                name="baseURL"
                type="string"
                required="false"
                default="http://google.com/movies"
                hint="I am the base URL for the HTTP requests."
                />

            <!--- Store properties. --->
            <cfset this.baseURL = arguments.baseURL />

            <!--- Return this object reference. --->
            <cfreturn this />
        </cffunction>


        <cffunction
            name="getTheaterInfo"
            access="public"
            returntype="struct"
            output="false"
            hint="I parse the showtimes for the given theater ID.">

            <!--- Define arguments. --->
            <cfargument
                name="theaterID"
                type="string"
                required="true"
                hint="I am the theater ID used by Google."
                />

            <!--- Define the local scope. --->
            <cfset var local = {} />

            <!--- Define the theater structure. --->
            <cfset local.theaterInfo = {
                id = arguments.theaterID,
                title = "",
                movies = []
                } />

            <!---
                Grab the HTML off of the Google web page. With Google,
                you typically have to send some sort of User Agent
                because it will block a lot of user agents that it
                considers "bots."
            --->
            <cfhttp
                result="local.googleGet"
                method="get"
                url="#this.baseURL#?tid=#arguments.theaterID#"
                useragent="Mozilla/BenNadel.com"
                />


            <!---
                While the HTML of the Google page is horrendously
                incomplete, it is thankfully well Classed enough to
                make string parsing somewhat straightforward.
            --->

            <!--- Grab the theater title div. --->
            <cfset local.theaterDiv = reMatch(
                "<div class=theater>[\w\W]+?</span>",
                local.googleGet.fileContent
                ) />

            <!---
                Get the theater title by stripping out all tags from
                the theater DIV. There is an H2 in there somewhere that
                has our theater name.
            --->
            <cfset local.theaterInfo.title = trim(
                reReplace(
                    local.theaterDiv[ 1 ],
                    "(&nbsp;|</?\w+[^>]*>)",
                    " ",
                    "all"
                    )
                ) />


            <!---
                Each movie is wrapped in a "movie" DIV that we can
                extract with some regular expression matching.
            --->
            <cfset local.movieDivs = reMatch(
                "<div class=movie>(?:\s|<(\w+)[^>]*>.+?</\1>)+",
                local.googleGet.fileContent
                ) />

            <!---
                At this point, we have chunks of strings that contain
                the movie data. Now, we have to loop over each one and
                parse the details.
            --->
            <cfloop
                index="local.movieDiv"
                array="#local.movieDivs#">

                <!--- Parse out the movie name DIV. --->
                <cfset local.nameDiv = reMatch(
                    "<div class=name>.+?</div>",
                    local.movieDiv
                    ) />

                <!--- Parse out the showtimes DIV. --->
                <cfset local.showtimesDiv = reMatch(
                    "<div class=times>.+?</div>",
                    local.movieDiv
                    ) />

                <!---
                    Create a movie struct from the parsed DIVs. For
                    this, we are basically going to take the parsed
                    DIVs and strip out all tags, leaving just the
                    textual data.
                --->
                <cfset local.movie = {
                    title = trim(
                        reReplace(
                            local.nameDiv[ 1 ],
                            "</?\w+[^>]*>",
                            " ",
                            "all"
                            )
                        ),
                    showtimes = listToArray(
                        reReplace(
                            local.showtimesDiv[ 1 ],
                            "(&nbsp;|</?\w+[^>]*>)",
                            " ",
                            "all"
                            ),
                        " "
                        )
                    } />

                <!--- Append the movie to the ongoing collection. --->
                <cfset arrayAppend(
                    local.theaterInfo.movies,
                    local.movie
                    ) />

            </cfloop>

            <!--- Return the result. --->
            <cfreturn local.theaterInfo />
        </cffunction>

    </cfcomponent>

As you can see, this version of the showtimes screen scraper relies entirely on reMatch() rather than xmlSearch(). But just because this version approaches the problem in a different way doesn't mean that it is any less susceptible to problems. In either case, we are still depending on the predictable structure of a third-party page that we do not control. If that structure changes without notice, whether we use XML parsing or string pattern matching, our code might very well break.
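One way to make that kind of breakage easier to diagnose is to check that a pattern actually matched before reading the first element of the result. The following is only a sketch of that idea, not part of the component above: the pageContent value and the "Google.UnexpectedMarkup" error type are made up for illustration.

```cfml
<!--- Hypothetical response body that no longer contains the expected DIV. --->
<cfset pageContent = "<html><body>Layout changed</body></html>" />

<!--- Same theater pattern as in the component. --->
<cfset theaterDiv = reMatch(
	"<div class=theater>[\w\W]+?</span>",
	pageContent
	) />

<!---
	reMatch() returns an empty array when nothing matches, so
	reading theaterDiv[ 1 ] directly would raise a generic array
	error. Throwing a descriptive error instead points at the cause.
--->
<cfif NOT arrayLen( theaterDiv )>
	<cfthrow
		type="Google.UnexpectedMarkup"
		message="The theater DIV was not found; the page structure may have changed."
		/>
</cfif>
```

The same guard could be applied to the name and showtimes matches inside the movie loop, where a single malformed movie DIV might be skipped rather than treated as a fatal error.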

In the long term, Google's markup, while significantly incomplete, seems to be easier to work with simply because it's all on one page and has better CSS class hooks (for pattern matching). If I am gonna play around more with screen scraping movie showtimes, I'll probably be using this service to do so.




Reader Comments

Ben,
I have been trying to figure out an approach to screen-scraping for the website we build to aggregate event information around the state. Trying to get everyone to update their information is always an issue.

I'm not too swift when it comes to the whole issue of screen-scraping, then moving this data to a database. I've had limited success and it is different for each site.

What I would like to do is pull down event information from numerous sites, dump the information into a database, then upload it to our site. Is there an approach to this that is feasible? Am I just thinking too much and not working hard enough?

thanks.


@Scott,

Screen-scraping is never the "right" solution; however, sometimes it is the "only" solution available. When I was working on Skin-Spider waaaay back in the day (it screen-scraped adult content), what I did was create a uniform CFC interface for the concept of screen scraping. Then, I created a separate CFC for each target website that upheld the "scraping interface", but internally was set up specially for that site (based on its HTML and what not).

It's not an easy approach, for sure; and, it is likely to break if / whenever they change the markup. But, if it's all you've got, abstracting it out into individual CFCs is really beneficial.

Also, if you are really serious about this, it can be a godsend to run the HTML through an "XHTML cleaner" first such that you can actually use xmlSearch(). That's what I was using TagSoup for a while back:

http://www.bennadel.com/blog/1723-Parsing-Invalid-HTML-Into-XML-Using-ColdFusion-Groovy-And-TagSoup.htm

This is a more complex solution since it used Groovy to load the JAR, which was then used to clean the HTML, which was then used by ColdFusion / xmlSearch. But, if you look, once you do that, you can treat the target HTML like it is XML, which makes scraping MUCH easier.


@Scott,

You might also look into YQL (Yahoo Query Language). They have some serious support for screen-scraping that I think does all of the XML/XHTML cleaning for you. I haven't looked into that much though.


@Ben Nadel,

Ben, thanks for the info on screen-scraping. Not something I want to think about, but I may take a shot at a few of the sites to see how it all works. Thanks again.


Hi there, I'm not sure what happened to your blog about the fandango site (I can't get to it any longer) so I'm posting this note on this blog entry.

Fandango seems to have changed their format drastically in the past few days. You may want to check it out and update and repost your blog about it.

regards,
Royce


@Royce,

Fandango sent me a "cease and desist" order. Apparently my blog post violated the part of their Terms of Service that prevented me from "facilitating the unauthorized use" of their data. Oh well :)


Oh well, I guess it was too good to last. In the spirit of the internet, you should display the C&D letter where your old blog post was.

I've been looking for a reason to move to The Movie DB (http://api.themoviedb.org/2.1/) API, so I guess this is it.

Buh bye Fandango, I guess my sending folks your way to buy tickets wasn't good enough for ya!

