Posted April 9, 2007 at
8:21 AM
Tags:
ColdFusion
Based on the popular game "Six Degrees of Kevin Bacon," I have created a much smaller version for Ben Forta: you enter your domain name, and it finds the chain of blog references that leads to your domain (i.e., Ben Forta references a blog that references a blog that references your blog). Because the web is so huge, I worked with a very small population of blogs: all the blogs "on tap" over on Full As A Goog. Without that limit, I would have NO idea how to even go about creating something like this.
Click here to give it a go (as seen in the screen shot below):
Try some random blogs:
Peter Bell
Sean Corfield
Kay Smoljak
Tony Weeg
Building the application was actually fairly simple - much more so than I thought it would be. What took a long time (I let the spider run over the weekend) was amassing all the blog reference links (finding pages in which one blog refers to another blog). There are over 400 blogs on-tap on Full As A Goog. In order to find all the references, I basically had to create a 400 x 400 grid in which every blog was tested for references to every other blog. To find the references, I used CFHttp and grabbed site-specific search results off of Google.
Two database tables were involved:
forta_web
This was a table that housed the blog URLs spidered off of Full As A Goog:
- id - INT
- url - VARCHAR( 100 )
- search_url - VARCHAR( 100 )
- is_root - TINYINT
The url field was the http url for the blog. The search_url field was the "google-friendly" url that was being searched for. This stripped out HTTP, www, and other URL elements that were too narrowing. is_root was a flag for Ben Forta's web blog.
forta_web_jn
This was the join table that kept track of the blog-to-blog references that were found on Google:
- id - INT
- title - VARCHAR( 500 )
- url - VARCHAR( 500 )
- url_id_1 - INT
- url_id_2 - INT
The title and url fields were the search result elements returned in the Google search results. The url ID fields were the foreign keys referencing the forta_web table.
Step 1: Spidering Full As A Goog
Before I could do anything, I had to grab all of the blog URLs off of Full As A Goog. To do this, I used CFHttp to grab the "on-tap" page. Then I used a Java pattern matcher to find the blog URLs:
- <cfhttp
- url="http://fullasagoog.com/blogsontap.cfm"
- method="GET"
- useragent="#CGI.http_user_agent#"
- result="objHTTP"
- />
-
-
- <cfset objPattern = CreateObject(
- "java",
- "java.util.regex.Pattern"
- ).Compile(
- "(?i)<a class=""cssbtn btnauth"" href=""([^""]+)""><strong> URL"
- ) />
-
- <cfset objMatcher = objPattern.Matcher(
- objHTTP.FileContent
- ) />
-
-
- <cfloop condition="objMatcher.Find()">
-
- <cfset strLink = objMatcher.Group( 1 ).ReplaceFirst(
- "(?i)^https?://(www\.)?([^\\\/]+).*", "$2"
- ).Trim()
- />
-
-
- <cfset strSearchLink = objMatcher.Group( 1 ).ReplaceFirst(
- "(?i)^https?://(www\.)?",
- ""
- ).ReplaceFirst(
- "([\\\/]{1})[^\\\/]+\.[\w]{2,4}$",
- "$1"
- ).ReplaceAll(
- "[^\w]+",
- " "
- ).Trim()
- />
-
-
- <cfif FindNoCase( "forta.com", strLink )>
- <cfset intRoot = 1 />
- <cfelse>
- <cfset intRoot = 0 />
- </cfif>
-
-
- <cfquery name="qInsert" datasource="#REQUEST.DSN.Source#">
- DECLARE
- @id INT,
- @url VARCHAR( 100 ),
- @search_url VARCHAR( 100 ),
- @is_root TINYINT
- ;
-
-
- SET @url = <cfqueryparam value="#strLink#" cfsqltype="CF_SQL_VARCHAR" />;
- SET @search_url = <cfqueryparam value="#strSearchLink#" cfsqltype="CF_SQL_VARCHAR" />;
- SET @is_root = <cfqueryparam value="#intRoot#" cfsqltype="CF_SQL_TINYINT" />;
-
-
- SET @id = ISNULL(
- (
- SELECT
- f.id
- FROM
- forta_web f
- WHERE
- url = @url
- ),
- 0
- );
-
-
- IF (@id = 0)
- BEGIN
-
- INSERT INTO forta_web
- (
- url,
- search_url,
- is_root
- ) VALUES (
- @url,
- @search_url,
- @is_root
- );
-
- END
- </cfquery>
-
- </cfloop>
-
- Done.
Notice that for each blog URL I get two values - the URL and the "Search Url". From some quick trial and error, I found that Google would strip out certain values of a URL when searching for URLs. In order to get better Google search results, I did this as I spidered the blog URLs.
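Since the CFML above hands the string work to Java's own `replaceFirst()` and `replaceAll()`, the two clean-up steps can be tried out in plain Java. This is only a sketch of the same regular expressions; the class name, method names, and the `example.com` address are made up for illustration and are not part of the original application.

```java
class BlogUrlCleaner {

    // Host only: drop the scheme, an optional "www.", and everything after the host.
    public static String toHost(String rawUrl) {
        return rawUrl
            .replaceFirst("(?i)^https?://(www\\.)?([^\\\\/]+).*", "$2")
            .trim();
    }

    // "Search URL": drop the scheme and "www.", drop a trailing file name
    // such as "index.cfm", then turn each run of non-word characters into a space.
    public static String toSearchUrl(String rawUrl) {
        return rawUrl
            .replaceFirst("(?i)^https?://(www\\.)?", "")
            .replaceFirst("([\\\\/]{1})[^\\\\/]+\\.[\\w]{2,4}$", "$1")
            .replaceAll("[^\\w]+", " ")
            .trim();
    }

    public static void main(String[] args) {
        String raw = "http://www.example.com/blog/index.cfm"; // hypothetical blog URL
        System.out.println(toHost(raw));      // prints: example.com
        System.out.println(toSearchUrl(raw)); // prints: example com blog
    }
}
```

The host-only value is what ends up in the url column, and the space-separated value is what ends up in search_url.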
Step 2: Building The Blog-to-Blog References
This was by far the most time-consuming aspect of the experiment. For this, I had to use CFHttp against Google to find all the references from every blog to every other blog. I am not sure if this is the best way to do it, but this was all I could come up with. If I estimate that there are 400 blogs on Full As A Goog, then I had to check for around 160,000 (400 x 400) blog-to-blog references. Yikes!
- <cfsetting
- requesttimeout="350"
- />
-
-
- <cfparam name="URL.id1" type="numeric" default="0" />
- <cfparam name="URL.id2" type="numeric" default="0" />
-
-
- <cfif NOT URL.id1>
-
- <cfquery name="qID1" datasource="#REQUEST.DSN.Source#">
- SELECT
- f.id,
- f.url,
- f.search_url,
- f.is_root
- FROM
- forta_web f
- WHERE
- f.is_root = 1
- </cfquery>
-
- <cfelse>
-
- <cfquery name="qID1" datasource="#REQUEST.DSN.Source#">
- SELECT
- f.id,
- f.url,
- f.search_url,
- f.is_root
- FROM
- forta_web f
- WHERE
- f.id = <cfqueryparam value="#URL.id1#" cfsqltype="CF_SQL_INTEGER" />
- </cfquery>
-
- </cfif>
-
-
- <cfset URL.id1 = Val( qID1.id ) />
-
-
- <cfif NOT URL.id2>
-
- <cfquery name="qID2" datasource="#REQUEST.DSN.Source#">
- SELECT TOP 100
- f.id,
- f.url,
- f.search_url,
- f.is_root
- FROM
- forta_web f
- WHERE
- f.is_root = 0
- AND
- f.id != <cfqueryparam value="#URL.id1#" cfsqltype="CF_SQL_INTEGER" />
- ORDER BY
- f.id ASC
- </cfquery>
-
- <cfelse>
-
- <cfquery name="qID2" datasource="#REQUEST.DSN.Source#">
- SELECT TOP 100
- f.id,
- f.url,
- f.search_url,
- f.is_root
- FROM
- forta_web f
- WHERE
- f.id >= <cfqueryparam value="#URL.id2#" cfsqltype="CF_SQL_INTEGER" />
- AND
- f.id != <cfqueryparam value="#URL.id1#" cfsqltype="CF_SQL_INTEGER" />
- AND
- f.is_root = 0
- ORDER BY
- f.id ASC
- </cfquery>
-
- </cfif>
-
-
- <cfset URL.id2 = Val( qID2.id ) />
-
-
- <cfif (qID1.RecordCount AND qID2.RecordCount)>
-
-
- <cfloop query="qID2">
-
- <cfhttp
- url="http://www.google.com/search?num=10&hl=en&lr=&as_qdr=all&q=site%3A#UrlEncodedFormat( qID1.url )#+%22#UrlEncodedFormat( qID2.search_url )#%22&btnG=Search"
- method="GET"
- useragent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3"
- result="objHTTP"
- />
-
-
- <cfif NOT FindNoCase( "did not match any documents", objHTTP.FileContent )>
-
- <!---
- ... based user agent requests. Not sure why, but we
- can leverage it nonetheless. In this pattern,
- we are matching both the link title (group 2)
- and the link (group 1).
- --->
- <cfset objPattern = CreateObject(
- "java",
- "java.util.regex.Pattern"
- ).Compile(
- "(?i).+?href=""([^""]+)""[^>]*>(.+?)</a>"
- ) />
-
- <cfset objMatcher = objPattern.Matcher(
- objHTTP.FileContent
- ) />
-
-
- <cfloop condition="objMatcher.Find()">
-
- <cfset strLink = objMatcher.Group( 1 ) />
- <cfset strText = objMatcher.Group( 2 ) />
-
-
- <cfquery name="qExists" datasource="#REQUEST.DSN.Source#">
- SELECT
- id
- FROM
- forta_web_jn
- WHERE
- url_id_1 = <cfqueryparam value="#qID1.id#" cfsqltype="CF_SQL_INTEGER" />
- AND
- url_id_2 = <cfqueryparam value="#qID2.id#" cfsqltype="CF_SQL_INTEGER" />
- AND
- LOWER( url ) = <cfqueryparam value="#LCase( strLink )#" cfsqltype="CF_SQL_VARCHAR" />
- </cfquery>
-
-
- <cfif NOT qExists.RecordCount>
-
- <cfquery name="qInsert" datasource="#REQUEST.DSN.Source#">
- INSERT INTO forta_web_jn
- (
- title,
- url,
- url_id_1,
- url_id_2
- ) VALUES (
- <cfqueryparam value="#strText#" cfsqltype="CF_SQL_VARCHAR" />,
- <cfqueryparam value="#strLink#" cfsqltype="CF_SQL_VARCHAR" />,
- <cfqueryparam value="#qID1.id#" cfsqltype="CF_SQL_INTEGER" />,
- <cfqueryparam value="#qID2.id#" cfsqltype="CF_SQL_INTEGER" />
- );
- </cfquery>
-
- Link Inserted
-
- </cfif>
-
- </cfloop>
-
- </cfif>
-
-
- </cfloop>
-
- </cfif>
-
-
- <cfquery name="qNextID2" datasource="#REQUEST.DSN.Source#">
- SELECT TOP 1
- f.id,
- f.url,
- f.search_url
- FROM
- forta_web f
- WHERE
- <cfif qID2.RecordCount>
- f.id > <cfqueryparam value="#ArrayMax( qID2[ 'id' ] )#" cfsqltype="CF_SQL_INTEGER" />
- <cfelse>
- f.id > <cfqueryparam value="#URL.id2#" cfsqltype="CF_SQL_INTEGER" />
- </cfif>
- AND
- f.id != <cfqueryparam value="#URL.id1#" cfsqltype="CF_SQL_INTEGER" />
- AND
- f.is_root = 0
- ORDER BY
- f.id ASC
- </cfquery>
-
-
- <cfif NOT qNextID2.RecordCount>
-
- <cfquery name="qNextID1" datasource="#REQUEST.DSN.Source#">
- SELECT TOP 1
- f.id,
- f.url,
- f.search_url
- FROM
- forta_web f
-
- WHERE
- f.is_root = 0
-
- <cfif Val( qID1.is_root )>
-
- AND
- f.id > 0
-
- <cfelse>
-
- AND
- f.id > <cfqueryparam value="#URL.id1#" cfsqltype="CF_SQL_INTEGER" />
-
- </cfif>
-
- ORDER BY
- f.id ASC
- </cfquery>
-
-
- <cfif qNextID1.RecordCount>
-
- <cfoutput>
-
- <script type="text/javascript">
- setTimeout(
- function(){
- location.href = "#CGI.script_name#?id1=#qNextID1.id#";
- },
- 1500
- );
- </script>
-
- </cfoutput>
-
- <cfelse>
-
- Done.
-
- </cfif>
-
-
- <cfelse>
-
- <cfoutput>
-
- <script type="text/javascript">
- setTimeout(
- function(){
- location.href = "#CGI.script_name#?id1=#URL.id1#&id2=#qNextID2.id#";
- },
- 1500
- );
- </script>
-
- </cfoutput>
-
- </cfif>
Notice that at the end of the page, I refresh using JavaScript setTimeout() calls. There are two reasons for this. First, it gives the server a bit of rest (1.5 seconds) between bouts of processing. Second, I get uncomfortable running CFLocation after CFLocation after CFLocation; something about it just rubs me the wrong way. Plus, I think the browser sometimes doesn't like long redirect chains, and I didn't want the browser killing the refreshes while I wasn't there (remember, I let this run over the weekend).
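To get a feel for the size of the job and the shape of each Google request in Step 2, here is a rough Java sketch. It assumes the query is simply `site:<source host> "<target search url>"`, as in the CFHttp URL above. The class and method names and the `example com` target are hypothetical, and `URLEncoder` does not encode exactly the way ColdFusion's `UrlEncodedFormat()` does (it writes spaces as `+`, for one).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class ReferenceSearch {

    // Ordered (source, target) pairs among n blogs, not counting a blog against itself.
    public static long orderedPairs(int n) {
        return (long) n * (n - 1);
    }

    // Build a site-restricted Google query: site:<source host> "<target search url>"
    public static String buildQuery(String sourceHost, String targetSearchUrl) {
        String q = "site:" + sourceHost + " \"" + targetSearchUrl + "\"";
        return "http://www.google.com/search?num=10&hl=en&q="
            + URLEncoder.encode(q, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(orderedPairs(400)); // prints: 159600
        // prints: http://www.google.com/search?num=10&hl=en&q=site%3Aforta.com+%22example+com%22
        System.out.println(buildQuery("forta.com", "example com"));
    }
}
```

Leaving out the self-pairs gives 159,600 requests for 400 blogs, which is where the "around 160,000" estimate comes from.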
Step 3: Finding The Referential Blog Chain
Finding the blog referral chain proved much easier than I thought it was going to be. We know how many steps we can have (six), we know which blog we need to end with (your blog), and we know which blog we need to start with (Ben Forta's). Finding the chain was as easy as starting with yours and walking backwards until we found Forta's:
- <form action="#CGI.script_name#" method="post">
-
- <h3>
- Enter your Domain:
- </h3>
-
- <p>
- <input
- type="text"
- name="domain"
- value="#HTMLEditFormat( FORM.domain )#"
- size="50"
- />
-
- <input
- type="submit"
- value="Search"
- />
- </p>
-
- </form>
-
-
- <cfif Len( FORM.domain )>
-
- <cfset strCleanDomain = FORM.domain.ReplaceFirst(
- "(?i)^(https?://)?(www\.)?",
- ""
- ).ReplaceFirst(
- "([\\\/]{1})[^\\\/]+\.[\w]{2,4}$",
- "$1"
- ).ReplaceAll(
- "[^\w]+",
- " "
- ).Trim() />
-
-
- <p>
- <em>Searching for "#strCleanDomain#"</em>
- </p>
-
-
- <cfflush />
-
-
- <cfquery name="qTargetDomain" datasource="#REQUEST.DSN.Source#">
- SELECT
- f.id,
- f.url,
- f.search_url,
-
- (
- SELECT TOP 1
- f2.id
- FROM
- forta_web f2
- WHERE
- f2.is_root = 1
- ORDER BY
- f2.id ASC
- ) AS root_id
- FROM
- forta_web f
- WHERE
- f.is_root = 0
- AND
- (
- f.search_url LIKE <cfqueryparam value="%#strCleanDomain#%" cfsqltype="CF_SQL_VARCHAR" />
- OR
- f.url LIKE <cfqueryparam value="%#strCleanDomain#%" cfsqltype="CF_SQL_VARCHAR" />
- )
- </cfquery>
-
-
- <cfif qTargetDomain.RecordCount>
-
-
- <cfset arrPath = ArrayNew( 1 ) />
-
-
- <cfset objNodes = StructNew() />
-
- <cfset objNodes[ qTargetDomain.id ] = StructNew() />
- <cfset objNodes[ qTargetDomain.id ].JoinID = 0 />
- <cfset objNodes[ qTargetDomain.id ].TargetID = 0 />
-
- <cfset ArrayAppend( arrPath, objNodes ) />
-
-
- <cfloop
- index="intDepth"
- from="2"
- to="6"
- step="1">
-
-
- <cfset lstIDs = StructKeyList( arrPath[ 1 ] ) />
-
- <cfquery name="qNodeDomain" datasource="#REQUEST.DSN.Source#">
- SELECT
- fwjn.id,
- fwjn.url_id_1,
- fwjn.url_id_2
- FROM
- forta_web_jn fwjn
- WHERE
- fwjn.url_id_2 IN ( <cfqueryparam value="#lstIDs#,0" cfsqltype="CF_SQL_INTEGER" list="yes" /> )
- </cfquery>
-
-
- <cfset objNodes = StructNew() />
-
- <cfloop query="qNodeDomain">
-
- <cfset objNodes[ qNodeDomain.url_id_1 ] = StructNew() />
- <cfset objNodes[ qNodeDomain.url_id_1 ].JoinID = qNodeDomain.id />
- <cfset objNodes[ qNodeDomain.url_id_1 ].TargetID = qNodeDomain.url_id_2 />
-
- </cfloop>
-
-
- <cfset ArrayPrepend( arrPath, objNodes ) />
-
-
- <cfif (
- StructKeyExists( objNodes, qTargetDomain.root_id ) OR
- (NOT StructCount( objNodes ))
- )>
-
- <cfbreak />
-
- </cfif>
-
- </cfloop>
-
-
- <cfif StructKeyExists( arrPath[ 1 ], qTargetDomain.root_id )>
-
-
- <p>
- A connection to Ben Forta was found!
- </p>
-
-
- <cfset intSourceID = qTargetDomain.root_id />
-
-
- <cfloop
- index="intStep"
- from="1"
- to="#ArrayLen( arrPath )#"
- step="1">
-
-
- <cfset objNode = arrPath[ intStep ] />
-
-
- <cfset objStep = objNode[ intSourceID ] />
-
- <cfquery name="qStep" datasource="#REQUEST.DSN.Source#">
- SELECT
- fwjn.url,
- fwjn.title,
- fwjn.url_id_1,
- fwjn.url_id_2,
- ( f1.url ) AS source_url,
- ( f2.url ) AS target_url
- FROM
- forta_web_jn fwjn
- INNER JOIN
- forta_web f1
- ON
- fwjn.url_id_1 = f1.id
- INNER JOIN
- forta_web f2
- ON
- fwjn.url_id_2 = f2.id
- WHERE
- fwjn.id = <cfqueryparam value="#objNode[ intSourceID ].JoinID#" cfsqltype="CF_SQL_INTEGER" />
- </cfquery>
-
-
- <h3>
- Step #intStep#
- </h3>
-
- <p>
- <strong>#qStep.source_url#</strong> - to -
- <strong>#qStep.target_url#</strong> via:<br />
- <a href="#qStep.url#" target="_blank">#qStep.url#</a>
- </p>
-
- <cfset intSourceID = qStep.url_id_2 />
-
-
- <cfif (intSourceID EQ qTargetDomain.id)>
-
- <p>
- <em>Done!</em>
- </p>
-
- <cfbreak />
-
- </cfif>
-
- </cfloop>
-
-
- <cfelse>
-
- <p>
- <em>No connection to Ben Forta could be found :(</em>
- </p>
-
- </cfif>
-
-
- <cfelse>
-
-
- <p>
- <em>That domain was not found on FullAsAGoog's Blogs on Tap.</em>
- </p>
-
-
- </cfif>
-
- </cfif>
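The backward walk in the listing above is essentially a breadth-first search over the join table, starting at the target blog and stopping as soon as the root blog turns up. The Java sketch below illustrates that idea under a few assumptions: references are passed in as plain `{from, to}` ID pairs instead of being queried from forta_web_jn, the IDs are made up, and it keeps a visited check that the CFML version does not have.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ReferenceChain {

    // Walk backwards from target, one level per hop, until root is reached or
    // maxHops levels have been explored. Returns the chain root..target,
    // or an empty list when no chain is found.
    public static List<Integer> findChain(int[][] refs, int root, int target, int maxHops) {
        // For each discovered blog, the blog it references on the way to target.
        Map<Integer, Integer> next = new HashMap<>();
        Set<Integer> frontier = new HashSet<>();
        frontier.add(target);

        for (int hop = 1; hop <= maxHops && !frontier.isEmpty(); hop++) {
            Set<Integer> found = new HashSet<>();
            for (int[] ref : refs) {
                int from = ref[0];
                int to = ref[1];
                // Keep only blogs that point into the current level and are new.
                if (frontier.contains(to) && from != target && !next.containsKey(from)) {
                    next.put(from, to);
                    found.add(from);
                }
            }
            if (found.contains(root)) {
                // Follow the recorded links forward from root to target.
                List<Integer> chain = new ArrayList<>();
                for (int id = root; id != target; id = next.get(id)) {
                    chain.add(id);
                }
                chain.add(target);
                return chain;
            }
            frontier = found;
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Hypothetical IDs: 1 is the root blog, 4 is the blog being searched for.
        int[][] refs = { {1, 2}, {2, 3}, {3, 4}, {4, 2} };
        System.out.println(findChain(refs, 1, 4, 5)); // prints: [1, 2, 3, 4]
    }
}
```

Because the search stops at the first level that contains the root, the chain it returns is one of the shortest available. Passing 5 as `maxHops` mirrors the five passes of the CFML loop (depth 2 through 6).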
That's all there is to it. This was a neat little experiment done on a small scale. It's not the most accurate and certainly not comprehensive; I have absolutely no idea how you would accomplish something like this on a grander scale, or even how you would keep the "web" of blog-to-blog references up to date. I guess that is why Google has a bazillion computers all spidering the web all the time.
What a creative idea Ben! I dig it.
Posted by Dave Shuck
on Apr 9, 2007
at 12:48 PM
Thanks Dave. Can't let Kevin Bacon have all the fun ;)
Posted by Ben Nadel
on Apr 9, 2007
at 12:50 PM
Ben's ma daddy.
I win
Posted by Critter
on Apr 9, 2007
at 12:58 PM
Small and nice game. Thanks!
Posted by Duncan Flooring
on Apr 12, 2007
at 7:50 AM
You make a lot of fun little aps man. :) 2 steps away -- through Peter Bells blog here!
Posted by Adam Fortuna
on Apr 13, 2007
at 1:46 PM
@Adam,
Thanks man. I like to have a lot of fun with this ColdFusion stuff. The scope of this app is fairly small when you consider that like 3 new blogs are created every second. I can't even imagine how something like this would be maintained on a large scale.
Posted by Ben Nadel
on Apr 13, 2007
at 1:56 PM
Good information thanks..
Posted by gizli kamera
on Apr 29, 2007
at 4:55 PM
Thank you very much Dave,
Posted by St Nicholas Picture
on May 11, 2007
at 6:21 AM