Short-Circuit Evaluation Is Fast

Posted June 2, 2006 at 8:23 AM

Tags: ColdFusion, Search Engine Optimization

As I wrote some time ago, taking Michael Dinowitz's advice, I turn off session management for spiders and bots in an effort to cut down on memory usage on the server. See, spiders do not accept client cookies and therefore (on my sites) cannot hold sessions. Consequently, they start a new session for each page request they make. Since sessions take some time to time out, this ends up creating large numbers of session variables that go unused (in proportion to the number of pages spidered).

When I first did this I used a Regular Expression (RegEx) to check for commonly known spider user agents (CGI.http_user_agent). It looked something like:

if (REFindNoCase( "slurp|googlebot|....", CGI.http_user_agent )){

This works great; however, as I added more spiders to the list (as they started hitting my site), I started to fear that it wasn't efficient. If you ever look at how a regular expression works by using a program such as The RegEx Coach, you can actually step through the RegEx path, and you will see that for every character it comes across in the target string, it does a LOT of logic for the regular expression. And the larger the expression, the more logic there is.

This got me thinking about short-circuit evaluation. I am not sure which version brought this on board, but ColdFusion MX 7 has this optimization. It means that evaluation of a Boolean expression in an IF statement is terminated just as soon as it is possible to tell what the result will be. In other words, if an IF statement has several parts joined by AND or OR and the first part determines the fate of the IF, then the remaining parts are not evaluated.

For example, in the following statement, only the first value is checked:

if (false AND true AND true AND true){ ... }

Since the "false" makes the statement false no matter what the rest of the arguments are, the remaining "true" values are not even evaluated.
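
One way to see this for yourself is to give each part of the IF a side effect. The following is a minimal sketch in CFScript; the Track() function and the callLog variable are hypothetical names made up for this illustration and are not part of the code on this site.

```cfscript
// Hypothetical helper: notes that it was called, then returns the
// value it was given so that it can be used inside an IF statement.
function Track( label, result ){
    VARIABLES.callLog = ListAppend( VARIABLES.callLog, label );
    return( result );
}

// Keeps the labels of every Track() call that actually ran.
VARIABLES.callLog = "";

// The first part is false, so the AND can never be true.
if (Track( "first", false ) AND Track( "second", true )){
    WriteOutput( "IF body ran. " );
}

// With short-circuit evaluation, only "first" shows up in the log
// because the second Track() call is never made.
WriteOutput( "Parts evaluated: " & VARIABLES.callLog );
```

The same trick works with OR: if the first part is true, the later parts should never be logged.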

I have taken this idea and applied it to the problem of turning off session management for spiders. Instead of using a regular expression, I break out each comparison to its own sub-part of an IF statement:

// Define the application. To stop unnecessary memory usage, we are going
// to give web crawler no session management. This way, they don't have
// to worry about cookie acceptance and object persistence (except for
// APPLICATION scope). Here, we are using short-circuit evaluation on the
// IF statement with the most popular search engines at the top of the
// list. This will help us minimize the amount of time that it takes to
// evaluate the list.
if (
    (NOT Len(CGI.http_user_agent)) OR
    FindNoCase( "Slurp", CGI.http_user_agent ) OR
    FindNoCase( "Googlebot", CGI.http_user_agent ) OR
    FindNoCase( "BecomeBot", CGI.http_user_agent ) OR
    FindNoCase( "msnbot", CGI.http_user_agent ) OR
    FindNoCase( "Mediapartners-Google", CGI.http_user_agent ) OR
    FindNoCase( "ZyBorg", CGI.http_user_agent ) OR
    FindNoCase( "RufusBot", CGI.http_user_agent ) OR
    FindNoCase( "EMonitor", CGI.http_user_agent ) OR
    FindNoCase( "researchbot", CGI.http_user_agent ) OR
    FindNoCase( "IP2MapBot", CGI.http_user_agent ) OR
    FindNoCase( "GigaBot", CGI.http_user_agent ) OR
    FindNoCase( "Jeeves", CGI.http_user_agent ) OR
    FindNoCase( "Exabot", CGI.http_user_agent ) OR
    FindNoCase( "SBIder", CGI.http_user_agent ) OR
    FindNoCase( "findlinks", CGI.http_user_agent ) OR
    FindNoCase( "YahooSeeker", CGI.http_user_agent ) OR
    FindNoCase( "MMCrawler", CGI.http_user_agent ) OR
    FindNoCase( "MJ12bot", CGI.http_user_agent ) OR
    FindNoCase( "OutfoxBot", CGI.http_user_agent ) OR
    FindNoCase( "jBrowser", CGI.http_user_agent ) OR
    FindNoCase( "ZiggsBot", CGI.http_user_agent ) OR
    FindNoCase( "Java", CGI.http_user_agent ) OR
    FindNoCase( "PMAFind", CGI.http_user_agent ) OR
    FindNoCase( "Blogbeat", CGI.http_user_agent ) OR
    FindNoCase( "TurnitinBot", CGI.http_user_agent ) OR
    FindNoCase( "ConveraCrawler", CGI.http_user_agent ) OR
    FindNoCase( "Ocelli", CGI.http_user_agent ) OR
    FindNoCase( "Labhoo", CGI.http_user_agent ) OR
    FindNoCase( "Validator", CGI.http_user_agent ) OR
    FindNoCase( "sproose", CGI.http_user_agent ) OR
    FindNoCase( "oBot", CGI.http_user_agent ) OR
    FindNoCase( "MyFamilyBot", CGI.http_user_agent ) OR
    FindNoCase( "Girafabot", CGI.http_user_agent ) OR
    FindNoCase( "aipbot", CGI.http_user_agent ) OR
    FindNoCase( "ia_archiver", CGI.http_user_agent ) OR
    FindNoCase( "Snapbot", CGI.http_user_agent ) OR
    FindNoCase( "Larbin", CGI.http_user_agent ) OR
    FindNoCase( "psycheclone", CGI.http_user_agent ) OR
    FindNoCase( "ColdFusion", CGI.http_user_agent )
    ){

    // This application definition is for robots that do NOT need sessions.
    THIS.Name = "KinkySolutions v.1 {dev}";
    THIS.SessionManagement = false;
    THIS.SetClientCookies = false;
    THIS.ClientManagement = false;
    THIS.SetDomainCookies = false;

    // Set the flag for session use.
    REQUEST.HasSessionScope = false;

} else {

    // This application is for the standard user.
    THIS.Name = "KinkySolutions v.1 {dev}";
    THIS.SessionManagement = true;
    THIS.SetClientCookies = true;
    THIS.SessionTimeout = CreateTimeSpan(0, 0, 20, 0);
    THIS.LoginStorage = "SESSION";

    // Set the flag for session use.
    REQUEST.HasSessionScope = true;

}

Now, regular expressions do short-circuit evaluation also, so the difference here is subtle. Let's say that we get a page request from a non-spider user agent. This is the "worst case" scenario since we will have to check every spider value against the string. With a regular expression, we would have to run through the matching process for each of the (N) spider values for each of the (C) characters in the user agent. That's NxC iterations. However, in the compound IF statement, we would only have to run the matching process once for each spider for each (U) instance of a user agent. That's just NxU, and since U is always one, it's just N iterations.

Now this is misleading because, for string comparison, each substring still has to be matched against many characters in the target string. But I am sure (though I do not know for a fact) that literal matching must be faster than RegEx matching since there is no "logic" to literal matching.

If we do get a request from a popular spider (higher in the IF statement, earlier in the regular expression), it's still faster to have the compound IF statement. See, the regular expression still needs to be checked in its entirety for EACH character it comes across in the target string. But the IF statement only needs to run a subset of its sub-parts, and each of those runs just once.

Of course, in practice, they all run between 0 and 16 ms per page hit. Over a large number of iterations (10,000+), though, the compound IF statement is orders of magnitude faster.
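
If you want to try this kind of comparison yourself, a rough timing loop might look like the sketch below. It is not the exact test I ran: the user agent string is made up (a non-spider, so it is the worst case), the spider list is cut down to three entries to keep it short, and the variable names are just for this example.

```cfscript
// Assumption: a made-up, non-spider user agent (the worst case).
userAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)";
iterations = 10000;

// Time the regular expression approach.
startTime = GetTickCount();
for (i = 1 ; i LTE iterations ; i = i + 1){
    isSpider = REFindNoCase( "slurp|googlebot|msnbot", userAgent );
}
regexTime = (GetTickCount() - startTime);

// Time the compound IF approach (same three spiders).
startTime = GetTickCount();
for (i = 1 ; i LTE iterations ; i = i + 1){
    isSpider = (
        FindNoCase( "Slurp", userAgent ) OR
        FindNoCase( "Googlebot", userAgent ) OR
        FindNoCase( "msnbot", userAgent )
        );
}
findTime = (GetTickCount() - startTime);

// Report both timings in milliseconds.
WriteOutput( "RegEx: " & regexTime & " ms, compound IF: " & findTime & " ms" );
```

The numbers will depend on your server and on how many spiders are in the list, so treat any single run with some suspicion and repeat it a few times.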

Furthermore, you can make it even faster by creating a temporary string of the LCase() of the user agent and then doing Find() rather than FindNoCase() for each sub-part (not shown above).
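
Here is a minimal sketch of what that might look like. The userAgent variable is a name I am making up for this example, and the list is shortened to three spiders; the full list would follow the same pattern. Note that every search string has to be written in lowercase, since Find() is case sensitive.

```cfscript
// Lowercase the user agent once, up front, so that each sub-part
// can use the case-sensitive Find() instead of FindNoCase().
userAgent = LCase( CGI.http_user_agent );

if (
    (NOT Len( userAgent )) OR
    Find( "slurp", userAgent ) OR
    Find( "googlebot", userAgent ) OR
    Find( "msnbot", userAgent )
    ){

    // Spider (or empty user agent): no session scope.
    REQUEST.HasSessionScope = false;

} else {

    // Standard user: session scope is available.
    REQUEST.HasSessionScope = true;

}
```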

Reader Comments

Would a switch statement be even faster in this case Ben?

Posted by Andy Matthews on Oct 24, 2007 at 3:48 PM


@Andy,

A Switch statement would not quite work in a situation like this since we are not matching on the entire user agent, just parts of it.

Posted by Ben Nadel on Oct 24, 2007 at 7:36 PM

