I had a series of blog posts a while back that discussed ColdFusion session management and spiders. In those posts, I was actually disabling session management for users that were believed to be spiders or bots. This was a technique that I originally learned from Michael Dinowitz. Some time later, in a discussion also with Michael Dinowitz, he was explaining to me that he no longer did this. Instead, he used a slightly altered technique in which all users get session management with the caveat that the session timeout given to spiders and bots is much smaller (around 2 seconds).
When you have one user that gets session management and one user that does not, your page requires additional logic in all places that touch the user's session object; certain code will have to be excluded from execution if a user has no session. This new technique allows the page to execute without exception cases while at the same time accounting for the "pseudo memory leak" caused by extensive and cookieless spider traffic.
The entries that I have been posting recently about sessions that expire mid-page were in preparation for this post. I wanted to make sure that giving a user a very short session timeout would not cause problems on pages that had a longer than usual execution time. And, since we have found that the SESSION object is available for the entire request no matter what happens in terms of a timeout, I now feel it is safe to introduce this code.
While I use Application.cfc almost exclusively, I have decided to demonstrate this using Application.cfm. Some people have asked to see my original session management posts with the CFApplication tag, and since these are along the same lines, I figured I would downgrade the example to cover more bases.
All the important logic here takes place in the Application.cfm where the ColdFusion application is defined:
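A minimal sketch of what such an Application.cfm could look like, assuming a hypothetical application name and an illustrative, far-from-complete list of user agent markers (neither is taken from the original listing). It starts with the standard 20-minute timeout and overrides it with 2 seconds when the request looks like a spider or carries the TestShortSession URL flag:

```cfm
<!---
	Sketch only. The application name and the marker list below are
	assumptions for illustration. Start with the standard-user timeout
	(20 minutes); suspected spiders and bots override it further down.
--->
<cfset REQUEST.SessionTimeout = CreateTimeSpan( 0, 0, 20, 0 ) />

<!--- Lowercase the user agent once, since Find() is case-sensitive. --->
<cfset REQUEST.UserAgent = LCase( CGI.http_user_agent ) />

<!---
	The first test is the developer hook; the remaining tests look for
	common spider / bot / feed reader markers. OR short-circuits, so the
	checks stop at the first match.
--->
<cfif (
	StructKeyExists( URL, "TestShortSession" ) OR
	Find( "bot", REQUEST.UserAgent ) OR
	Find( "spider", REQUEST.UserAgent ) OR
	Find( "crawl", REQUEST.UserAgent ) OR
	Find( "slurp", REQUEST.UserAgent ) OR
	Find( "rss", REQUEST.UserAgent ) OR
	Find( "feed", REQUEST.UserAgent )
	)>

	<!--- Suspected spider: the session expires after 2 seconds. --->
	<cfset REQUEST.SessionTimeout = CreateTimeSpan( 0, 0, 0, 2 ) />

</cfif>

<!--- Define the application once, using whichever timeout was chosen. --->
<cfapplication
	name="ShortSessionDemo"
	applicationtimeout="#CreateTimeSpan( 0, 0, 5, 0 )#"
	sessionmanagement="true"
	sessiontimeout="#REQUEST.SessionTimeout#"
	setclientcookies="true"
	/>
```

Keeping the timeout in a request-scoped variable means each page request computes its own value before the single CFApplication tag runs.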
When it comes to defining the CFApplication tag with our goal in mind, the only difference is the SessionTimeout property. Everything else about the CFApplication tag is exactly the same. As such, we are using our logic to store the session timeout in a variable and then just defining the application in one place with one CFApplication tag.
In order to set the proper session timeout, we are testing the user agent making the page request. As discussed in my previous posts on ColdFusion session management, many popular spiders, bots, and RSS feed readers have special user agents that set them apart from your standard Firefox, Safari, and IE users. Therefore, by testing for the existence of these markers, we can figure out (with good success) who is who.
In addition to testing user agents, you will notice that the first line of my CFIF statement checks for the TestShortSession key in the URL scope. This is a hook for developers to test page requests that have short sessions without having to spoof a spider's user agent.
Then, to make sure this was working, I set up a simple index.cfm page that does a CFDump of the application settings using the undocumented GetApplicationSettings() method:
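A test page along these lines could be as small as the sketch below. It assumes that GetApplicationSettings() is called on the APPLICATION scope, as the comment thread suggests; because the method is undocumented, it may not exist on every ColdFusion version or CFML engine. The label text is arbitrary.

```cfm
<!---
	Sketch only: dump the settings of the application that the current
	request is associated with. GetApplicationSettings() is undocumented
	and is assumed here to be a method on the APPLICATION scope.
--->
<cfdump
	var="#APPLICATION.GetApplicationSettings()#"
	label="Application Settings"
	/>
```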
Now, running the page as a standard user, we get the following CFDump output:
[Image: CFDump output of the application settings for a standard user]
Notice that the SessionTimeout has the value 1200. This is the number of seconds allocated for session timeout (1200 seconds = 20 minutes * 60 seconds). This is just what we want for a standard user.
Now, when we re-run the page, putting ?TestShortSession in the URL, we get the following CFDump output:
[Image: CFDump output of the application settings with ?TestShortSession in the URL]
Notice that this time, the SessionTimeout value is 2 seconds. This is just what we want for spiders and bots: even if a spider hits your site 10,000 consecutive times, creating a new session each time, the resulting explosion in memory usage will at least be very short-lived.
Comments (17)
This is a great idea. I hadn't realized until recently that bots don't carry sessions. I just hadn't thought about it before. However, in the sessiontracker app I am building (to be released soon; I am recoding it for Flex/AIR), I can see right now 15 active sessions on my blog (10 min in length), and 12 of them are bots. And those are just the bots I identify. There is SO much more bot activity on my sites than I realized. All it takes is one bad one trying to index everything before you have problems.
Posted by Joshua on Dec 13, 2007 at 9:38 AM
@Joshua,
Yeah. And can you imagine something like House Of Fusion which probably has hundreds of thousands of pages?!? The traffic from spiders must just be insane.
Posted by Ben Nadel on Dec 13, 2007 at 10:06 AM
Here is the same code...ish for Application.cfc:
<code>
<cfset this.sessionTimeout="#createtimespan(0,1,0,0)#" />
<!--- Check for bots - give a short lifespan --->
<cfif Len(CGI.HTTP_USER_AGENT) GT 0>
	<!--- Entries such as "bot\b" are regular expressions, so match with REFindNoCase() rather than the literal, case-sensitive Find(). --->
	<cfloop list="bot\b,crawl,\brss,feed,news,blog,reader,syndication,coldfusion,slurp,google,zyborg,emonitor,jeeves" index="bot" delimiters=",">
		<cfif REFindNoCase(bot, CGI.HTTP_USER_AGENT)>
			<cfset this.sessionTimeout="#createtimespan(0,0,0,2)#" />
		</cfif>
	</cfloop>
<cfelse>
	<!--- No user agent at all - also give a short lifespan --->
	<cfset this.sessionTimeout="#createtimespan(0,0,0,2)#" />
</cfif>
</code>
Posted by Randy Merrill on Dec 13, 2007 at 11:33 AM
@Randy,
Good stuff. I know it's only "code...ish", but my only suggestions would be to make the first session timeout the default "standard user" timeout. That way, we don't need an ELSE statement in our logic. The Bot timeout simply becomes the override for special cases. Also, we could throw a CFBreak tag in the loop if we find a match. The second we find a bot type, we don't need to keep checking.
Posted by Ben Nadel on Dec 13, 2007 at 11:38 AM
Good points... revised:
<cfset this.sessionTimeout="#createtimespan(0,1,0,0)#" />
<!--- Check for bots - give a short lifespan --->
<cfif Len(CGI.HTTP_USER_AGENT)>
	<!--- Entries such as "bot\b" are regular expressions, so match with REFindNoCase() rather than the literal, case-sensitive Find(). --->
	<cfloop list="bot\b,crawl,\brss,feed,news,blog,reader,syndication,coldfusion,slurp,google,zyborg,emonitor,jeeves" index="bot" delimiters=",">
		<cfif REFindNoCase(bot, CGI.HTTP_USER_AGENT)>
			<cfset this.sessionTimeout="#createtimespan(0,0,0,2)#" />
			<!--- Stop checking as soon as one pattern matches --->
			<cfbreak />
		</cfif>
	</cfloop>
</cfif>
Posted by Randy Merrill on Dec 13, 2007 at 11:46 AM
Niiice.
Posted by Ben Nadel on Dec 13, 2007 at 12:05 PM
Cool post Ben, very interesting. I'm very surprised how making a change in the CFAPPLICATION tag only gets applied to the session for the current request/user. I would have thought that once you set the session timeout to 2 seconds, that EVERYBODY's session would time out.
It would be nice if someone from the CF team could comment on this technique for giving different users different session timeouts. I'm curious if there are any unforeseen negative consequences.
One thing that concerns me: I wonder if it's possible for a "spider" and a regular user to execute the CFAPPLICATION block at the same time (using the 2 second timeout), causing both users to get assigned the 2 second session timeout. I know the timeout value is request scoped so the values themselves are safe from each other; I just don't know the internals of CFAPPLICATION as I've never really had to do anything special with it before. Does anybody KNOW the answer to this? Sean?
Maybe I'm just being paranoid.
Posted by Kurt Bonnet on Dec 13, 2007 at 1:38 PM
@Kurt,
You don't have to worry about a spider and a user hitting the CFApplication tag at the exact same time. Remember, applications aren't really "running". Applications are just some chunk of memory that each request associates with using a special key (the app name). As such, each page request isn't really running the CFApplication tag to start the app; each page request is really just associating itself with that application as defined in the CFApplication tag.
Then, the session management stuff is assigned to the current user of that page request. So, even if two users hit the tag at the same time, they still get individual results. Don't think of session management as defining the Application... rather, think of it as defining the page request that is associated to the given application.
At least that is how I think about it. I hope I am not totally misleading people here!
Posted by Ben Nadel on Dec 13, 2007 at 1:47 PM
Google only indexes about 55,000 pages a day on House of Fusion. It'll be more once I put the new SEO forums code into effect. I love onMissingTemplate(). :)
(I cover all this in my next FAQU article.)
Posted by Michael Dinowitz on Dec 13, 2007 at 2:42 PM
The bot list WhosOnCFC uses (thank you Joshua) is fairly comprehensive, but nowhere near complete. I have found that a lot of spiders/bots don't always play nice and mask their user-agent. I know one site in Beijing, China that will generally jump my user count up to around 200 slurping down pages. Easy.
Looking at how you have your application set up, I will probably add it to my major public applications. I was also thinking of ways to work it into WhosOnCFC since you are able to see where the IP address is originating from. Several clients from one IP address is understandable. 200 is something altogether different.
I also had some misgivings about setting two separate session timeouts in the application. Now I think it is definitely something to look into.
Great article Ben.
Posted by Shane Zehnder on Dec 13, 2007 at 6:31 PM
There are only a handful of bots that really have to be worried about. Blocking the major search engines should be all that's needed.
An alternative to all this is to have a piece of code in the onRequestEnd() that checks if the visitor is a bot and then kills the session. This guarantees that the session will exist as long as the page run did.
Posted by Michael Dinowitz on Dec 13, 2007 at 6:40 PM
I tried this before but then I got users saying they lost their shopping cart so had to turn it off. Ideas?
<cfscript>
isRegVisit = 1;
if(REFind("bot|spider|crawl|google|yahoo|slurp|scooter|lycos|gulliver|infoseek|architext|ia_archiver|crawler|shop|scrubby|teoma|robozilla|nutch|asterias|zyborg|sidewinder", httpUA)) { // it's a search spider. (bot and spider cover many.)
isRegVisit = 0; //use below for session management
}
</cfscript>
<cfapplication name="#request.DS#" clientmanagement="Yes" sessionmanagement="Yes" setclientcookies="Yes" sessiontimeout="#CreateTimeSpan(isRegVisit*1, 0, 0, 1)#">
Posted by ziggy on Dec 13, 2007 at 10:53 PM
@Ziggy,
Have you been able to duplicate the "losing cart" scenario? Or is this what you hear from some users? If it is just a few users, the session timeout is not the issue. Maybe they are not accepting cookies. Maybe they have a really strange User Agent value that is showing up like a spider.
Posted by Ben Nadel on Dec 14, 2007 at 7:08 AM
When, and how, did you discover the getApplicationSettings() method of the App object? It is pretty interesting. For app.cfc, it seems to just mirror the This scope. I'm thinking of filing an ER to ask Adobe to "open" this as a real (documented) method, i.e., something you could use as a function by itself, not off the App scope.
Posted by Raymond Camden on Dec 14, 2007 at 7:04 PM
@Ray,
I discovered it when I was answering a question about testing for session management:
http://www.bennadel.com/index.cfm?dax=blog:943.view
I have a UDF that just dumps out all the Java methods on a given object, and I happened to notice it. But certainly, this should totally be a documented function. Seems pretty useful. It's even named like a ColdFusion method :)
Posted by Ben Nadel on Dec 14, 2007 at 8:33 PM
>>If it is just a few users, the session timeout is not the issue. Maybe they are not accepting cookies.
Yes, only some users, but enough. I recall they were on regular browsers. All I can say is when I made the change people started complaining. I think it was related to pages with internal redirects, I'm not sure. I couldn't reproduce it or figure it out myself. I put it back and the complaints stopped.
Why don't you use a regex like my code? Seems much tidier.
Posted by ziggy on Dec 15, 2007 at 3:12 AM
@Ziggy,
I hate that! When it's impossible to reproduce something that other people complain about. That's like the most impossible thing to debug; like fighting a ghost.
As for the regular expressions, I actually used to do it your way. But, then I switched to the Find() statements for two reasons:
1. It felt easier to maintain.
2. I have it in my head that short-circuited IF/Find() combinations are faster than regular expressions. I don't quite know if this is fact. A while back, there was a big discussion on which was faster:
http://www.bennadel.com/index.cfm?dax=blog:410.view
Even after just re-reading the comments (which is where the meat of the conversation takes place), I am not 100% sure which way I would go.
Basically, as the number of spiders gets bigger, I just find the Find() cases easier to read than the "|" statements. At this point, though, especially with the small set of matches, it's really just a preference thing.
Posted by Ben Nadel on Dec 15, 2007 at 2:59 PM