Detecting Spam In User-Submitted Content With SpamAnalyzer.cfc
Fighting SPAM is a never-ending battle. My blog uses a "blacklist" approach that checks user-submitted content against a set of regular expression patterns. For the last 6 years, this has been implemented as a compound CFIF statement containing REFindNoCase() function calls. To date, this single CFIF statement has close to 2,000 OR operators. Lately, this massive statement has been causing Stack Overflow errors in the ColdFusion compiler:
... and so it goes ...
I've run into this problem before; but now, it's happening much more frequently and is preventing people from posting comments to the blog. To ease the burden on the compiler, I've decided to extract my anti-spam blacklist into its own ColdFusion component that can compile data files down into Java Pattern objects. This makes the logic extremely simple and factors out the regular expression patterns into a single place that can be easily updated.
I've created this SpamAnalyzer.cfc as a project on GitHub.
To instantiate the SpamAnalyzer.cfc, you have to provide it with three file paths:
- User name file path.
- User URL file path.
- User content file path.
Each file should contain Java-compatible regular expressions (more robust than ColdFusion regular expressions). The ColdFusion component expects one pattern per line and will automatically trim each pattern and turn on the ignore-case flag.
Right now, I am only analyzing names, URLs, and content (i.e., comments). I shy away from analyzing IP addresses and email addresses as those don't feel like they can be "analyzed".
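As a rough illustration of that loading step, here is a minimal Java sketch. It is not the code inside SpamAnalyzer.cfc; the class and method names (PatternLoader, compileLines, compileFile) are hypothetical, and both the blank-line handling and the UTF-8 file encoding are assumptions the post doesn't cover.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of SpamAnalyzer.cfc.
final class PatternLoader {

    // Compiles one pattern per line: each line is trimmed and compiled
    // with the ignore-case flag turned on.
    static List<Pattern> compileLines(List<String> lines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Assumption: blank lines are skipped rather than compiled.
            if (trimmed.isEmpty()) {
                continue;
            }
            patterns.add(Pattern.compile(trimmed, Pattern.CASE_INSENSITIVE));
        }
        return patterns;
    }

    // Reads a pattern file (assumed to be UTF-8) and compiles its lines.
    static List<Pattern> compileFile(Path path) throws IOException {
        return compileLines(Files.readAllLines(path, StandardCharsets.UTF_8));
    }
}
```

Compiling the patterns once, up front, means each later check is just a loop over ready-made Pattern objects instead of one enormous expression the compiler has to parse.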
I have purposely broken the analysis up into three different files because each context contains its own rules. For example, I definitely don't want a user's name to contain the word "ghostwriter"; however, I can't make a hard rule that a user's content shouldn't contain such a phrase. Therefore, each context - name, URL, and content - gets its own set of patterns.
Likewise, each context gets its own analysis method:
- analyzeUserContent( userContent )
- analyzeUserName( userName )
- analyzeUserUrl( userUrl )
Each of these methods returns a "spam report" that has the following keys:
- isSpam - True/False indicating result of analysis.
- inputType - Either "userContent", "userName", or "userUrl".
- input - The user-submitted value being analyzed.
- pattern - The regular expression pattern that caused the content to be flagged as spam.
The nice thing about this, when compared to my massive, compound CFIF statement, is that I will know which pattern caused the user input to be flagged as Spam. Blacklisting is an "art". And, there's no doubt that I get it wrong from time to time. Now, when a "real" user complains that they are being blocked, I'll quickly be able to determine which pattern is causing their submission to fail.
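To make the shape of that report concrete, here is a hedged Java sketch of a single analysis pass. The four keys come from the list above; everything else is an assumption: the class name SpamReportSketch, the use of a Map for the report, stopping at the first matching pattern, and leaving "pattern" as an empty string when nothing matches.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of the analysis step; not the actual CFC implementation.
final class SpamReportSketch {

    // Checks the input against each pattern in order and returns a report
    // with the keys isSpam, inputType, input, and pattern.
    static Map<String, Object> analyze(String inputType, String input, List<Pattern> patterns) {
        Map<String, Object> report = new LinkedHashMap<>();
        report.put("isSpam", false);
        report.put("inputType", inputType);
        report.put("input", input);
        // Assumption: an empty string is reported when no pattern matched.
        report.put("pattern", "");

        for (Pattern candidate : patterns) {
            // find() matches anywhere in the input, like REFindNoCase() would.
            if (candidate.matcher(input).find()) {
                report.put("isSpam", true);
                report.put("pattern", candidate.pattern());
                break; // Assumption: the first matching pattern wins.
            }
        }
        return report;
    }
}
```

Because the matching pattern travels with the report, a false positive can be traced back to a single line in a pattern file.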
If you want, you can check out the project on GitHub. I'll be updating the pattern files as the battle continues on my blog.
After reading this, you may be wondering why I don't use some external SPAM API? Simple - for control. I want control over how this stuff works (and I want to be able to fix it when it breaks).
Ha ha - "bananas" will certainly always be OK in my book!
So I'm not allowed to rename myself "Sexy Peter" without getting branded a spammer? :(
True - but you'll always be "sexy peter" to us!
Had you reviewed or tried SebTools SpamFilter.CFC before?
If so, how does it compare? Thanks.
I had not heard of that before (at least not that I remember - it looks like it was released a few years ago). Based upon a cursory glance, it looks like the intent is similar in that it also uses regular expression patterns to analyze data. That said, it seems Steve's implementation is tied more to a database and mine is tied more to plain text files.
But, high-level, they appear very similar.
One thing I'd like to do is allow the patterns to be passed into the constructor directly. This way, you can rely on file paths IF you want; or, you can load them from some unknown source and then just pass them in.
Wow Ben, really nice and informative topic. I was searching for a spam filtering technique like the one you shared in this post. I will try to convert this technique from ColdFusion to PHP.
Slight detour to this thread:
It looks like you're mainly concerned with automated spam.
The most successful (and easiest) way that I've found to prevent this kind of submission is to put a CSS hidden field in the form and on the server side, make sure that field is blank upon submission.
Automated bots seem to just fill out every form field because they're not sure which ones are required. This is completely hidden from the user and works nearly 100% of the time.
Word up, I believe that is known as the "honey pot" approach. I definitely believe in that whole-heartedly! I suggest a combination of approaches since spammers seem to be unstoppable :D
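For anyone who wants to try the hidden-field idea, the server-side check can be sketched in Java along these lines. The field name "website_confirm" and the class name HoneypotCheck are made up for this example, and the form is modelled as a plain Map of submitted values rather than any particular framework's request object.

```java
import java.util.Map;

// Hypothetical server-side honeypot check; names are illustrative only.
final class HoneypotCheck {

    // Assumed name of the CSS-hidden form field that real users never see.
    static final String HONEYPOT_FIELD = "website_confirm";

    // Returns true when the hidden field came back non-blank, which
    // suggests an automated bot filled in every field on the form.
    static boolean looksAutomated(Map<String, String> formFields) {
        String value = formFields.get(HONEYPOT_FIELD);
        return value != null && !value.trim().isEmpty();
    }
}
```

A submission that trips this check can be rejected before any pattern matching runs, which fits the combination of approaches suggested above.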
So I'm confused. Is Akismet no longer a good enough spam detector for comments? I use Akismet for my site and client sites and have never had a comment spam issue. Plus, there's tons of plugins to increase site security and other related stuff.
Akismet never ceases to amaze me. It's scary looking at the logs to see how many times my sites are hit up each day.
Also "Sexy Peter." That's funny. LOL.
Off subject, has anyone noticed the insane Facebook spam going on lately? In particular, Facebook only seems to suggest friends that are fake accounts. It's obvious they're fake because they are all pictures of super hot girls whom I don't know.
I've noticed it too. Also, a lot more spam seems to be getting through the Gmail filter.
Hi there, this blog is very interesting, thanks for sharing information. We can't predict when evils like spam will come to our inbox.