Fighting SPAM is a never-ending battle. My blog uses a "blacklist" approach that checks user-submitted content against a set of regular expression patterns. For the last 6 years, this has been implemented as a compound CFIF statement containing REFindNoCase() function calls. To date, this single CFIF statement has close to 2,000 OR operators. Lately, this massive statement has been causing Stack Overflow errors in the ColdFusion compiler:
... and so it goes ...
I've run into this problem before; but now, it's happening much more frequently and is preventing people from posting comments to the blog. To ease the burden on the compiler, I've decided to extract my anti-spam blacklist into its own ColdFusion component that compiles data files down into Java Pattern objects. This keeps the matching logic extremely simple and factors the regular expression patterns out into a single place that can be easily updated.
I've created this SpamAnalyzer.cfc as a project on GitHub.
To instantiate the SpamAnalyzer.cfc, you have to provide it with three file paths:
- User name file path.
- User URL file path.
- User content file path.
Each file should contain Java-compatible regular expressions (which are more robust than ColdFusion regular expressions). The ColdFusion component expects one pattern per line and will automatically trim each pattern and turn on the ignore-case flag.
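To make the "one pattern per line, trimmed, ignore-case" rule concrete, here is a minimal Java sketch of that compile step. It is not the code inside SpamAnalyzer.cfc; the class name `PatternFileCompiler`, both method names, the UTF-8 encoding, and the choice to skip blank lines are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not the actual SpamAnalyzer.cfc implementation.
class PatternFileCompiler {

    // Reads a pattern file (assumed to be UTF-8) and compiles its lines.
    static List<Pattern> compileFile(String filePath) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(filePath), StandardCharsets.UTF_8);
        return compileLines(lines);
    }

    // Compiles one pattern per line: trims each line and turns on
    // the ignore-case flag.
    static List<Pattern> compileLines(List<String> lines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Assumption: blank lines are skipped rather than compiled.
            if (trimmed.isEmpty()) {
                continue;
            }
            patterns.add(Pattern.compile(trimmed, Pattern.CASE_INSENSITIVE));
        }
        return patterns;
    }
}
```

Because each line is compiled exactly once, a malformed pattern surfaces as a `PatternSyntaxException` at load time rather than in the middle of handling a comment.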
Right now, I am only analyzing names, URLs, and content (i.e., comments). I shy away from analyzing IP addresses and email addresses as those don't feel like they can be "analyzed".
I have purposely broken the analysis up into three different files because each context contains its own rules. For example, I definitely don't want a user's name to contain the word "ghostwriter"; however, I can't make a hard rule that a user's content shouldn't contain such a phrase. Therefore, each context - name, URL, and content - gets its own set of patterns.
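The "ghostwriter" example can be sketched in Java as two independent pattern lists. Everything here is hypothetical: the class name `ContextRules`, the sample patterns, and the use of a substring-style `find()` rather than a full-string match are assumptions, not details taken from the project.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of per-context pattern sets.
class ContextRules {

    // A name containing "ghostwriter" is treated as spam...
    static final List<Pattern> NAME_PATTERNS = Arrays.asList(
        Pattern.compile("ghostwriter", Pattern.CASE_INSENSITIVE));

    // ...but the content list deliberately has no such rule.
    // The pattern below is an invented placeholder.
    static final List<Pattern> CONTENT_PATTERNS = Arrays.asList(
        Pattern.compile("buy\\s+cheap\\s+essays", Pattern.CASE_INSENSITIVE));

    // Returns true if any pattern is found anywhere in the input.
    static boolean matchesAny(List<Pattern> patterns, String input) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(input).find()) {
                return true;
            }
        }
        return false;
    }
}
```

With these lists, a user named "Best Ghostwriter Services" is flagged, while a comment such as "I hired a ghostwriter for my memoir." passes the content check untouched.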
Likewise, each context gets its own analysis method:
- analyzeUserContent( userContent )
- analyzeUserName( userName )
- analyzeUserUrl( userUrl )
Each of these methods returns a "spam report" that has the following keys:
- isSpam - True/False indicating result of analysis.
- inputType - Either "userContent", "userName", or "userUrl".
- input - The user-submitted value being analyzed.
- pattern - The regular expression pattern that caused the content to be flagged as spam.
The nice thing about this, when compared to my massive, compound CFIF statement, is that I will know which pattern caused the user input to be flagged as spam. Blacklisting is an "art". And, there's no doubt that I get it wrong from time to time. Now, when a "real" user complains that they are being blocked, I'll quickly be able to determine which pattern is causing their submission to fail.
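Here is a rough Java sketch of how a report with those four keys could be built. It assumes the analysis stops at the first matching pattern and that `pattern` is an empty string when nothing matches; neither detail is confirmed by the project, and the `SpamReport` and `SpamReportSketch` names are invented for this example.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical report type mirroring the four keys listed above.
final class SpamReport {
    final boolean isSpam;
    final String inputType; // "userContent", "userName", or "userUrl"
    final String input;     // the user-submitted value being analyzed
    final String pattern;   // assumption: empty string when nothing matched

    SpamReport(boolean isSpam, String inputType, String input, String pattern) {
        this.isSpam = isSpam;
        this.inputType = inputType;
        this.input = input;
        this.pattern = pattern;
    }
}

// Hypothetical analyzer; not the actual SpamAnalyzer.cfc logic.
class SpamReportSketch {

    // Assumption: the first pattern found in the input decides the result.
    static SpamReport analyze(String inputType, String input, List<Pattern> patterns) {
        for (Pattern candidate : patterns) {
            if (candidate.matcher(input).find()) {
                return new SpamReport(true, inputType, input, candidate.pattern());
            }
        }
        return new SpamReport(false, inputType, input, "");
    }
}
```

Carrying the offending pattern in the report is what turns a false positive into a quick lookup: the report names the exact line in the pattern file to loosen or remove.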
If you want, you can check out the project on GitHub. I'll be updating the pattern files as the battle continues on my blog.
After reading this, you may be wondering why I don't use some external SPAM API. Simple - for control. I want control over how this stuff works (and I want to be able to fix it when it breaks).