Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
I am the chief technical officer at InVision App, Inc - a prototyping and collaboration platform for designers, built by designers. I also rock out in JavaScript and ColdFusion 24x7.

Detecting Spam In User-Submitted Content With SpamAnalyzer.cfc

By Ben Nadel on
Tags: ColdFusion

Fighting SPAM is a never-ending battle. My blog uses a "blacklist" approach that checks user-submitted content against a set of regular expression patterns. For the last 6 years, this has been implemented as a compound CFIF statement containing REFindNoCase() function calls. To date, this single CFIF statement has close to 2,000 OR operators. Lately, this massive statement has been causing Stack Overflow errors in the ColdFusion compiler:

StackTrace: java.lang.StackOverflowError
at coldfusion.compiler.ExprNode.subexpr(ExprNode.java:39)
at coldfusion.compiler.ExprAssembler.assembleExpr(ExprAssembler.java:345)
at coldfusion.compiler.ExprAssembler.cast(ExprAssembler.java:1342)
at coldfusion.compiler.StmtAssembler.cast(StmtAssembler.java:406)
at coldfusion.compiler.ExprAssembler.assembleExpr(ExprAssembler.java:187)
at coldfusion.compiler.ExprAssembler.assembleExpr(ExprAssembler.java:345)
at coldfusion.compiler.ExprAssembler.cast(ExprAssembler.java:1342)
at coldfusion.compiler.StmtAssembler.cast(StmtAssembler.java:406)
... and so it goes ...

I've run into this problem before; but now, it's happening much more frequently and is preventing people from posting comments to the blog. To ease the burden on the compiler, I've decided to extract my anti-spam blacklist into its own ColdFusion component that can compile data files down into Java Pattern objects. This keeps the matching logic extremely simple and factors the regular expression patterns out into a single place that can be easily updated.

I've created this SpamAnalyzer.cfc as a project on GitHub.

To instantiate the SpamAnalyzer.cfc, you have to provide it with three file paths:

  • User name file path.
  • User URL file path.
  • User content file path.

Each file should contain Java-compatible regular expressions (more robust than ColdFusion regular expressions). The ColdFusion component expects one pattern per line and will automatically trim each pattern and turn on the ignore-case flag.
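
To make the loading step concrete, here is a minimal sketch in plain Java, since the patterns end up as java.util.regex.Pattern objects anyway. This is not the actual SpamAnalyzer.cfc code: the class name PatternLoader and its two methods are hypothetical, and skipping blank lines and reading the file as UTF-8 are assumptions on my part.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; illustrates "one pattern per line, trimmed, ignore-case".
class PatternLoader {

    // Compile each line into a case-insensitive Pattern.
    static List<Pattern> compileLines(List<String> lines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // Assumption: blank lines are ignored.
            }
            patterns.add(Pattern.compile(trimmed, Pattern.CASE_INSENSITIVE));
        }
        return patterns;
    }

    // Read a pattern file (assumed UTF-8) and compile its lines.
    static List<Pattern> loadFile(String filePath) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(filePath), StandardCharsets.UTF_8);
        return compileLines(lines);
    }
}
```

Calling PatternLoader.loadFile() once per file (names, URLs, content) would give three independent lists of compiled patterns, so the compile cost is paid at load time rather than on every comment.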

Right now, I am only analyzing names, URLs, and content (i.e., comments). I shy away from analyzing IP addresses and email addresses as those don't feel like they can be "analyzed".

I have purposely broken the analysis up into three different files because each context contains its own rules. For example, I definitely don't want a user's name to contain the word "ghostwriter"; however, I can't make a hard rule that a user's content shouldn't contain such a phrase. Therefore, each context - name, URL, and content - gets its own set of patterns.

Likewise, each context gets its own analysis method:

  • analyzeUserContent( userContent )
  • analyzeUserName( userName )
  • analyzeUserUrl( userUrl )

Each of these methods returns a "spam report" that has the following keys:

  • isSpam - True/False indicating result of analysis.
  • inputType - Either "userContent", "userName", or "userUrl".
  • input - The user-submitted value being analyzed.
  • pattern - The regular expression pattern that caused the content to be flagged as spam.
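
As a rough illustration of how such a report could be produced, here is a sketch in plain Java. It is a stand-in for the idea, not the component's real implementation: the class SpamAnalyzerSketch is hypothetical, and three behaviors are assumptions (patterns are tested with find() rather than a full match, checking stops at the first matching pattern, and the pattern key is an empty string when nothing matches).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical Java stand-in; not the actual SpamAnalyzer.cfc code.
class SpamAnalyzerSketch {

    private final List<Pattern> namePatterns;
    private final List<Pattern> urlPatterns;
    private final List<Pattern> contentPatterns;

    // Each context (name, URL, content) gets its own set of patterns.
    SpamAnalyzerSketch(List<Pattern> namePatterns, List<Pattern> urlPatterns, List<Pattern> contentPatterns) {
        this.namePatterns = namePatterns;
        this.urlPatterns = urlPatterns;
        this.contentPatterns = contentPatterns;
    }

    Map<String, Object> analyzeUserContent(String userContent) {
        return analyze("userContent", userContent, contentPatterns);
    }

    Map<String, Object> analyzeUserName(String userName) {
        return analyze("userName", userName, namePatterns);
    }

    Map<String, Object> analyzeUserUrl(String userUrl) {
        return analyze("userUrl", userUrl, urlPatterns);
    }

    // Build the report; record the first pattern that matches, if any.
    private static Map<String, Object> analyze(String inputType, String input, List<Pattern> patterns) {
        Map<String, Object> report = new LinkedHashMap<>();
        report.put("isSpam", false);
        report.put("inputType", inputType);
        report.put("input", input);
        report.put("pattern", ""); // Assumption: empty when nothing matched.

        for (Pattern pattern : patterns) {
            if (pattern.matcher(input).find()) {
                report.put("isSpam", true);
                report.put("pattern", pattern.pattern());
                break; // Assumption: stop at the first offending pattern.
            }
        }
        return report;
    }
}
```

With "ghostwriter" in the name patterns only, analyzeUserName() flags a name containing that word and reports the pattern, while analyzeUserContent() lets the same word through in a comment body.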

The nice thing about this, when compared to my massive, compound CFIF statement, is that I will know which pattern caused the user input to be flagged as Spam. Blacklisting is an "art". And, there's no doubt that I get it wrong from time to time. Now, when a "real" user complains that they are being blocked, I'll quickly be able to determine which pattern is causing their submission to fail.

If you want, you can check out the project on GitHub. I'll be updating the pattern files as the battle continues on my blog.

After reading this, you may be wondering why I don't use some external SPAM API. Simple - for control. I want control over how this stuff works (and I want to be able to fix it when it breaks).




Reader Comments

@James,

I had not heard of that before (at least not that I remember - it looks like it was released a few years ago). Based upon a cursory glance, it looks like the intent is similar in that it also uses regular expression patterns to analyze data. That said, it seems Steve's implementation is tied more to a database and mine is tied more to plain text files.

But, high-level, they appear very similar.

One thing I'd like to do is allow the patterns to be passed into the constructor directly. This way, you can rely on file paths IF you want; or, you can load them from some unknown source and then just pass them in.

Wow Ben, really nice and informative topic. I was searching for a spam-filtering technique like the one you shared in this post, and I will try to convert it from ColdFusion to PHP.
Thanks,
Wisdomsol.net

Slight detour to this thread:

It looks like you're mainly concerned with automated spam.

The most successful (and easiest) way that I've found to prevent this kind of submission is to put a CSS hidden field in the form and on the server side, make sure that field is blank upon submission.

HTML:

  <input type="text" style="display:none;" name="city">

CFM:

  <cfif len(form.city) eq 0>
    <!--- Not spam - submit form --->
  </cfif>

Automated bots seem to just fill out every form field because they're not sure which ones are required. This is completely hidden from the user and works nearly 100% of the time.

@Aaron,

Word up, I believe that is known as the "honey pot" approach. I definitely believe in that whole-heartedly! I suggest a combination of approaches since spammers seem to be unstoppable :D

So I'm confused. Is Akismet no longer a good enough spam detector for comments? I use Akismet for my site and client sites and have never had a comment spam issue. Plus, there's tons of plugins to increase site security and other related stuff.

Akismet never ceases to amaze me. It's scary looking at the logs to see how many times my sites are hit up each day.

Also "Sexy Peter." That's funny. LOL.

Off subject, has anyone noticed the insane Facebook spam going on lately? In particular, Facebook only seems to suggest friends that are fake accounts. It's obvious they're fake because they are all pictures of super hot girls whom I don't know.

@Farah,

I've noticed it too. Also, a lot more spam seems to be getting through the Gmail filter.

Hi there, this blog is very interesting; thanks for sharing the information. We can't predict when evils like spam will come to our inbox.
