Detecting Spam In User-Submitted Content With SpamAnalyzer.cfc

Posted September 25, 2012 at 9:34 AM by Ben Nadel

Tags: ColdFusion

Fighting SPAM is a never-ending battle. My blog uses a "blacklist" approach that checks user-submitted content against a set of regular expression patterns. For the last 6 years, this has been implemented as a compound CFIF statement containing REFindNoCase() function calls. To date, this single CFIF statement has close to 2,000 OR operators. Lately, this massive statement has been causing Stack Overflow errors in the ColdFusion compiler:

StackTrace: java.lang.StackOverflowError
at coldfusion.compiler.ExprNode.subexpr(ExprNode.java:39)
at coldfusion.compiler.ExprAssembler.assembleExpr(ExprAssembler.java:345)
at coldfusion.compiler.ExprAssembler.cast(ExprAssembler.java:1342)
at coldfusion.compiler.StmtAssembler.cast(StmtAssembler.java:406)
at coldfusion.compiler.ExprAssembler.assembleExpr(ExprAssembler.java:187)
at coldfusion.compiler.ExprAssembler.assembleExpr(ExprAssembler.java:345)
at coldfusion.compiler.ExprAssembler.cast(ExprAssembler.java:1342)
at coldfusion.compiler.StmtAssembler.cast(StmtAssembler.java:406)
... and so it goes ...

I've run into this problem before; but now, it's happening much more frequently and is preventing people from posting comments to the blog. To ease the burden on the compiler, I've decided to extract my anti-spam blacklist into its own ColdFusion component that can compile data files down into Java Pattern objects. This makes the logic extremely simple and factors out the regular expression patterns into a single place that can be easily updated.

I've created this SpamAnalyzer.cfc as a project on GitHub.

To instantiate the SpamAnalyzer.cfc, you have to provide it with three file paths:

  • User name file path.
  • User URL file path.
  • User content file path.

Each file should contain Java-compatible regular expressions (more robust than ColdFusion regular expressions). The ColdFusion component expects one pattern per line and will automatically trim each pattern and turn on the ignore-case flag.
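To make the file format concrete, here is a minimal Java sketch of the idea: read one pattern per line, trim it, and compile it with the ignore-case flag. This is not the code inside SpamAnalyzer.cfc. The class name, the method names, and the choice to skip blank lines are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; the real component is written in ColdFusion.
class PatternLoaderSketch {

    // Compile one pattern per line, trimming whitespace and ignoring case.
    // Skipping blank lines is an assumption of this sketch.
    static List<Pattern> compileLines(List<String> lines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            patterns.add(Pattern.compile(trimmed, Pattern.CASE_INSENSITIVE));
        }
        return patterns;
    }

    // Read a pattern file (assumed to be UTF-8) and compile its lines.
    static List<Pattern> loadFile(Path path) throws IOException {
        return compileLines(Files.readAllLines(path, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Made-up sample patterns standing in for the contents of a data file.
        List<Pattern> patterns = compileLines(
            Arrays.asList("  ghost\\s*writer  ", "", "\\bcheap\\b"));
        System.out.println(patterns.size());
        System.out.println(patterns.get(0).matcher("Ghost Writer Pro").find());
    }
}
```

Compiling the patterns once, up front, keeps the per-request work down to a loop over precompiled Pattern objects instead of one enormous boolean expression.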

Right now, I am only analyzing names, URLs, and content (i.e. comments). I shy away from analyzing IP addresses and email addresses as those don't feel like they can be "analyzed".

I have purposely broken the analysis up into three different files because each context contains its own rules. For example, I definitely don't want a user's name to contain the word "ghostwriter"; however, I can't make a hard rule that a user's content shouldn't contain such a phrase. Therefore, each context - name, URL, and content - gets its own set of patterns.

Likewise, each context gets its own analysis method:

  • analyzeUserContent( userContent )
  • analyzeUserName( userName )
  • analyzeUserUrl( userUrl )

Each of these methods returns a "spam report" that has the following keys:

  • isSpam - True/False indicating result of analysis.
  • inputType - Either "userContent", "userName", or "userUrl".
  • input - The user-submitted value being analyzed.
  • pattern - The regular expression pattern that caused the content to be flagged as spam.
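The report shape above can be sketched in Java as follows. This is an illustration under assumptions, not the component's actual implementation: the class and method names are hypothetical, the first matching pattern wins, a partial match (find) is enough to flag the input, and the pattern key holds an empty string when nothing matches.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of a per-context analysis that returns a "spam report".
class SpamReportSketch {

    // Test the input against each pattern and report the first one that matches.
    static Map<String, Object> analyze(String inputType, String input, List<Pattern> patterns) {
        Map<String, Object> report = new LinkedHashMap<>();
        report.put("isSpam", false);
        report.put("inputType", inputType);
        report.put("input", input);
        report.put("pattern", ""); // Assumption: empty string when nothing matched.
        for (Pattern pattern : patterns) {
            if (pattern.matcher(input).find()) {
                report.put("isSpam", true);
                report.put("pattern", pattern.pattern());
                break;
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Each context gets its own pattern list; these patterns are made up.
        List<Pattern> namePatterns = Arrays.asList(
            Pattern.compile("ghostwriter", Pattern.CASE_INSENSITIVE));
        List<Pattern> contentPatterns = Arrays.asList(
            Pattern.compile("buy\\s+cheap\\s+pills", Pattern.CASE_INSENSITIVE));

        // The same word is flagged in a name but allowed in a comment body.
        System.out.println(analyze("userName", "Best Ghostwriter", namePatterns));
        System.out.println(analyze("userContent", "I hired a ghostwriter once.", contentPatterns));
    }
}
```

Because the matching pattern travels with the report, a false positive can be traced back to a single line in a single pattern file.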

The nice thing about this, when compared to my massive, compound CFIF statement, is that I will know which pattern caused the user input to be flagged as Spam. Blacklisting is an "art". And, there's no doubt that I get it wrong from time to time. Now, when a "real" user complains that they are being blocked, I'll quickly be able to determine which pattern is causing their submission to fail.

If you want, you can check out the project on GitHub. I'll be updating the pattern files as the battle continues on my blog.

After reading this, you may be wondering why I don't use some external SPAM API. Simple - for control. I want control over how this stuff works (and I want to be able to fix it when it breaks).




Reader Comments

Sep 25, 2012 at 9:43 AM
36 Comments

Bananas


Sep 25, 2012 at 9:47 AM
11,246 Comments

@Joshua,

Ha ha - "bananas" will certainly always be OK in my book!


Sep 25, 2012 at 10:25 AM
49 Comments

So I'm not allowed to rename myself "Sexy Peter" without getting branded a spammer? :(


Sep 25, 2012 at 10:35 AM
11,246 Comments

@Peter,

True - but you'll always be "sexy peter" to us!


Sep 25, 2012 at 11:57 AM
46 Comments

Had you reviewed or tried SebTools SpamFilter.CFC before?
http://www.bryantwebconsulting.com/blog/index.cfm/SpamFilter

If so, how does it compare? Thanks.


Sep 25, 2012 at 1:03 PM
11,246 Comments

@James,

I had not heard of that before (at least not that I remember - it looks like it was released a few years ago). Based upon a cursory glance, it looks like the intent is similar in that it also uses regular expression patterns to analyze data. That said, it seems Steve's implementation is tied more to a database and mine is tied more to plain text files.

But, high-level, they appear very similar.

One thing I'd like to do is allow the patterns to be passed into the constructor directly. This way, you can rely on file paths IF you want; or, you can load them from some unknown source and then just pass them in.


Oct 16, 2012 at 8:55 AM
1 Comments

Wao Ben really nice and informative topic, I was really searching for spam filtering technique you shared in this post, I will try to convert this technique from ColdFusion to PHP.
Thanks,
Wisdomsol.net


Oct 18, 2012 at 12:23 PM
28 Comments

Slight detour to this thread:

It looks like you're mainly concerned with automated spam.

The most successful (and easiest) way that I've found to prevent this kind of submission is to put a CSS hidden field in the form and on the server side, make sure that field is blank upon submission.

HTML:

  • [input type="text" style="display:none;" name="city"]

CFM:

  • [cfif len(form.city) eq 0]
  • [!--- Not spam - submit form ---]
  • [/cfif]

Automated bots seem to just fill out every form field because they're not sure which ones are required. This is completely hidden from the user and works nearly 100% of the time.


Oct 27, 2012 at 5:56 PM
11,246 Comments

@Aaron,

Word up, I believe that is known as the "honey pot" approach. I definitely believe in that whole-heartedly! I suggest a combination of approaches since spammers seem to be unstoppable :D


Dec 5, 2012 at 1:19 PM
1 Comments

So I'm confused. Is Akismet no longer a good enough spam detector for comments? I use Akismet for my site and client sites and have never had a comment spam issue. Plus, there's tons of plugins to increase site security and other related stuff.

Akismet never ceases to amaze me. It's scary looking at the logs to see how many times my sites are hit up each day.

Also "Sexy Peter." That's funny. LOL.

Off subject, has anyone noticed the insane Facebook spam going on lately? In particular, Facebook only seems to suggest friends that are fake accounts. It's obvious they're fake because they are all pictures of super hot girls whom I don't know.


Dec 5, 2012 at 1:26 PM
28 Comments

@Farah,

I've noticed it too. Also, a lot more spam seems to be getting through the Gmail filter.


Dec 24, 2012 at 1:48 AM
1 Comments

Hi there, this blog is very interesting; thanks for sharing information. We can't predict when evils like spam will come to our inbox.

