Detecting Spam In User-Submitted Content With SpamAnalyzer.cfc

Posted September 25, 2012 at 9:34 AM by Ben Nadel

Tags: ColdFusion

Fighting SPAM is a never-ending battle. My blog uses a "blacklist" approach that checks user-submitted content against a set of regular expression patterns. For the last 6 years, this has been implemented as a compound CFIF statement containing REFindNoCase() function calls. To date, this single CFIF statement has close to 2,000 OR operators. Lately, this massive statement has been causing Stack Overflow errors in the ColdFusion compiler:

StackTrace: java.lang.StackOverflowError
at coldfusion.compiler.ExprNode.subexpr(ExprNode.java:39)
at coldfusion.compiler.ExprAssembler.assembleExpr(ExprAssembler.java:345)
at coldfusion.compiler.ExprAssembler.cast(ExprAssembler.java:1342)
at coldfusion.compiler.StmtAssembler.cast(StmtAssembler.java:406)
at coldfusion.compiler.ExprAssembler.assembleExpr(ExprAssembler.java:187)
at coldfusion.compiler.ExprAssembler.assembleExpr(ExprAssembler.java:345)
at coldfusion.compiler.ExprAssembler.cast(ExprAssembler.java:1342)
at coldfusion.compiler.StmtAssembler.cast(StmtAssembler.java:406)
... and so it goes ...

I've run into this problem before; but now, it's happening much more frequently and is preventing people from posting comments to the blog. To ease the burden on the compiler, I've decided to extract my anti-spam blacklist into its own ColdFusion component that can compile data files down into Java Pattern objects. This makes the logic extremely simple and factors out the regular expression patterns into a single place that can be easily updated.

I've created this SpamAnalyzer.cfc as a project on GitHub.

To instantiate the SpamAnalyzer.cfc, you have to provide it with three file paths:

  • User name file path.
  • User URL file path.
  • User content file path.

Each file should contain Java-compatible regular expressions (more robust than ColdFusion regular expressions). The ColdFusion component expects one pattern per line and will automatically trim each pattern and turn on the ignore-case flag.
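To make the "one pattern per line" idea concrete, here is a minimal sketch of the compile step. It's written in plain Java rather than ColdFusion, since the patterns end up as java.util.regex.Pattern objects anyway. This is not the actual SpamAnalyzer.cfc source: the class name, the method names, and the choice to skip blank lines are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; illustrates the idea only, not the real component.
class PatternFileCompiler {

    // Reads a pattern file (one regular expression per line) and compiles it.
    static List<Pattern> compileFile(Path path) throws IOException {
        return compileLines(Files.readAllLines(path, StandardCharsets.UTF_8));
    }

    // Trims each line and compiles it with the ignore-case flag turned on.
    static List<Pattern> compileLines(List<String> lines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Assumption: blank lines are skipped rather than compiled.
            if (trimmed.isEmpty()) {
                continue;
            }
            patterns.add(Pattern.compile(trimmed, Pattern.CASE_INSENSITIVE));
        }
        return patterns;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = compileLines(
            Arrays.asList("  ghost\\s*writer  ", "", "cheap\\s+pills"));
        System.out.println(patterns.size()); // prints 2
        System.out.println(patterns.get(0).matcher("Best GhostWriter").find()); // prints true
    }
}
```

Since the patterns are compiled once at instantiation, each request only has to loop over pre-built Pattern objects instead of asking the compiler to digest one enormous expression.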

Right now, I am only analyzing names, URLs, and content (i.e., comments). I shy away from analyzing IP addresses and email addresses as those don't feel like they can be "analyzed".

I have purposely broken the analysis up into three different files because each context contains its own rules. For example, I definitely don't want a user's name to contain the word "ghostwriter"; however, I can't make a hard rule that a user's content shouldn't contain such a phrase. Therefore, each context - name, URL, and content - gets its own set of patterns.

Likewise, each context gets its own analysis method:

  • analyzeUserContent( userContent )
  • analyzeUserName( userName )
  • analyzeUserUrl( userUrl )

Each of these methods returns a "spam report" that has the following keys:

  • isSpam - True/False indicating result of analysis.
  • inputType - Either "userContent", "userName", or "userUrl".
  • input - The user-submitted value being analyzed.
  • pattern - The regular expression pattern that caused the content to be flagged as spam.
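The three methods and the report keys listed above can be sketched roughly as follows. Again, this is plain Java and not the component's real code: the class name, the constructor that takes lists of rule strings, the use of a Map for the report, and the empty string used for "pattern" when nothing matches are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of per-context analysis; not the actual SpamAnalyzer.cfc.
class SpamAnalyzerSketch {

    private final List<Pattern> namePatterns;
    private final List<Pattern> urlPatterns;
    private final List<Pattern> contentPatterns;

    // Each context (name, URL, content) gets its own set of rules.
    SpamAnalyzerSketch(List<String> nameRules, List<String> urlRules, List<String> contentRules) {
        this.namePatterns = compile(nameRules);
        this.urlPatterns = compile(urlRules);
        this.contentPatterns = compile(contentRules);
    }

    Map<String, Object> analyzeUserContent(String userContent) {
        return analyze("userContent", userContent, contentPatterns);
    }

    Map<String, Object> analyzeUserName(String userName) {
        return analyze("userName", userName, namePatterns);
    }

    Map<String, Object> analyzeUserUrl(String userUrl) {
        return analyze("userUrl", userUrl, urlPatterns);
    }

    private static List<Pattern> compile(List<String> rules) {
        List<Pattern> patterns = new ArrayList<>();
        for (String rule : rules) {
            patterns.add(Pattern.compile(rule.trim(), Pattern.CASE_INSENSITIVE));
        }
        return patterns;
    }

    // Builds the report; stops at the first pattern that matches the input.
    private static Map<String, Object> analyze(String inputType, String input, List<Pattern> patterns) {
        Map<String, Object> report = new LinkedHashMap<>();
        report.put("isSpam", false);
        report.put("inputType", inputType);
        report.put("input", input);
        report.put("pattern", ""); // Assumption: empty string when nothing matched.
        for (Pattern pattern : patterns) {
            if (pattern.matcher(input).find()) {
                report.put("isSpam", true);
                report.put("pattern", pattern.pattern());
                break;
            }
        }
        return report;
    }

    public static void main(String[] args) {
        SpamAnalyzerSketch analyzer = new SpamAnalyzerSketch(
            Arrays.asList("ghostwriter"),
            Arrays.asList("free-casino"),
            Arrays.asList("cheap\\s+pills"));
        // "ghostwriter" is blocked in a name but allowed in a comment.
        System.out.println(analyzer.analyzeUserName("Ghostwriter Services"));
        System.out.println(analyzer.analyzeUserContent("I hired a ghostwriter once."));
    }
}
```

Because the loop records the first matching pattern, the report can say exactly which rule flagged a submission, which is something a single compound boolean expression can't do.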

The nice thing about this, when compared to my massive, compound CFIF statement, is that I will know which pattern caused the user input to be flagged as Spam. Blacklisting is an "art". And, there's no doubt that I get it wrong from time to time. Now, when a "real" user complains that they are being blocked, I'll quickly be able to determine which pattern is causing their submission to fail.

If you want, you can check out the project on GitHub. I'll be updating the pattern files as the battle continues on my blog.

After reading this, you may be wondering why I don't use some external SPAM API? Simple - for control. I want control over how this stuff works (and I want to be able to fix it when it breaks).




Reader Comments

Sep 25, 2012 at 9:43 AM // reply »
36 Comments

Bananas


Sep 25, 2012 at 9:47 AM // reply »
11,241 Comments

@Joshua,

Ha ha - "bananas" will certainly always be OK in my book!


Sep 25, 2012 at 10:25 AM // reply »
49 Comments

So I'm not allowed to rename myself "Sexy Peter" without getting branded a spammer? :(


Sep 25, 2012 at 10:35 AM // reply »
11,241 Comments

@Peter,

True - but you'll always be "sexy peter" to us!


Sep 25, 2012 at 11:57 AM // reply »
46 Comments

Had you reviewed or tried SebTools SpamFilter.CFC before?
http://www.bryantwebconsulting.com/blog/index.cfm/SpamFilter

If so, how does it compare? Thanks.


Sep 25, 2012 at 1:03 PM // reply »
11,241 Comments

@James,

I had not heard of that before (at least not that I remember - it looks like it was released a few years ago). Based upon a cursory glance, it looks like the intent is similar in that it also uses regular expression patterns to analyze data. That said, it seems Steve's implementation is tied more to a database and mine is tied more to plain text files.

But, high-level, they appear very similar.

One thing I'd like to do is allow the patterns to be passed into the constructor directly. This way, you can rely on file paths IF you want; or, you can load them from some unknown source and then just pass them in.


Oct 16, 2012 at 8:55 AM // reply »
1 Comments

Wao Ben really nice and informative topic, I was really searching for spam filtering technique you shared in this post, I will try to convert this technique from ColdFusion to PHP.
Thanks,
Wisdomsol.net


Oct 18, 2012 at 12:23 PM // reply »
28 Comments

Slight detour to this thread:

It looks like you're mainly concerned with automated spam.

The most successful (and easiest) way that I've found to prevent this kind of submission is to put a CSS hidden field in the form and on the server side, make sure that field is blank upon submission.

HTML:

  <input type="text" style="display:none;" name="city">

CFM:

  <cfif len(form.city) eq 0>
    <!--- Not spam - submit form --->
  </cfif>

Automated bots seem to just fill out every form field because they're not sure which ones are required. This is completely hidden from the user and works nearly 100% of the time.


Oct 27, 2012 at 5:56 PM // reply »
11,241 Comments

@Aaron,

Word up, I believe that is known as the "honey pot" approach. I definitely believe in that whole-heartedly! I suggest a combination of approaches since spammers seem to be unstoppable :D


Dec 5, 2012 at 1:19 PM // reply »
1 Comments

So I'm confused. Is Akismet no longer a good enough spam detector for comments? I use Akismet for my site and client sites and have never had a comment spam issue. Plus, there's tons of plugins to increase site security and other related stuff.

Akismet never ceases to amaze me. It's scary looking at the logs to see how many times my sites are hit up each day.

Also "Sexy Peter." That's funny. LOL.

Off subject, has anyone noticed the insane Facebook spam going on lately? In particular, Facebook only seems to suggest friends that are fake accounts. It's obvious they're fake because they are all pictures of super hot girls whom I don't know.


Dec 5, 2012 at 1:26 PM // reply »
28 Comments

@Farah,

I've noticed it too. Also, a lot more spam seems to be getting through the Gmail filter.


Dec 24, 2012 at 1:48 AM // reply »
1 Comments

Hi there, this blog is very interesting thanks for sharing information. we can't predict when the evils like spam will come to our inbox.

