Normalizing 0xA0 (No-Break Space) And Other Special Characters Within ColdFusion Form Posts

Published 2022-02-05 in ColdFusion

Yesterday, I was trying to clean up some formatting in my comments data-table when I noticed that a lot of comments contained a funky character, <0xA0>. I looked this up in the Unicode Character Table and it turns out to be a No-Break Space. Apparently, some text editors will just randomly inject this character? Well, I don't want this character in my comments. And, for that matter, I don't want other special characters like "smart quotes" and "bullets" either. As such, I took some time to make the form scope pre-processing in my ColdFusion 2021 blogging platform a bit more robust.

Last month, I announced that you can now post comments on this blog via an email reply. As part of that feature enhancement, I had to do a bunch of plain-text parsing on the inbound email body coming from PostMark. That plain-text parsing performed a lot of "text normalization" in order to make subsequent parsing easier. But, it also happens to overlap heavily with the type of normalization that I want to do on my ColdFusion form posts.

So, I took the "normalization" portion of that ColdFusion component and factored it out into a new component that I could then reuse in multiple contexts: TextNormalization.cfc. It has public methods for normalizing different classes of special characters:

normalizeBullets()
normalizeDashes()
normalizeDoubleQuotes()
normalizeLineEndings()
normalizeSingleQuotes()
normalizeSpaces()

It also has a method that applies all of the normalization methods to a given value:

component {

	/**
	* I apply all the normalization methods to the given value and return the result.
	*/
	public string function normalizeText( required string value ) {

		var result = trim( value );
		result = normalizeLineEndings( result );
		result = normalizeSpaces( result );
		result = normalizeDoubleQuotes( result );
		result = normalizeSingleQuotes( result );
		result = normalizeDashes( result );
		result = normalizeBullets( result );

		return( result );

	}

}
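
One detail worth double-checking is the order of operations: normalizeText() calls trim() before normalizeSpaces(). Here's a minimal sketch in Java (the platform underneath ColdFusion) of why that order can matter. It assumes that ColdFusion's trim() treats the No-Break Space the way Java's String.trim() does, which is worth verifying on your own engine. The class and method names are made up for the demo, and the stand-in normalizeSpaces() only handles the No-Break Space.

```java
class TrimOrderDemo {

    // Stand-in for normalizeSpaces(); only the No-Break Space is handled here.
    static String normalizeSpaces(String value) {
        return value.replaceAll("\\u00a0", " ");
    }

    // Same order as normalizeText(): trim first, then normalize.
    static String trimThenNormalize(String value) {
        return normalizeSpaces(value.trim());
    }

    // Alternative order: normalize first, then trim.
    static String normalizeThenTrim(String value) {
        return normalizeSpaces(value).trim();
    }

    public static void main(String[] args) {
        // The input ends with a No-Break Space, which String.trim() leaves alone.
        String input = "hello\u00a0";
        System.out.println("[" + trimThenNormalize(input) + "]"); // prints "[hello ]"
        System.out.println("[" + normalizeThenTrim(input) + "]"); // prints "[hello]"
    }
}
```

If your engine behaves the same way, swapping the order (or trimming a second time at the end) would catch values that start or end with a special space.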

With this ColdFusion component in place, I then loop over the form scope at the top of each request using the onRequestStart() event-handler in my Application.cfc ColdFusion application framework component:

component {

	/**
	* I get called once at the start of each incoming ColdFusion request.
	*/
	public void function onRequestStart() {

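		// NOTE: textNormalization is assumed to be an instance of TextNormalization.cfc
		// that was created elsewhere (its setup is not shown in this post).
		for ( var key in form ) {

			// Only simple values are normalized; anything complex is left as-is.
			if ( isSimpleValue( form[ key ] ) ) {

				form[ key ] = textNormalization.normalizeText( form[ key ] );

			}

		}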

	}

}

The actual implementation of the TextNormalization.cfc ColdFusion component uses Regular Expression replacements. Each pattern combines the verbose flag, (?x), with a character class so that I can list each character on its own line, with a comment, and still replace all of them in a single replacement call. Here is my implementation:

/**
* The site - https://unicode-table.com/ - is great for looking up Unicode values.
*/
component
	output = false
	hint = "I provide methods for normalizing special characters within text values."
	{

	// --
	// PUBLIC METHODS.
	// --

	/**
	* I replace special bullets with the standard asterisk.
	*/
	public string function normalizeBullets( required string value ) {

		return(
			jreReplace(
				value,
				"(?x)[
					\u2022  ## Bullet.
					\u2023  ## Triangular Bullet.
					\u2043  ## Hyphen Bullet.
					\u2219  ## Bullet Operator.
					\u25aa  ## Black Small Square Emoji.
					\u25cb  ## White Circle.
					\u25cf  ## Black Circle.
					\u25e6  ## White Bullet.
				]",
				"*"
			)
		);

	}


	/**
	* I replace like-sized dashes with standard dashes.
	*/
	public string function normalizeDashes( required string value ) {

		return(
			jreReplace(
				value,
				"(?x)[
					\u2013  ## En Dash.
					\u2212  ## Minus Sign.
				]",
				"-"
			)
		);

	}


	/**
	* I replace "smart double quotes" with standard double quotes.
	*/
	public string function normalizeDoubleQuotes( required string value ) {

		return(
			jreReplace(
				value,
				"(?x)[
					\u201c  ## Left Double Quotation Mark.
					\u201d  ## Right Double Quotation Mark.
					\u201e  ## Double Low-9 Quotation Mark.
					\u201f  ## Double High-Reversed-9 Quotation Mark.
					\u275d  ## Heavy Double Turned Comma Quotation Mark Ornament.
					\u275e  ## Heavy Double Comma Quotation Mark Ornament.
					\u2e42  ## Double Low-Reversed-9 Quotation Mark.
					\u301d  ## Reversed Double Prime Quotation Mark.
					\u301e  ## Double Prime Quotation Mark.
					\u301f  ## Low Double Prime Quotation Mark.
					\uff02  ## Fullwidth Quotation Mark.
				]",
				""""
			)
		);

	}


	/**
	* I convert all the line-breaks to NewLine characters.
	*/
	public string function normalizeLineEndings( required string value ) {

		return( jreReplace( value, "\r\n?", chr( 10 ) ) );

	}


	/**
	* I replace "smart single quotes" with standard single quotes.
	*/
	public string function normalizeSingleQuotes( required string value ) {

		return(
			jreReplace(
				value,
				"(?x)[
					\u2018  ## Left Single Quotation Mark.
					\u2019  ## Right Single Quotation Mark.
					\u201a  ## Single Low-9 Quotation Mark.
					\u201b  ## Single High-Reversed-9 Quotation Mark.
					\u275b  ## Heavy Single Turned Comma Quotation Mark Ornament.
					\u275c  ## Heavy Single Comma Quotation Mark Ornament.
					\u275f  ## Heavy Low Single Comma Quotation Mark Ornament.
				]",
				"'"
			)
		);

	}


	/**
	* I convert any special spaces to regular spaces.
	*/
	public string function normalizeSpaces( required string value ) {

		return(
			jreReplace(
				value,
				"(?x)[
					\u00a0  ## No-Break Space.
					\u2000  ## En Quad (space that is one en wide).
					\u2001  ## Em Quad (space that is one em wide).
					\u2002  ## En Space.
					\u2003  ## Em Space.
					\u2004  ## Thick Space.
					\u2005  ## Mid Space.
					\u2006  ## Six-Per-Em Space.
					\u2007  ## Figure Space.
					\u2008  ## Punctuation Space.
					\u2009  ## Thin Space.
					\u200a  ## Hair Space.
					\u200b  ## Zero Width Space.
					\u2028  ## Line Separator.
					\u2029  ## Paragraph Separator.
					\u202f  ## Narrow No-Break Space.
					\ufeff  ## Zero Width No-Break Space.
				]",
				" "
			)
		);

	}


	/**
	* I apply all the normalization methods to the given value and return the result.
	*/
	public string function normalizeText( required string value ) {

		var result = trim( value );
		result = normalizeLineEndings( result );
		result = normalizeSpaces( result );
		result = normalizeDoubleQuotes( result );
		result = normalizeSingleQuotes( result );
		result = normalizeDashes( result );
		result = normalizeBullets( result );

		return( result );

	}

	// --
	// PRIVATE METHODS.
	// --

	/**
	* I use Java's Pattern engine to perform a RegEx replace on the given input.
	*/
	private string function jreReplace(
		required string input,
		required string pattern,
		string replacement = ""
		) {

		var result = javaCast( "string", input ).replaceAll(
			javaCast( "string", pattern ),
			javaCast( "string", replacement )
		);

		return( result );

	}

}
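
To see the verbose-flag technique in isolation, here's a small Java sketch, since jreReplace() hands the pattern straight to Java's String.replaceAll(). Java's comments mode skips whitespace and #-comments even inside a character class, which is what makes the one-character-per-line layout possible (the doubled ## in the CFML strings is just ColdFusion's escape for a literal #). The class name and the shortened character list are assumptions for the demo, not part of TextNormalization.cfc.

```java
import java.util.regex.Pattern;

class VerboseClassDemo {

    // A trimmed-down version of the normalizeSpaces() pattern. In comments
    // mode (?x), Java skips whitespace and #-comments, even inside [...].
    static final Pattern SPECIAL_SPACES = Pattern.compile(
        "(?x)[\n" +
        "    \\u00a0  # No-Break Space.\n" +
        "    \\u2009  # Thin Space.\n" +
        "    \\u202f  # Narrow No-Break Space.\n" +
        "]"
    );

    static String normalizeSpaces(String value) {
        return SPECIAL_SPACES.matcher(value).replaceAll(" ");
    }

    // Same expression as normalizeLineEndings(): CRLF or a lone CR becomes LF.
    static String normalizeLineEndings(String value) {
        return value.replaceAll("\\r\\n?", "\n");
    }

    public static void main(String[] args) {
        // Both checks print "true".
        System.out.println(normalizeSpaces("a\u00a0b\u2009c").equals("a b c"));
        System.out.println(normalizeLineEndings("one\r\ntwo\rthree").equals("one\ntwo\nthree"));
    }
}
```

Because the comment on each line runs to the end of that line, the closing bracket has to sit on a line of its own, just as it does in the CFML patterns above.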

ASIDE: If you've been following this blog for any period of time, you'll likely notice that Regular Expressions (RegEx) come up over and over again as a part of the solution space. RegEx and pattern matching are such powerful tools. And this is just a reminder to take the time to learn them. It's perhaps one of the highest ROI (Return on Investment) topics that you can study.

VIDEO PRESENTATION: Regular Expressions, Extraordinary Power »

Now that I have this TextNormalization.cfc ColdFusion component, I can use it in both my onRequestStart() event-handler and my inbound email parser. Which is just another reminder that you shouldn't create abstractions prematurely: wait until you start to see duplication in your code that points to a possible abstraction opportunity.

Am I missing any special characters that people normalize out? If so, please let me know in the comments.

Short link: https://bennadel.com/4199