Stripping Supplementary Characters Out Of A String In Order To Make It MySQL UTF-8 Compatible

By Ben Nadel on April 18, 2016

At work, we have a MySQL table that uses the "utf8" character set. This encoding supports all of the Basic Multilingual Plane (BMP) characters; but, it doesn't support Supplementary characters (ie, those in the astral plane that are represented by high / low surrogate pairs). For that, you need to configure the table data with "utf8mb4" to allow for a maximum of four bytes per character. We could ALTER the table. But, it's a massive table and this task hasn't been "owned" by any of our teams. As such, I wanted to see if I could find a way to strip supplementary characters out of the string in order to make it MySQL "utf8" compatible in the mean time.

Why the hassle? Emoticons.

Emoticons are the most commonly used supplementary characters. And, if you try to insert an emoticon into a table that only supports "utf8", MySQL throws an error. So, for example, if I tried to read in the following file:

String with emoticon / supplementary characters embedded.

... and insert it into a MySQL table that only supported "utf8", MySQL would throw the following error (in a ColdFusion context):

Error Executing Database Query. Incorrect string value: '\xF0\x9F\x98\x8D, ...' for column 'value' at row 1.

In a perfect world, I'd just have a table that uses "utf8mb4". But, this is not a perfect world and I'd rather have a value that can be inserted into the database, even if it's missing a thumbs-up or a smiley face. So, for the time being perhaps I can just replace any supplementary characters with something like, "[?]". It's not the most thrilling user experience (UX); but, it's definitely better than a "500 Server Error."

Before I started looking into this, I really didn't know much of anything about character planes or character encoding. So, I'd like to give a huge shout-out to Mathias Bynens for two of his in-depth articles about UTF-8 encoding in MySQL and JavaScript:

While my approach uses ColdFusion (which is really just using Java 1.7), I wouldn't have been able to get this far without his excellent write-ups.

That said, in the following code, I'm going to read in the above text file, strip out the supplementary characters, and insert the result into the database. The bulk of the work is done by the replaceSupplementaryCodePoints() function which accepts a string and a callback. Internally, the function will iterate over the code-points (each of which may or may not contain a surrogate pair) and invoke the callback for the supplementary characters.

<cfscript>

	// Read in a message that contains supplementary code-points.
	message = trim( fileRead( expandPath( "./messages/story.txt" ), "utf-8" ) );

	// We only have a table that supports utf8 characters, not utf8mb4. As such, we will
	// need to strip out or replace the "supplementary code-points" (characters in the
	// astral plane, such as emoticons) before we insert the message.
	safeMessage = replaceSupplementaryCodePoints(
		message,
		function( codePoint, value ) {

			return( "[?]" );

		}
	);

	// Try to insert the substituted message.
	insertCommand = new Query(
		sql =
		"
			INSERT INTO
				char_test
			SET
				value = ?
			;
		",
		datasource="testing"
	);
	insertCommand.addParam( value = safeMessage, cfSqlType = "cf_sql_varchar" );
	insertCommand.execute();


	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //


	/**
	* I take the given string and replace supplementary / astral plane code-points with
	* the result of the given callback when it is invoked with an individual code-point.
	* Non-supplementary code-points (ie, those in the basic multilingual plane) are kept
	* in the string.
	*
	* @input I am the string in which supplementary code-points are being replaced.
	* @callback I am the function that generates the code-point replacements.
	* @output false
	*/
	public string function replaceSupplementaryCodePoints(
		required string input,
		required function callback
		) {

		// ColdFusion string are already Java Strings. However, this is not necessarily
		// "documented." By casting the input, we can be explicit with the data type that
		// we are trying to use and can safely access "member methods" of the Java class.
		input = javaCast( "string", input );

		// Get the number of chars in the input.
		var charLength = input.length();

		// Get the number of code-points in the input (may not be the same as chars).
		var codePointLength = input.codePointCount( javaCast( "int", 0 ), javaCast( "int", charLength ) );

		// If the char-length and the code-point-length are the same, the input doesn't
		// contain any supplementary code-points. These values will only start to differ
		// when we have a code-point that is a composite of a high and low surrogate
		// character pair. As such, if the two counts are the same, there's nothing that
		// needs to be replaced.
		if ( charLength == codePointLength ) {

			return( input );

		}

		var Character = createObject( "java", "java.lang.Character" );

		// We're going to iterate over each code-point and either keep it or replace
		// it. To reduce the string manipulation, we'll hold each resultant character
		// in an array and then collapse it at the end.
		var result = [];
		var i = 0;

		// When we loop over the string, there is no easy way (at least not in Java 1.7 -
		// Java 1.8 has code-point "streams") to loop over the code-points. Instead, we
		// have to loop over the chars and then increment the index based on the char-
		// length of the code-point at that given location within the input.
		while ( i < charLength ) {

			var codePoint = input.codePointAt( javaCast( "int", i ) );

			// Supplementary / astral plane code-point.
			if ( Character.isSupplementaryCodePoint( javaCast( "int", codePoint ) ) ) {

				var substitution = callback(
					codePoint,
					arrayToList( Character.toChars( javaCast( "int", codePoint ) ), "" )
				);

				arrayAppend( result, substitution );

			// Basic multilingual plane code-point.
			} else {

				arrayAppend( result, chr( codePoint ) );

			}

			// Since some iterations might need to be incremented by one (basic
			// multilingual plane characters) and some iterations might need to be
			// incremented by two (supplementary / astral plane characters), let's just
			// defer to the Character class to tell us how many characters are needed
			// to represent the code-point in question.
			i += Character.charCount( javaCast( "int", codePoint ) );

		}

		return( arrayToList( result, "" ) );

	}

</cfscript>

As you can see, for each invocation of the callback, I'm returning the string, "[?]". The value I end up inserting into the database is the sanitize, "utf8" compatible value:

I was like [?], so I went in for the kiss.
But, she was like [?] so I was like [?].
Then, she went [?] so I started [?].

In my example, I'm using Java 1.7, which doesn't have an easy way to iterate over the code-points. Apparently, Java 1.8 now provides some sort of stream or iterator for code-points; but, I don't yet have those features at my disposal. As such, I have to manually iterate over the string characters and convert them to code-points. And, any time that I come across a supplementary character, I have to increment my loop offset by 2 (since each supplementary character is composed of a high / low surrogate pair of characters). Luckily, Java provides some decent tooling in the String and Character classes around these calculations; so, my loop becomes more a task of orchestration rather than of implementation.

Character encoding continues to be one of those things that I think I get, but I probably don't truly understand. Much like working with Timezones and Daylight Saving Time (DST). Thankfully, ColdFusion is built on top of Java which means I have access to an ocean of functionality that can help me tackle these sort of problems. For now, while it's not the "right" solution, stripping out supplementary characters is the "best" solution that I can come up with until we ALTER our table to support the "utf8mb4" character set.

One day, the people will have their "pile of poo" emoticons - but today is not that day.

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/go/3074

Reader Comments

Giancarlo Gomez Apr 18, 2016 at 8:54 AM

2 Comments

Nice Ben, when I had to do this for CrossTrackr I was so mad that I could not easily do it with RegEx but then I found an article that showed show how do it in PHP. So my hack is a function that basically sends the string to a PHP file using cfhttp and returning back the clean string. Since I use ORM I have a CFC which is extended by others that contains some defaults and I have a cleanEmoji() function which can receive a string and return back the clean version.

The article I believe is this one, as I forgot and just googled a piece of the snippet I have.
https://www.drupal.org/node/2043439

The gist of how I use it
https://gist.github.com/GiancarloGomez/e62928a665ef796464137bafc6cb20ec

Mathias Bynens Apr 18, 2016 at 9:13 AM

1 Comments

Hopefully you can enable

utf8mb4

soon :)

Ben Nadel Apr 18, 2016 at 11:30 AM

15,688 Comments

@Mathias,

Yeah, I'm really hoping so!

Ben Nadel Apr 18, 2016 at 11:32 AM

15,688 Comments

@Giancarlo,

That's quite a workflow :D It's crazy the kind of hoops we jump through when simpler solutions aren't necessarily "available", right?

Giancarlo Gomez Apr 18, 2016 at 1:20 PM

2 Comments

Definitely Ben, I do have utf8mb4 enabled on the tables though, but the only reason I keep the string cleaning function is because I have not been able to make time to work on a solution to allow editing and saving when the data is accessed on a platform / browser that does not support it. When I do though, I may try to blog about it. That alone is quite a task at times so thank you for taking time to blog about things and help the community.

Chris G Apr 19, 2016 at 9:43 PM

208 Comments

would be better ux if you could map specific astral plane characters to something more meaningful and slack-like... for example :thumbs-up: and :simple-smile:

Then, if you ever were successful in upgrading your table to utf8mb4, you could restore the original messages pretty easily.

Ben Nadel Apr 20, 2016 at 4:00 PM

15,688 Comments

@Chris,

Great minds think alike :D I actually created a struct that mapped the common emoticon code-points to the tokens that we use internally:

var tokenMapping = {
	"#inputBaseN( '1F601', '16' )#": ":grin:",
	"#inputBaseN( '1F602', '16' )#": ":joy:",
	"#inputBaseN( '1F603', '16' )#": ":smiley:",
	"#inputBaseN( '1F604', '16' )#": ":grinning:",
	"#inputBaseN( '1F605', '16' )#": ":sweat_smile:",
	"#inputBaseN( '1F606', '16' )#": ":satisfied:",
	"#inputBaseN( '1F609', '16' )#": ":wink:",
	"#inputBaseN( '1F60A', '16' )#": ":blush:",
	"#inputBaseN( '1F60B', '16' )#": ":yum:",
	"#inputBaseN( '1F60C', '16' )#": ":relieved:",
	"#inputBaseN( '1F60D', '16' )#": ":heart_eyes:",
	"#inputBaseN( '1F60F', '16' )#": ":smirk:",
	"#inputBaseN( '1F612', '16' )#": ":unamused:",
	"#inputBaseN( '1F613', '16' )#": ":cold_sweat:",
	"#inputBaseN( '1F614', '16' )#": ":pensive:",
	"#inputBaseN( '1F616', '16' )#": ":confounded:",
	"#inputBaseN( '1F618', '16' )#": ":kissing_heart:",
	"#inputBaseN( '1F61A', '16' )#": ":kissing_closed_eyes:",
	"#inputBaseN( '1F61C', '16' )#": ":stuck_out_tongue_winking_eye:",
	"#inputBaseN( '1F61D', '16' )#": ":stuck_out_tongue_closed_eyes:",
	"#inputBaseN( '1F61E', '16' )#": ":disappointed:",
	"#inputBaseN( '1F620', '16' )#": ":angry:",
	"#inputBaseN( '1F621', '16' )#": ":rage:",
	"#inputBaseN( '1F622', '16' )#": ":cry:",
	"#inputBaseN( '1F623', '16' )#": ":persevere:",
	"#inputBaseN( '1F624', '16' )#": ":triumph:",
	"#inputBaseN( '1F625', '16' )#": ":disappointed_relieved:",
	"#inputBaseN( '1F628', '16' )#": ":fearful:",
	"#inputBaseN( '1F629', '16' )#": ":weary:",
	"#inputBaseN( '1F62A', '16' )#": ":sleepy:",
	"#inputBaseN( '1F62B', '16' )#": ":tired_face:",
	"#inputBaseN( '1F62D', '16' )#": ":sob:",
	"#inputBaseN( '1F630', '16' )#": ":cold_sweat:",
	"#inputBaseN( '1F631', '16' )#": ":scream:",
	"#inputBaseN( '1F632', '16' )#": ":astonished:",
	"#inputBaseN( '1F633', '16' )#": ":flushed:",
	"#inputBaseN( '1F635', '16' )#": ":dizzy_face:",
	"#inputBaseN( '1F637', '16' )#": ":mask:",
	"#inputBaseN( '1F638', '16' )#": ":smile_cat:",
	"#inputBaseN( '1F639', '16' )#": ":joy_cat:",
	"#inputBaseN( '1F63A', '16' )#": ":smiley_cat:",
	"#inputBaseN( '1F63B', '16' )#": ":heart_eyes_cat:",
	"#inputBaseN( '1F63C', '16' )#": ":smirk_cat:",
	"#inputBaseN( '1F63D', '16' )#": ":kissing_cat:",
	"#inputBaseN( '1F63E', '16' )#": ":pouting_cat:",
	"#inputBaseN( '1F63F', '16' )#": ":crying_cat_face:",
	"#inputBaseN( '1F640', '16' )#": ":scream_cat:",
	"#inputBaseN( '1F645', '16' )#": ":no_good:",
	"#inputBaseN( '1F646', '16' )#": ":ok_woman:",
	"#inputBaseN( '1F648', '16' )#": ":see_no_evil:",
	"#inputBaseN( '1F649', '16' )#": ":hear_no_evil:",
	"#inputBaseN( '1F64A', '16' )#": ":speak_no_evil:",
	"#inputBaseN( '1F64B', '16' )#": ":raising_hand:",
	"#inputBaseN( '1F64C', '16' )#": ":raised_hands:",
	"#inputBaseN( '1F64D', '16' )#": ":person_frowning:",
	"#inputBaseN( '1F64E', '16' )#": ":person_with_pouting_face:",
	"#inputBaseN( '1F64F', '16' )#": ":pray:",
	"#inputBaseN( '1F44D', '16' )#": ":thumbsup:",
	"#inputBaseN( '1F44E', '16' )#": ":thumbsdown:",
	"#inputBaseN( '1F44C', '16' )#": ":ok_hand:",
	"#inputBaseN( '1F44F', '16' )#": ":clap:",
	"#inputBaseN( '1F511', '16' )#": ":key:",
	"#inputBaseN( '1F525', '16' )#": ":fire:",
	"#inputBaseN( '1F377', '16' )#": ":wine_glass:",
	"#inputBaseN( '1F446', '16' )#": ":point_up:",
	"#inputBaseN( '1F447', '16' )#": ":point_down:",
	"#inputBaseN( '1F448', '16' )#": ":point_left:",
	"#inputBaseN( '1F449', '16' )#": ":point_right:",
	"#inputBaseN( '1F600', '16' )#": ":grinning:",
	"#inputBaseN( '1F610', '16' )#": ":neutral_face:",
	"#inputBaseN( '1F611', '16' )#": ":expressionless:",
	"#inputBaseN( '1F615', '16' )#": ":confused:",
	"#inputBaseN( '1F617', '16' )#": ":kissing:",
	"#inputBaseN( '1F619', '16' )#": ":kissing_smiling_eyes:",
	"#inputBaseN( '1F61B', '16' )#": ":stuck_out_tongue:",
	"#inputBaseN( '1F61F', '16' )#": ":worried:",
	"#inputBaseN( '1F626', '16' )#": ":frowning:",
	"#inputBaseN( '1F627', '16' )#": ":anguished:",
	"#inputBaseN( '1F62C', '16' )#": ":grimacing:",
	"#inputBaseN( '1F62E', '16' )#": ":open_mouth:",
	"#inputBaseN( '1F62F', '16' )#": ":hushed:",
	"#inputBaseN( '1F634', '16' )#": ":sleeping:",
	"#inputBaseN( '1F636', '16' )#": ":no_mouth:"
};

Was hella tedious :D

Ben Nadel Apr 20, 2016 at 4:02 PM

15,688 Comments

@Giancarlo,

Always a pleasure, good sir :)

Chris G Apr 20, 2016 at 8:16 PM

208 Comments

@Ben

I can only imagine how tedious that must have been. Great work! Having :grin: instead of [?] will be a much better user experience for sure. If you come across an emoticon code-point that doesn't map to a token, what do you do then? One thought is you save the code-point off and replace it with a unique token, look up that code-point against an [Emoji Unicode Table](http://apps.timwhitlock.info/emoji/tables/unicode), then replace the unique token and update your map.

Another thought is you look up the code-point against an API...there must be one, right? If not, maybe we could create one using a tool such as https://wrapapi.com or https://www.import.io/

You could import every emoji I suppose, but most of them probably never get used.

In any case, good stuff. I'm probably rambling at this point, but you got me thinking :)

Ben Nadel Apr 26, 2016 at 7:52 AM

15,688 Comments

@Chris,

If I don't have a mapped token, I end up falling back to the "[?]" notation. So, the code is essentially something like this:

// Start off with fallback replacement only.
var replacement = {
fallback: "[?]"
};

// Append the token-based mappings.
structAppend( replacement, getCodePointEmojiMap() );

// Replace values.
return( replaceSupplementaryCodePoints( comment, replacement ) );

.... in my production version of the code, the "replacement" can be a String, Struct, or Callback. In this case, I'm using a Struct that maps keys to values in the replacement.

I love the idea of having an API for something like this. I am not sure that there is a standard set of "tokens", though, associated with the code-points. That said, that *would be* something nice to have a standard for!

Oh my chickens, this post is old!

Hit me up on Twitter if you want to discuss it further.