Ben Nadel
On User Experience (UX) Design, JavaScript, ColdFusion, Node.js, Life, and Love.
Ben Nadel at Scotch On The Rock (SOTR) 2010 (London) with: Claude Englebert
Ben Nadel at Scotch On The Rock (SOTR) 2010 (London) with: Claude Englebert@cfemea )

MySQL Truncating Value At First Supplementary / Astral / utf8mb4 Character On INSERT And UPDATE

By Ben Nadel on

As I've been embracing [one of] my roles as the Database Administrator's understudy at InVision App, I've been trying hard to track down every vexing problem relating to our MySQL RDS instances. For months, we've been having a problem where data, coming out of a Node.js microservice, was being truncated on the first character that required the utf8mb4 character set. While I don't yet understand why our ColdFusion microservices and our Node.js microservices are handling this differently (I suspect the Java driver is just better than the Node.js driver), I think I finally figured out why Node.js is able to insert corrupted data: we're using a lenient "sql_mode" setting.

For context, this is for data going into a database table that only supports the utf8 character set, not the utf8mb4 character set (which requires up to four bytes for each code-point, using a pair of high / low surrogate characters). When the ColdFusion MySQL 5 driver goes to insert data that contains supplementary / astral plane characters into this table, it errors out with a message like:

Error Executing Database Query. Incorrect string value: '\xF0\x9F\x98\x8D, ...' for column 'value' at row 1.

Now, on the other hand, when the Node.js MySQL driver goes to insert the same data, there is no error. The INSERT succeeds; but, the data becomes corrupted, truncating the input at the same character that would have caused an error in the ColdFusion MySQL driver.

NOTE: The fact that we have two "microservices" accessing the same database is not lost on me; it makes me sad.

I don't know why the ColdFusion driver works better than the Node.js driver (except for the fact that ColdFusion is amazing). It could be something in the connection string. But, even when I have what I believe to be the same connection string in both drivers, the problem still persists.

That said, after much head-banging, experimentation, and Googling, I finally came across the blog post, "Data Corruption To Go: The Perils Of sql_mode = NULL," by Keyur Govande. This post cued me in on the concept of "sql_mode," which I had never heard of before. SQL modes affect the SQL syntax that MySQL supports and the data validation checks that it performs during mutation events.

By default, MySQL uses a sql_mode named "NO_ENGINE_SUBSTITUTION". This is a non-strict mode in which problematic data produces "warnings" not "erros". This is what one of our MySQL RDS instances is using. And, it's the reason data is being corrupted. To see this in action, we can set the current session to use NO_ENGINE_SUBSTITUTION and then try to insert problematic data:

  • -- This mode produces warnings, not errors for certain problematic
  • -- operations.
  • SET SESSION sql_mode = 'NO_ENGINE_SUBSTITUTION';
  •  
  • -- Try to insert data that requires the utf8mb4 character set (into
  • -- a table that only supports utf8).
  • INSERT INTO
  • char_test
  • SET
  • id = 100,
  • value = 'ha ha <?> you crazy!'
  • ;
  • show warnings;
  •  
  • -- Check to see what value was inserted.
  • SELECT
  • value
  • FROM
  • char_test
  • WHERE
  • id = 100
  • ;

When we run the above SQL, we get the following output:


 
 
 

 
 sql_mode NO_ENGINE_SUBSTITUTION 
 
 
 

As you can see, the inserted value is being truncated on the first occurrence of a character that requires utf8mb4. And, if run:

  • SHOW WARNINGS;

... we get the following:

Warning 1300 - Invalid utf8 character string: 'F09F98'
Warning 1366 -Incorrect string value: '\xF0\x9F\x98\x81 y...' for column 'value' at row 1

In the non-strict "NO_ENGINE_SUBSTITUTION" mode, we get warnings about the problem; but, that doesn't prevent the INSERT for taking place.

Now, if we switch over to "STRICT_ALL_TABLES" mode, which rejects invalid data, we start to see the desired error behavior:

  • -- This mode produces warnings, not errors for certain problematic
  • -- operations.
  • SET SESSION sql_mode = 'STRICT_ALL_TABLES';
  •  
  • -- Try to insert data that requires the utf8mb4 character set (into
  • -- a table that only supports utf8).
  • INSERT INTO
  • char_test
  • SET
  • id = 200,
  • value = 'ha ha <?> you crazy!'
  • ;
  •  
  • -- Check to see what value was inserted.
  • SELECT
  • value
  • FROM
  • char_test
  • WHERE
  • id = 200
  • ;

When we run the above SQL, we get the following output:


 
 
 

 
 sql_mode STRICT_ALL_TABLES 
 
 
 

As you can see, this time, the INSERT fails completely and throws the error:

Error : Incorrect string value: '\xF0\x9F\x98\x81 y...' for column 'value' at row 1

This is definitely what we want. Even if the underlying problem is really that our data table has problematic character set support, we'd rather have the statements fail entirely rather than "looking" like they worked and quietly resulting in corrupted data.

I don't know what the broader implications of changing the sql_mode may be (I'm just reading up on all of this for the first time). I suspect that this can only result in better application behavior. But, I'll definitely be reviewing it with the team before we take any action. As a final note, I would highly suggest reading Keyur Govande's blog post as it talks about a number of other problems that non-strict SQL modes can cause.




Reader Comments

@All,

Note: my blog doesn't support utf8mb4 characters either :D So, my SQL snippets are using <?> instead of the actual emoticon literal values.

Reply to this Comment

Post A Comment

You — Get Out Of My Dreams, Get Into My Comments
Live in the Now
Oops!
Comment Etiquette: Please do not post spam. Please keep the comments on-topic. Please do not post unrelated questions or large chunks of code. And, above all, please be nice to each other - we're trying to have a good conversation here.