Using Regular Expressions (RLIKE) Pattern Matching In MySQL Queries

By Ben Nadel

Published 2016-01-13 in SQL

If you've known me for a while, you know that I'm absolutely in love with Regular Expressions. I use them in ColdFusion and JavaScript all the time. And, the other day, for the very first time, I used them in MySQL. As a celebration of this milestone, I wanted to put together a quick post. This is not intended to be a tutorial, but rather a beacon of hope for anyone who didn't realize that using Regular Expressions in MySQL was even a possibility.

The Regular Expression (REGEXP / RLIKE) support in MySQL is not as robust as it is in other programming languages. And, the Regular Expression engine can't leverage table indices; so, you're basically doing a full set scan for pattern matching. But, they are still very awesome and can make running data reports much easier.

To start with, I wanted to see what kind of performance hit the REGEXP / RLIKE operator had when compared to the less robust, less flexible LIKE operator. While not necessarily a scientific test, I tried to query for email addresses, in a 30-million record table, that ended with "@bennadel.com":

-- When using the LIKE operator, three attempts ran in: 34s, 22s, 22s.
SELECT
	u.id,
	u.email
FROM
	user u
WHERE
	u.email LIKE '%@bennadel.com'
;

-- When using the RLIKE operator, three attempts ran in: 144s, 141s, 140s.
SELECT
	u.id,
	u.email
FROM
	user u
WHERE
	u.email RLIKE '@bennadel\.com$'
;

While neither of these operations (suffix matching) could leverage table indices, the LIKE operator ran 5-6 times faster than the RLIKE / REGEXP operator to find the same data. I don't point this out to say that RLIKE is bad; merely, that it has some overhead and should be used when its the "right" tool for the job - not as a replacement for basic LIKE operations.

That said, here's what some of the possible pattern matching looks like:

SELECT
	-- Using basic character classes and not-operator.
	-- Result: 1
	( 'ben@bennadel.com' RLIKE '[^@]+@[^.](\.[^.]+)+' ) isEmail,

	-- Using optional character matching.
	-- Result: 1
	( 'data.xls' RLIKE '\.xlsx?' ) AS isXLS,

	-- Using digit-matching and variable-length patterns. Unfortunately, RLIKE
	-- doesn't support notations like \d and \s. Instead, you have to use special
	-- character class names.

	-- Result: 0 -- doesn't support \d
	( '212-555-3932' RLIKE '\d{3}-\d{3}-\d{4}' ) AS isTele,

	-- Result: 1
	( '212-555-3932' RLIKE '[[:digit:]]{3}-[[:digit:]]{3}-[[:digit:]]{4}' ) AS isTele2,

	-- Result: 1 -- using support for character ranges.
	( '212-555-3932' RLIKE '[0-9]{3}-[0-9]{3}-[0-9]{4}' ) AS isTele3,

	-- Testing word-boundaries. And, again, since RLIKE doesn't support notations
	-- like \b, you have to use special character class names.

	-- Result: 0 -- doesn't support \b
	( 'You have a warrant?' RLIKE '[[:<:]]war[[:>:]]' ) AS isWar,

	-- Result: 1
	( 'Yo, war is lame.' RLIKE '[[:<:]]war[[:>:]]' ) AS isWar2,

	-- RLIKE is not case sensitive, except when used with binary strings.
	-- Result: 1 (for both tests)
	( 'BEN' RLIKE 'ben' ) AS isBen,
	( 'BEN' RLIKE 'BEN' ) AS isBen2
;

As you can see, there is an unfortunate lack of support for core character class notation in the form of \w, \b, \s, etc. But, you can used named character classes like [:digit:] to make up for that. It's a bit more cumbersome, but it works just the same.

So there you are - Regular Expression power meets relation database power for much sexiness! And while this might not affect much of my day-to-day SQL authoring, I am quite sure that I'll be making use of this in my SQL reporting efforts.

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/2993

Reader Comments

Oh my chickens, this post is old!

Hit me up on LinkedIn if you want to discuss it further.