Introduction To Regular Expressions Presentation Notes

By Ben Nadel

Published 2007-01-05 in ColdFusion, JavaScript / DHTML, Work — Comments (21)

I am giving this presentation today at my company, Nylon Technology. We have a bi-weekly staff meeting at which we have a general knowledge share and one employee always gives some sort of presentation. I am not going to lie to you, I give most of the presentations and contribute the most to knowledge share; most people are just not as jazzed about this stuff as I am.

Anyway, I have never posted my presentation notes before so I thought I would try that today. I can do that for future presentations if anyone finds this useful??? A lot of this stuff is probably review and much of it has already been covered on my blog, so this might be a waste of time for Kinky Solutions fans, so let me know. Thanks.

.... onto the presentation notes......

Note: This introduction is not ColdFusion specific. Regular expressions are used in many programming languages such as ColdFusion, Java, and Javascript (all of which we use on a regular basis). While most of this is applicable to all languages, please realize that each language has its own set of regular expression nuances.

Also realize that regular expressions are very complicated and a lot of their functionality is NOT covered in this introduction.

Regular expressions are patterns used for matching and / or replacing parts of a character string. These patterns can be of a set length, such as the substring "ab", or they can be of a dynamic length.

By default, regular expressions are greedy. This means that they will attempt to match the longest string possible even if a shorter match has already been found. Regular expressions can be flagged for non-greedy searching, which will be covered later.

A pattern can be composed of character literals such as "a", "b", "c", "1", "2", "3", etc.

Pattern: "a"
Match: Anna Banana
Description: Match the letter "a".

Pattern: "an"
Match: Anna Banana
Description: Match the character sequence "an".

Notice that in the above matches, the letter "a" does NOT match the letter "A" in the target string. Regular expressions are case sensitive by default but can be flagged as being case insensitive (which is covered later).
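As a quick sanity check, here is a minimal sketch of the literal matches above using Python's re module. Python is my assumption here (the presentation itself is aimed at ColdFusion, Java, and JavaScript), and the variable names are just for illustration.

```python
import re

text = "Anna Banana"

# Literal patterns match exactly the characters they spell out.
single_a = re.findall(r"a", text)
sequence_an = re.findall(r"an", text)

# Matching is case sensitive by default, so "a" never matches the "A" in "Anna".
capital_a = re.findall(r"A", text)

print(single_a)     # ['a', 'a', 'a', 'a']
print(sequence_an)  # ['an', 'an']
print(capital_a)    # ['A']
```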

Character Selectors / Operators

A pattern of character literals is nice, but not that useful on its own. To make it more useful, we can make the regular expression pattern more dynamic by asking it to match on variable length substrings. The following operators, when placed AFTER a pattern sequence, determine how many times that pattern sequence will be matched.

* : Zero or more instances

Pattern: "a*"
Match: .A.n.n.a .B.a.n.a.n.a
Description: Match the letter "a" zero or more times in a row. Be careful when matching zero times - crazy results! (Thanks Gus). I have put "." in the zero length matches.

+ : One or more instances

Pattern: "a+"
Match: Anna Banana
Description: Match the letter "a" one or more times in a row.

? : Zero or one instance

Pattern: "https?"
Match: http://www.bennadel.com https://www.bennadel.com
Description: Match the character sequence "http" followed by zero or one "s".

{N} : Matches exactly N instances

Pattern: "n{2}"
Match: Anna Banana
Description: Match the character sequence consisting of exactly two "n".

{N,} : Matches N or more instances

Pattern: "w{2,}"
Match: http://www.wicked-hot-workouts.com
Description: Match the character sequence consisting of two or more "w".

Note: The pattern {1,} is the same as "+".

{M,N} : Matches M or more instances but no more than N instances

Pattern: "w{2,3}"
Match: http://www.wicked-hot-workouts.com
Description: Match the character sequence consisting of at least 2 "w" but no more than 3 "w".

Note: The pattern {0,1} is the same as "?".
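To see the quantifiers in action, here is a rough sketch in Python's re module (my choice of language, not something from the presentation; other engines may report the empty "a*" matches slightly differently). It runs the same patterns against the same target strings used above.

```python
import re

text = "Anna Banana"
urls = "http://www.bennadel.com https://www.bennadel.com"
site = "http://www.wicked-hot-workouts.com"

one_or_more = re.findall(r"a+", text)       # ['a', 'a', 'a', 'a']
zero_or_one = re.findall(r"https?", urls)   # ['http', 'https']
exactly_two = re.findall(r"n{2}", text)     # ['nn']
two_or_more = re.findall(r"w{2,}", site)    # ['www']
two_to_three = re.findall(r"w{2,3}", site)  # ['www']

# "a*" also succeeds where there is no "a" at all, producing empty matches.
zero_or_more = re.findall(r"a*", text)
non_empty = [m for m in zero_or_more if m]  # ['a', 'a', 'a', 'a']
has_empty = "" in zero_or_more              # True
```

The empty strings in zero_or_more are the zero-length matches that make "zero or more" tricky, so "+" is usually the safer choice when you actually need at least one character.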

| : OR Operator

This is not exactly like the other selectors, but I wasn't sure where else to bring it up. The "|" will match on whatever is to the left of the pipe OR to the right of the pipe.

Pattern: "a|nna"
Match: Anna Banana
Description: Match the character sequence consisting of "a" OR "nna".

Notice that the "|" is matching on the entire string left OR right of the "|". If you only want to "OR" part of a string, you have to group the characters that you want to "OR" (more on grouping will be mentioned later):

Pattern: "(a|n)na"
Match: Anna Banana
Description: Match the character sequence consisting of "a" OR "n" followed by the character sequence "na".
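Here is a small sketch of the difference between the two OR patterns, again assuming Python's re module (not one of the languages the presentation targets). I use finditer with group(0) so that the whole match is reported even when the pattern contains a group.

```python
import re

text = "Anna Banana"

# Without grouping, the "|" splits the ENTIRE pattern: "a" OR "nna".
whole_alternation = [m.group(0) for m in re.finditer(r"a|nna", text)]

# With grouping, only "a" OR "n" is alternated, and "na" must always follow.
grouped_alternation = [m.group(0) for m in re.finditer(r"(a|n)na", text)]

print(whole_alternation)    # ['nna', 'a', 'a', 'a']
print(grouped_alternation)  # ['nna', 'ana']
```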

Greedy vs. Non-Greedy Matching

As I stated before, regular expressions are, by default, greedy. That means that they will match the largest possible string even if smaller substrings have already been matched. In order to get a regular expression pattern to be non-greedy, simply put a "?" after the sequence operator.

For example, the pattern "w+?" will match the letter "w" at least one time, but will break up a string of "w"s into individual matches rather than one big match. This can be hard to see unless you are doing something to the match. Let's take a look at the difference in matching groups.

Pattern: "n+"
Match: Anna Banana
Description: Match the character sequence consisting of at least one "n" in a greedy fashion.

The greedy pattern above produces three matches: "nn" (in "Anna"), then "n" and "n" (both in "Banana"). Now, let's compare that to a non-greedy search. For clarity, I will break the matches up onto separate lines.

Pattern: "n+?"
Match: "n" (the first "n" in "Anna")
Match: "n" (the second "n" in "Anna")
Match: "n" (the first "n" in "Banana")
Match: "n" (the second "n" in "Banana")
Description: Match the character sequence consisting of at least one "n" in a non-greedy fashion.

Notice that the double "n" in "Anna" is broken up into two different matches rather than one match as in the greedy expression.
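The greedy and non-greedy results can be compared side by side with a short sketch. This assumes Python's re module, which uses the same "?" suffix for non-greedy matching; the names are only illustrative.

```python
import re

text = "Anna Banana"

# Greedy: each match grabs as many "n" characters as it can.
greedy = re.findall(r"n+", text)

# Non-greedy: each match stops as soon as the pattern is satisfied.
non_greedy = re.findall(r"n+?", text)

print(greedy)      # ['nn', 'n', 'n']
print(non_greedy)  # ['n', 'n', 'n', 'n']
```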

Character Classes

If you don't want to match a specific sequence of characters such as "ab", character classes allow you to match on a type of character instead. The following are some of the character classes available in regular expressions.

\s = White space character (space, tab, return, new line).
\S = NOT white space character.

Pattern: "\s+"
Match: Anna Bananan
Description: Match the character sequence consisting of one or more white space characters.

Pattern: "\S+"
Match: Anna Bananan
Description: Match the character sequence consisting of one or more non-white-space characters.

\w = Word characters (alpha-numeric characters and the underscore).
\W = NOT word characters.

Pattern: "\w+"
Match: Anna Bananan
Description: Match the character sequence consisting of one or more word characters.

Pattern: "\W+"
Match: Anna Bananan
Description: Match the character sequence consisting of one or more non-word characters.

\d = Digit (0 - 9).
\D = NOT digit.

Pattern: "\d+"
Match: (212) 691-1134
Description: Match the character sequence consisting of one or more digit characters.

Pattern: "\D+"
Match: (212) 691-1134
Description: Match the character sequence consisting of one or more non-digit characters.

\b = Word boundary
\B = NOT word boundary
Word boundaries are zero-length matches. They do not match actual characters but rather the position between a word character and a non-word character (or the start or end of the text).

Pattern: "\b\w+\b"
Match: Anna Banana
Description: Match the character sequence consisting of one or more word characters delimited by word boundaries.

. = Wild card
Be careful, this has different "defaults" in different regular expression implementations; in many of them, "." does not match line break characters unless a flag is set.

Pattern: ".+"
Match: Anna Banana
Description: Match the character sequence consisting of one or more characters.

Notice that most of the character classes are defined as the "\" and another character. This is because the "\" is a special character in regular expressions. If you want to match the "\" literal in a string (such as in a file path), you need to escape it. Special characters (including the "\") can be escaped by preceding them with a "\". So, to match on the "\" literal, your pattern would be "\\".
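To make the character classes a bit more concrete, here is a minimal sketch assuming Python's re module (my pick for illustration; ColdFusion, Java, and JavaScript each have their own quirks). The file path is a made-up example, and re.DOTALL is Python's name for the flag that lets "." match line breaks.

```python
import re

text = "Anna Banana"
phone = "(212) 691-1134"
path = r"C:\temp\notes.txt"  # hypothetical path, only here to show escaping

whitespace = re.findall(r"\s+", text)      # [' ']
non_space = re.findall(r"\S+", text)       # ['Anna', 'Banana']
words = re.findall(r"\w+", text)           # ['Anna', 'Banana']
digits = re.findall(r"\d+", phone)         # ['212', '691', '1134']
non_digits = re.findall(r"\D+", phone)     # ['(', ') ', '-']
bounded = re.findall(r"\b\w+\b", text)     # ['Anna', 'Banana']

# In Python, "." stops at line breaks unless the DOTALL flag is set.
per_line = re.findall(r".+", "Anna\nBanana")               # ['Anna', 'Banana']
across_lines = re.findall(r".+", "Anna\nBanana", re.DOTALL)  # ['Anna\nBanana']

# A literal backslash has to be escaped as "\\" in the pattern.
backslash_count = len(re.findall(r"\\", path))  # 2
```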

Character Sets

When you want to match a specific set of characters but you are not concerned with the sequence in which they are matched, you can use a character set. A character set is defined by an opening and a closing bracket.

For example, the character set "[aeiouy]" matches vowel characters. Sequence selectors / operators are applied to this just as they are applied to character literals. "[aeiouy]+" will match one or more vowels in a row (due to the "+" selector). Now, while this will match one or more vowels, there is nothing inherent to a character set that defines the order of characters. "[aeiouy]+" will match "aaaa" but it will also match "yuoiea".

Character sets can contain characters and character spans. "[a]" will match any lower case "a", but "[a-z]" will match any lower case letter between "a" and "z". Character spans do not have to be full spans; for example, you can span "a-j" instead of the complete "a-z" span.

[a-z] = Matches all lower case letters.
[A-Z] = Matches all upper case letters.
[0-9] = Matches all digits.

Character sets can be a mixture of just about anything including character literals, character classes, and character spans.

[a-z0123] = Matches all lower case letters and the digits 0, 1, 2, 3.
[\w\W] = Matches all word and NOT word characters (this will match EVERYTHING).
[0-37-9] = Matches the characters 0, 1, 2, 3, 7, 8, 9 (excluding 4, 5, 6).

A character set can also be composed of the characters that you do NOT want to match. If your character set begins with "^", this signifies that you want to match characters that are NOT in the set. For example, "[^aeiouy]+" will match any character sequence that does NOT contain a vowel.

Pattern: "[a-z]+"
Match: http://www.wicked-hot-workouts.com/
Description: Match the character sequence consisting of one or more lower case letters.

Pattern: "[a-j]+"
Match: http://www.wicked-hot-workouts.com/
Description: Match the character sequence consisting of one or more lower case letters in the span "a" to "j".

Pattern: "[a-z:/=.-]+"
Match: http://www.wicked-hot-workouts.com/
Description: Match the character sequence consisting of one or more lower case letters or any of the following ":", "/", "=", ".", "-".

The "[" is a reserved character in regular expressions meant to denote the beginning of a character set. If you want to use the "[" literal in a pattern, you must escape it using the "\" as in "\[".
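Here is a rough sketch of the character set examples, assuming Python's re module (not one of the presentation's languages, but the bracket syntax is the same). The "a[0]" string at the end is a made-up input to show escaping the "[".

```python
import re

url = "http://www.wicked-hot-workouts.com/"

vowels = re.findall(r"[aeiouy]+", "yuoiea")       # ['yuoiea']
lower = re.findall(r"[a-z]+", url)
partial_span = re.findall(r"[a-j]+", url)         # ['h', 'ic', 'ed', 'h', 'c']
with_punctuation = re.findall(r"[a-z:/=.-]+", url)

# Spans and literals can be mixed; 4, 5, and 6 are left out here.
some_digits = re.findall(r"[0-37-9]+", "0123456789")  # ['0123', '789']

# A leading "^" negates the set. Note that "A" is not a LOWER case vowel.
not_vowels = re.findall(r"[^aeiouy]+", "Anna Banana")  # ['Ann', ' B', 'n', 'n']

# The "[" literal must be escaped.
open_bracket = re.findall(r"\[", "a[0]")          # ['[']

print(lower)             # ['http', 'www', 'wicked', 'hot', 'workouts', 'com']
print(with_punctuation)  # ['http://www.wicked-hot-workouts.com/']
```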

Grouping Character Sequences

A sequence of characters can be grouped using the parentheses "(" and ")". Grouping is used when you need to make back-references either during a pattern match or during some sort of replace function.

When grouping, each created group is given an index starting with 1. Each group is determined by its open parenthesis going from left to right regardless of nesting. See the following example to see group numbering:

Pattern: "(A(nn)a) (B(an)ana)"
Group 1: Anna
Group 2: nn
Group 3: Banana
Group 4: an

NOTE: You can create groups that do not capture (and so cannot be back-referenced), but, in my opinion, they are not that useful and are beyond the introductory nature of this presentation.

The groups can be referenced using the notation "\N" where N is the group index. (Some languages use "$" instead of "\" for group references, typically in the replacement text of a replace function). You can only reference a group that has already been matched.

Pattern: "(an)\1"
Match: Anna Banana
Description: Match the group, consisting of the character sequence "an", followed by a duplicate match of the previous group.

Note that in the above example, the "\1" is referring to the group "(an)". Therefore, this will find "anan" (the sequence "an" followed by a repeat of itself).

You can use standard selectors / operators on grouped sequences the same way you can on characters or character sets. Just place the selector after the close parenthesis:

Pattern: "(an){2}"
Match: Anna Banana
Description: Match the group, consisting of the character sequence "an", exactly two times.

Note that in the above example, only ONE match is made "anan", not two matches of "an". This accomplishes the same thing as the previous example, "(an)\1".
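The group numbering and back-references can be sketched like this. I am assuming Python's re module, where "\N" works both inside the pattern and in the replacement text; the word-swapping replace at the end is my own illustration, not something from the presentation.

```python
import re

text = "Anna Banana"

# Groups are numbered by their opening parenthesis, left to right.
numbered = re.search(r"(A(nn)a) (B(an)ana)", text)
groups = numbered.groups()  # ('Anna', 'nn', 'Banana', 'an')

# "\1" repeats whatever group 1 just matched.
back_reference = re.search(r"(an)\1", text).group(0)  # 'anan'

# A quantifier after the closing parenthesis applies to the whole group.
repeated_group = re.search(r"(an){2}", text).group(0)  # 'anan'

# Groups can also be referenced in the replacement text of a replace.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", text)  # 'Banana Anna'
```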

Text Boundaries

By default, a regular expression pattern will match any part of a string that fits into the pattern. You can, however, tell the pattern that it has to match in certain areas of the target text. The "^", when placed at the beginning of a pattern, signifies that the match must start at the beginning of the target text. A "$" placed at the end of a pattern signifies that the match must end at the end of the target text.

Pattern: "^\w{3}"
Match: Anna Banana
Description: Match the character sequence consisting of three word characters at the beginning of the target text.

Pattern: "\w{3}$"
Match: Anna Banana
Description: Match the character sequence consisting of three word characters at the end of the target text.

By using both the "^" and the "$", the regular expression pattern MUST match the entire target text.

Pattern: "^[\w\s]+$"
Match: Anna Banana
Description: Match the entire target text against the character sequence consisting of one or more word and / or space characters.

This is beyond the scope of the introduction, but if you flag the pattern for "multiline" matching, these text boundaries apply to each line of text (as delimited by a line break) as being the target text.
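Here is a minimal sketch of the text boundaries, assuming Python's re module, where the multiline flag is spelled re.MULTILINE. The hyphenated and two-line strings are made-up inputs for contrast.

```python
import re

text = "Anna Banana"

starts_with = re.findall(r"^\w{3}", text)  # ['Ann']
ends_with = re.findall(r"\w{3}$", text)    # ['ana']

# With both "^" and "$", the whole target text has to fit the pattern.
whole_ok = re.search(r"^[\w\s]+$", text) is not None            # True
whole_bad = re.search(r"^[\w\s]+$", "Anna-Banana") is not None  # False

# In multiline mode, "^" also matches at the start of each line.
two_lines = "Anna\nBanana"
first_only = re.findall(r"^\w+", two_lines)                 # ['Anna']
every_line = re.findall(r"^\w+", two_lines, re.MULTILINE)   # ['Anna', 'Banana']
```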

Case Sensitivity

By default, a regular expression is case sensitive. You can, however, flag the pattern as being case insensitive by starting the pattern off with the flag "(?i)". This flag (which is one of many available flags) should be the FIRST thing in the pattern; some implementations accept it elsewhere, and some (such as JavaScript, which uses /pattern/i) set it outside of the pattern entirely. Notice the difference in the following patterns:

Pattern: "[a-z]+"
Match: Anna Banana
Description: Match the character sequence consisting of one or more lower case letters.

Pattern: "(?i)[a-z]+"
Match: Anna Banana
Description: Match the character sequence consisting of one or more letters without case sensitivity.
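A quick sketch of the case sensitivity difference, assuming Python's re module, which accepts the "(?i)" flag at the start of a pattern and also offers the equivalent re.IGNORECASE option:

```python
import re

text = "Anna Banana"

# Case sensitive: the capital letters are skipped.
sensitive = re.findall(r"[a-z]+", text)

# Case insensitive via the inline flag at the start of the pattern.
inline_flag = re.findall(r"(?i)[a-z]+", text)

# The same thing, with the flag passed outside of the pattern.
outside_flag = re.findall(r"[a-z]+", text, re.IGNORECASE)

print(sensitive)     # ['nna', 'anana']
print(inline_flag)   # ['Anna', 'Banana']
print(outside_flag)  # ['Anna', 'Banana']
```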

Reserved And Special Characters

As I mentioned before, there are several reserved characters such as the "(", "[", "\", "^", "$", (among others) that are not treated as character literals. In order to use them as character literals, you must escape them by preceding them with a "\".

For example, if you wanted to check for a valid phone number that uses "(" and ")", you would have to escape the parentheses:

Pattern: "\(\d{3}\) (\d{3})-(\d{4})"
Match: (212) 691-1134
Description: Match the sequence consisting of the character literal "(" followed by three digits followed by the character literal ")" followed by a space followed by the GROUP consisting of three digits followed by a dash followed by a GROUP consisting of four digits.

Note that in the above example, the first set of parentheses is escaped and the second and third are not. The second and third sets of parentheses are therefore used to group the character sequences.
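The phone number pattern can be tried out with a short sketch. This assumes Python's re module; re.escape is Python's helper for escaping special characters automatically, which other languages may or may not offer under a different name.

```python
import re

phone = "(212) 691-1134"

# Escaped parentheses are literals; unescaped ones create groups.
match = re.search(r"\(\d{3}\) (\d{3})-(\d{4})", phone)

whole = match.group(0)   # '(212) 691-1134'
parts = match.groups()   # ('691', '1134')

# re.escape puts the "\" in front of special characters for you.
escaped = re.escape("(212)")  # '\\(212\\)'
```

Only two groups come back because the area code sits between escaped (literal) parentheses rather than inside a group.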

Words of Wisdom

Regular expressions are totally amazing, and once you learn about them, you might tend to see everything in patterns. Be careful NOT to get carried away. Sometimes, having two different regular expressions is easier to write and maintain (than one larger, more complicated one). Also, regular expressions do come with some parsing and processing overhead; a longer, more complicated regular expression is probably going to take longer to execute than two shorter ones.

Short link: https://bennadel.com/458

Reader Comments

Gus Jan 5, 2007 at 8:25 AM


Ben,

You have a small error in this section:

* : Zero or more instances

Pattern: "a*"
Match: Anna Banana
Description: Match the letter "a" zero or more times in a row.

This will actually match just before the capital 'A' in Anna Banana. You always have to be careful with matching 'Zero or more' because the zero part will often match where you don't expect it.

Don't know if it will come through, but here is some code to test:

<cfoutput>
#s#
<p>
becomes
<p>
#result#
</cfoutput>

Gus

Ben Nadel Jan 5, 2007 at 8:44 AM


Gus,

Nice catch! I must have not tested that one properly. Rock on.

Peter Bell Jan 5, 2007 at 10:54 AM


Very Cool! Please keep posting these, and now I know EXACTLY who to bug when I have a tricky RegEx to figure out :->

Ben Nadel Jan 5, 2007 at 11:25 AM


No problem. I love regular expressions. I am no voodoo, black magic mad man like Dinowitz, but I can do a thing or two ;) Let me know if you have any questions.

Phillip Senn Jan 23, 2007 at 9:23 AM


Remind me in 10-11 months to nominate you for a cfEmmy.

Ben Nadel Jan 23, 2007 at 9:27 AM


Awww shucks :D

Let me know if there is any other CF-Emmy material / demos that I could write for a nomination ;)

Steven Levithan Feb 1, 2007 at 12:55 AM


Ben, nice to find another CF developer who's in love with regular expressions. I've added your blog to my feed aggregator.

Ben Nadel Feb 1, 2007 at 7:29 AM


Steve,

Yeah, regular expressions are awesome. You should check out a recent post I did on using regular expressions to find URLs in a block of text:

www.bennadel.com/index.cfm?dax=blog:487.view

Anyway, let me know if you would like to see any demos of anything or need some help with any expressions. I love working on this stuff.

Steven Levithan Apr 3, 2007 at 9:32 PM


Just wrote a regex-related post of my own which I think is kind of cool.... and I felt like sharing (spamming), since you're probably the only other regex aficionado I know. :-)

Faking Atomic Groups: http://badassery.blogspot.com/2007/04/faking-atomic-groups.html

Ben Nadel Apr 3, 2007 at 10:07 PM


@Steve,

Please, post-away! You have the crazy RegEx skills and I would be doing others a disservice by preventing your link ;)

The stuff you wrote looks very cool. I just had to look up (again) what possessive quantifiers were. I can never seem to remember. I think what I need to do is really try them out in an example that will make them stick like a burr to my mental socks.

On another note, not so long ago, I referred someone to your site who had a very tough RegEx question that I could not answer. Not sure if they ever made it to you.

Daniel Harvey Oct 12, 2009 at 10:21 AM


I am trying to use the java regex to be able to match the innermost cf comments. Would you have any idea for this. All my attempts have failed horribly.

Daniel Harvey Oct 12, 2009 at 11:37 AM


Nevermind I have figured out a way to solve my issue

Ben Nadel Oct 15, 2009 at 2:05 PM


@Daniel,

That's always a hard thing. You can make sure that there are no starting comments within the current comment. Maybe something like this with a negative look ahead:

).)*--->

Daniel Harvey Oct 15, 2009 at 2:16 PM


That is similar to what I am using now and thanks for working on it I found it to be very interesting in my attempts to do it.

Here is what I am using with the reason why it is different than yours, but overall same idea.

).)*+--->

Without the s, wouldn't it just look on the current line? And I don't need any capturing groups, thus I have the "?s:".

And using the possessive quantifier *+ instead of greedy should make it more efficient.

And a big thanks to Peter Boughton who helped me come up with this solution.

Ben Nadel Oct 15, 2009 at 2:22 PM


@Daniel,

I am not sure what the "?s:" part is. I know that "?:" will prevent the capture, as you say. And, certainly, I'm not against the possessive quantifier.

OH! The ?s is the "single line" flag. You know, I've never used that in the middle of an expression before. And, I didn't realize that you could combine that with the non-capturing group notation as well. VERY COOL!

Daniel Harvey Oct 15, 2009 at 2:25 PM


Ya that is all it is. It works great. I believe it will allow you to make it so only certain groups will have the single line flag

Ben Nadel Oct 15, 2009 at 2:31 PM


@Daniel,

Yeah, that's exactly what it does. I read about it, but never tried it. Very cool - gives me something to play around with. Thanks again.