Regular Expression Finds vs. String Finds
I suggested to Michael Dinowitz last night that the short-circuit evaluation of multiple OR statements (within a CFIF) would execute faster than the short-circuit evaluation of a regular expression with multiple "or" pipes ("|"). For example, I argue that:
<cfif (
	(strValue.indexOf( "Sarah" ) GTE 0) OR
	(strValue.indexOf( "Libby" ) GTE 0) OR
	(strValue.indexOf( "Cindy" ) GTE 0) OR
	(strValue.indexOf( "Katie" ) GTE 0) OR
	(strValue.indexOf( "Julia" ) GTE 0)
	)>
	<!--- Code here. --->
</cfif>
... is going to be faster than:
<cfif REFind( "Sarah|Libby|Cindy|Katie|Julia", strValue )>
	<!--- Code here. --->
</cfif>
Michael disagrees. He says that from a processing perspective, finding a string (as with indexOf()) requires more processing than checking the patterns of a regular expression. I can sort of understand what he is saying; regular expressions search on patterns, so the first letter that excludes a pattern means that none of the rest of the letters get checked. For instance, when searching for the pattern "Sarah", the algorithm will never check for an "a" unless it has found an "S".
This makes sense, but it doesn't make me feel better. It still feels wrong. What happens if you have a pattern where there is one partial match at the beginning of the RegEx and then a full match at the end of the RegEx? The matching algorithm will keep matching on the first part until it breaks, and then it will have to continue matching on the pattern at the end that will fully match it. This just seems too dynamic to be fast.
Then, this morning, I think I put my finger on what made me uncomfortable: the regular expression needs to be evaluated before it can be matched against (I think?). And this is what makes it slower than straight-up string indexing. Think about it this way - let's say you have a million patterns that are OR'ed together using "|" (as in the example above, just much bigger). Then let's say you have a million OR statements (also as in the example above). For the OR statements to run, there is no overhead - you just run one OR statement after another until you find a match. With the regular expression, however, there is overhead: either the regular expression gets compiled somehow or it gets parsed (I ASSUME). If each pattern is 10 characters long and we have a million patterns, that's 10 MILLION characters (not including the pipe characters) that need to get parsed into some sort of regular expression object before matches can be made.
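If I wanted to actually test this, a rough timing harness might look something like the following. This is just a sketch - the iteration count, variable names, and sample string are made up for illustration:

<!--- Sample string with no matching names (hypothetical test value). --->
<cfset strValue = "There are no target names in this string" />

<!--- Time the string-index approach. --->
<cfset intStart = getTickCount() />
<cfloop index="intI" from="1" to="100000">
	<cfif (
		(strValue.indexOf( "Sarah" ) GTE 0) OR
		(strValue.indexOf( "Julia" ) GTE 0)
		)>
		<!--- Match found. --->
	</cfif>
</cfloop>
<cfset intIndexTime = (getTickCount() - intStart) />

<!--- Time the regular expression approach. --->
<cfset intStart = getTickCount() />
<cfloop index="intI" from="1" to="100000">
	<cfif REFind( "Sarah|Julia", strValue )>
		<!--- Match found. --->
	</cfif>
</cfloop>
<cfset intRegexTime = (getTickCount() - intStart) />

Comparing intIndexTime and intRegexTime over a large number of iterations should at least hint at whether the regex parsing overhead is real.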
In summary, if our matching is a best-case scenario in which the first part (of the OR statements or the "|" statements) finds a match, the processing is as follows:
OR Statements:
- Match first string
RegEx "|" Statements:
- Parse entire regular expression
- Match first pattern
Again, I stress that I don't know for a fact that this is how it works. Maybe someone else here can explain how regular expressions are evaluated by ColdFusion? I would be interested in seeing what others have to say. Thanks!
Reader Comments
Personally, I think the two are going to come out pretty similar, but you are right that with a greater number of cases or a large number of runs, the indexOf() method will come out faster, if for no other reason than the parsing / compiling of the regex.
One note I should mention before you go testing this theory: if you are going to use the Java method for the individual string search, use the Java method (String.matches()) for the regex search as well; otherwise, ColdFusion's casting to handle dynamic typing will skew your results.
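For example, something like this (a sketch; note that matches() tests the whole string, hence the ".*" wrappers):

<!--- Apples-to-apples regex test using the underlying Java string. --->
<cfif strValue.matches( ".*(Sarah|Libby|Cindy|Katie|Julia).*" )>
	<!--- Match found. --->
</cfif>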
Something that may help you tell which part of the regex processing is causing the issue (the parsing / compiling, or the actual search itself) is using a compiled regex search (java.util.regex.Pattern and Matcher).
Here is a link to those classes:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/package-summary.html
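Something along these lines (an untested sketch - the pattern is just Ben's example from above):

<!--- Compile the pattern once, then reuse it for each search. --->
<cfset objPattern = createObject( "java", "java.util.regex.Pattern" ).compile(
	"Sarah|Libby|Cindy|Katie|Julia"
	) />

<!--- Only this matching step would repeat; the compile cost is paid once. --->
<cfif objPattern.matcher( javaCast( "string", strValue ) ).find()>
	<!--- Match found. --->
</cfif>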
@Ben:
Just remember there's always a balance between code maintainability and code performance. A number of your entries as of late have been about performance differences between method A and method B. Rarely is a few milliseconds here or there going to make or break your application, but your ability to identify and fix bugs definitely can. (Obviously, there are exceptions to every rule.)
So in cases where performance gain would be marginal, maintainability/readability always wins out for me.
-Dan
@Mike,
I haven't used the String.matches() method much. I will take a look at that. Also, good link to the Pattern / Matcher classes. I am a huge fan of those. I find them so much easier than using the step-wise REFind() with the returned arrays.
@Dan,
I completely agree about maintainability vs. performance. Also, so many of these things have to be done to such huge, unnatural numbers (iterations) to see any difference that there is no practical value to the testing. It's really just theory; more computer science, not so much computer engineering.
As to why I have been doing a lot of performance stuff lately (that really has no great purpose) - I am just pressed for time, and this kind of small, focused testing is easy to fit in. I just like to keep my mind working.
But, in the case of talking to Dinowitz about it, I would almost say that it is important. We were talking about House of Fusion, which gets millions upon millions of hits a month. He estimates that Google is spidering HoF at least once a second. In that kind of scenario, you might want to squeeze every last bit of performance out of it that you can.
Anyway, thanks for the feedback guys. Have a great Thanksgiving.
Hi Dan,
Really good point. For example, almost EVERY OO pattern has a measurable performance hit associated with it, but unless you're doing n+1 queries or other painfully slow things, for most use cases it doesn't matter, as performance isn't as important as most developers think. Maintainability is a much bigger deal.
That said, KNOWING what is quicker and why is just a good piece of knowledge to have. I'm not going to lose much sleep over the performance of these kind of algorithms, but knowing that you should put the most likely case first in a case statement and the performance difference between if/then's and cases is just good background knowledge.
I personally find the tests interesting (similar to a lot of the stuff that Michael D is known for) and I'm glad Ben takes the time to do them so I can snack on interesting tidbits without having to take the time to run the tests myself which (for the reasons you mention) I probably wouldn't bother to!
Another point to consider is how the two approaches can be optimized.
There's huge potential to optimize a regex comprised of a long sequence of literal strings between bar operators. E.g., instead of testing for "Sarah|Sylvia|Samantha|Shania|Shanon|Janel", you could use "S(?:a(?:mantha|rah)|han(?:ia|on)|ylvia)|Janel". It may be much harder to read, but it would greatly improve the regex's speed at failing non-matches. Basically, as the regex went through the string, it would check if the character it's at is an "S", and if not, it would check if it's a "J". If it was neither, the regex would immediately move on and test if the next character was an "S" or "J". With an indexOf() test, the character would have to be compared to the first characters of all six names. Additionally, if an "S" was matched by the regex and the next character was not "a", "h", or "y", the regex would immediately move on, and so on. With an indexOf() match, it would instead have needed to continue down the list of names to test, starting each remaining name from the first character.
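To see the two forms side by side (same matches, just a different speed at failing - a sketch, not a benchmark):

<!--- Both patterns match exactly the same six names. --->
<cfset blnFlat = REFind( "Sarah|Sylvia|Samantha|Shania|Shanon|Janel", strValue ) />
<cfset blnFactored = REFind( "S(?:a(?:mantha|rah)|han(?:ia|on)|ylvia)|Janel", strValue ) />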
The effect of this kind of optimization can only be determined by what you're searching for and through. Consider an easy to demonstrate but silly example:
Say you had the following data:
"zzzzzzzzzz" (10 zs..... but note that if it were 10,000 zs it wouldn't change the following results)
...and you needed to test if any of the following 26 strings were found within it:
"zzzzzzzzza", "zzzzzzzzzb", "zzzzzzzzzc", "zzzzzzzzzd", "zzzzzzzzze", "zzzzzzzzzf", "zzzzzzzzzg", "zzzzzzzzzh", "zzzzzzzzzi", "zzzzzzzzzj", "zzzzzzzzzk", "zzzzzzzzzl", "zzzzzzzzzm", "zzzzzzzzzn", "zzzzzzzzzo", "zzzzzzzzzp", "zzzzzzzzzq", "zzzzzzzzzr", "zzzzzzzzzs", "zzzzzzzzzt", "zzzzzzzzzu", "zzzzzzzzzv", "zzzzzzzzzw", "zzzzzzzzzx", "zzzzzzzzzy", "zzzzzzzzzz"
(9 zs followed by a through z, with the only one that will match our test string [10 zs] coming last)
If you used a series of indexOf() GTE 0 statements, or a regex like "zzzzzzzzza|zzzzzzzzzb|zzzzzzzzzc|...", either way the search would take 260 steps to return a successful match (10 steps to fail each of the first 25 test strings, and another 10 to successfully match the last one).
An optimized regex could return the same result in 10 steps:
zzzzzzzzz[a-z]
Like I said, this example is silly. However, hopefully it gets the point across that using "a million patterns OR'ed together using '|'" would just be absurd.
Since I'm anal, I'll note that of course the pattern zzzzzzzzz[a-z] could be further optimized as z{1,9}[a-z], which would return the same result in just two steps (since the {n,n} operator is greedy).
Steve,
Very cool example. I am confused as to one thing though; the regular expression: z{1,9}[a-z] ... would that be better than just doing z{9}[a-z]? Is a set number of matches {9} less efficient than a greedy search for {1,9}? Thanks!
Ben, yes, that was a brainfart on my part. z{1,9}[a-z] is a completely different pattern. z{9}[a-z] is correct, as you noted.
z{9} is still greedy... z{9}? is lazy (and is of course entirely different from (?:z{9})?, which would be greedy, but optional). When matching an exact number of chars like that, though, greedy/lazy makes no difference to the regex engine... they'd both take one step. On the other hand, there is of course a huge difference between how greedy/lazy ranges are handled, e.g. z{1,9} vs. z{1,9}?.
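If you want to see the greedy/lazy difference in CF terms, here's a quick sketch using REFind()'s returnsubexpressions struct (assuming CF's engine supports lazy interval quantifiers, which it should):

<!--- Against "zzzzz", greedy z{1,4} grabs four z's; lazy z{1,4}? stops at one. --->
<cfset stcGreedy = REFind( "z{1,4}", "zzzzz", 1, true ) />
<cfset stcLazy = REFind( "z{1,4}?", "zzzzz", 1, true ) />
<!--- stcGreedy.len[ 1 ] is 4; stcLazy.len[ 1 ] is 1. --->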
"When patching an exact number of chars" should read, "When matching...".
I don't know how I make so many typos like that, where the char I should've used isn't even close to what I hit on the keyboard.
Steve,
Don't worry about the typos... I am such a bad speller that your typos read perfectly fine in my head ;)
Steve is wrong on the topic of regular expressions here. /S(?:a(?:mantha|rah)|han(?:ia|on)|ylvia)|Janel/ would take almost exactly the same amount of time as /Sarah|Sylvia|Samantha|Shania|Shanon|Janel/. This is because any competent regular expression library compiles the regular expression into an optimized DFA, then uses that DFA to match against strings. The implication of "optimized DFA" is that you have expressed the pattern with the minimum possible number of states and transitions. It's possible to have several different configurations of an optimized DFA, but they will all take the exact same amount of time to process.
The issue of parsing the regular expression doesn't have to be painful. Generally speaking, you can compile the regex in your preprocessing. This way, you do not need to compile it every time you make the search, just once, during program initialization.
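In CF terms, that might look like caching the compiled Pattern object in a shared scope (a hypothetical sketch; real code would want locking around the initialization):

<!--- Compile once at startup, not on every request. --->
<cfif NOT structKeyExists( application, "namePattern" )>
	<cfset application.namePattern = createObject( "java", "java.util.regex.Pattern" ).compile(
		"Sarah|Libby|Cindy|Katie|Julia"
		) />
</cfif>

<!--- The per-request work is just the match itself. --->
<cfset blnFound = application.namePattern.matcher( javaCast( "string", strValue ) ).find() />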
Which is faster: You can imagine cases where the string matching will be faster. If the first item in the list matches, the string matching should be exclusively faster. However, if none of the items match, then the regex should be faster. Here's why:
String matching process: (input: xyzzy)
Try to match first character of Sarah: fail (1 character comparison)
Try to match first character of Libby: fail (1 character comparison)
...
FAIL
(Total cost: five character comparisons.)
Sometimes the string libraries will do clever things like checking the length of the string, then heuristically starting from the back. But it is still a minimum of one comparison per match candidate.
Regex matching process: (input: xyzzy)
Try to look up 'x' in transition table: -> FAIL (1 lookup == 1 char comparison)
FAIL.
As you can see, the DFA based regular expression approach is faster. This will be true in almost every case, and becomes more and more true as the number of alternatives increases. In the optimal case for the string matching, the string matching is only very slightly faster, if at all.
The only potential downside to the regex in this case is memory consumption. The string matching is in-place, but the regex requires the storage of transition and state tables, which can be hefty (256 chars * num_states).
Hope this helps.
@James, with a DFA matching algorithm, the internal process you describe would be correct, but most traditional NFA engines will not compile that regex to a DFA. Some programs will try a DFA match if none of the more advanced features associated with NFAs are required, but this is far from universal, since even with simple regexes like these, the "longest of the leftmost" functionality of DFAs (also required by POSIX) will change the meaning of the regex in ways many users of traditional NFAs would not expect or desire. If what you said were true, projects like list2re ( http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/List.pm ) would not exist.
I don't really have any understanding of what either of you are talking about :( Are you saying that the Pattern.compile() method in Java optimizes the regular expression for me?
Also, Steve, would you mind taking a look at this regex: www.bennadel.com/index.cfm?dax=blog:976.view and let me know if it seems optimal? It's really short, but it's important that this one is performing well, as it needs to do a lot of work. You rock!
We should run an experiment. I did look up some of the implementation details after I posted that comment, and I can see that the "NFA" backtracking solution is different from the DFA-based one I learned in my algorithms class. (Also, it's been a few years since I took that class and the details are fuzzy.)
Here are the results of the experiment (1000000 trials, unoptimized C++, stl string, pcre library):
Baseline (just printing values, no calculation):
real 0m0.203s
user 0m0.204s
sys 0m0.000s
Raw string matching, with no possibility of partial matches (stl string):
real 0m0.721s
user 0m0.708s
sys 0m0.012s
Regex string matching (pcrecpp):
real 0m0.845s
user 0m0.836s
sys 0m0.008s
So, in this case, you are right and I am wrong. :)
I think the takeaway is that it will be very rare for you to come up against a situation where a simple regex will be significantly slower than a list of strings for any general purpose. There are places where it will be faster and more readable, even with the NFA backtracking algorithm. If you are writing the inner loop of a web service that will get hit fifteen thousand times a second, maybe you want to test out both options. :P
@James: Yeah, but many people don't have (easy) access to a DFA regex engine, since most of the popular regex engines (Perl, PCRE, Java, .NET, CF, JavaScript, Ruby, Python, JGsoft, and many others) use traditional NFA engines, for good reason. If you hand-craft a regex for minimal alternation as I've shown above, there will be little to no difference between the performance of the NFA and DFA algorithms in these cases. The result of this NFA optimization becomes more obvious as the number of alternatives increases. The data you're searching through and the particular regex engine you use (which will potentially apply additional optimizations to the matching process) also certainly play a role.
@Ben: There are two fundamentally different regex engine algorithms, commonly called NFA and DFA. Most people are familiar with NFAs, since they have been adopted by most programming languages and other tools due to their much greater expressive power. The most common place to find DFAs is with some *nix command line tools and editors. The advantage of DFAs is that they are generally very fast, and immune to regex efficiency issues. There is no need to hand-craft a regex for efficiency with a DFA, as generally any regexes which match the same text will take the same amount of time to do so. They achieve this magic by keeping track of all possible ongoing matches at once, and always finding the longest possible match (this avoids the inefficiency of backtracking). A small number of tools will try a DFA first and switch to an NFA if more advanced features are used in the regex, and Tcl uses a true hybrid engine which supports many features typically associated with NFAs while avoiding much of the need for hand optimizing. However, the bottom line is that most people don't need to care about how DFAs, Tcl's engine, or POSIX NFAs work, since most modern tools use traditional NFAs.
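One quick way to see the traditional-NFA behavior for yourself (a sketch using CF's engine): against the string "ab", the pattern "a|ab" stops as soon as the first alternative matches, while a POSIX/DFA engine would return the longer "ab".

<cfset stcMatch = REFind( "a|ab", "ab", 1, true ) />
<!--- stcMatch.len[ 1 ] is 1 (just "a"); a longest-match DFA would report 2. --->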
Just as a minor correction to myself after rereading the above... I'd said that in cases like those illustrated here there could potentially be "little to no difference between the performance of NFA and DFA algorithms." Actually, with all but the simplest regexes, the DFA algorithm will be more efficient, since although we can minimize the alternation that NFAs have to deal with, we can't remove it altogether.
@Steve,
Thanks for the clarification. If the DFA engines keep track of all matches simultaneously, I assume they eat up much more memory as they bypass the need to backtrack.