Regular Expressions: A Bunch of Little Ones or One Big One?
Jan 14th, 2008 by Greg Bulmash
This evening I got a comment on my post about how to detect when cell phones access your web site. It was a question about the way I went about the PHP code for matching a bunch of small text snippets against the User Agent string.
Basically, I put all the text snippets in a large array and then used a foreach loop to match each one individually against the User Agent string. A visitor to the page asked why I hadn't done a very large regular expression, essentially including all those snippets with "or" symbols between them. So instead of "match 1, match 2, match 3," I'd get "match 1 or 2 or 3". She asked: "Is this more efficient processing, or more organized for you, or avoids some possible error getting thrown?"
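For anyone who wants to see the two approaches side by side, here is a rough sketch. It is in Python rather than the PHP the original code used, and the snippet list and the function names `match_each` and `match_combined` are made up for illustration; they are not the actual list or code from the earlier post.

```python
import re

# Hypothetical snippets; the real list from the earlier post is not reproduced here.
SNIPPETS = ["iphone", "blackberry", "nokia", "windows ce", "palm"]


def match_each(user_agent, snippets=SNIPPETS):
    """Loop approach: run one small regex per snippet, stop at the first hit."""
    for snippet in snippets:
        if re.search(re.escape(snippet), user_agent, re.IGNORECASE):
            return True
    return False


# Giant-regex approach: join every snippet with the "or" symbol (|).
COMBINED = re.compile("|".join(re.escape(s) for s in SNIPPETS), re.IGNORECASE)


def match_combined(user_agent, pattern=COMBINED):
    """Single-pattern approach: one search with all snippets as alternatives."""
    return pattern.search(user_agent) is not None
```

Both functions should give the same yes/no answer for any input; the only question is how long each takes to get there.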
Well, the initial answer was: "I just didn't think of doing a giant regex. I like arrays and my mind went in that direction."
But that only answers why I did it, not whether it's better. This is such a small piece of code that you can execute it 10,000 times in under 1.7 seconds, so it's not going to increase your server load significantly. Still, I realized that on a very high-traffic web site, cutting execution time even on such a small piece of code could make a difference. So I tested it.
I ran both methods (the foreach loop and the array joined into one giant regex) against a string of 190-195 characters in batches of 10,000 iterations, moving the bit of matching text around the string between batches.
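A test like this can be sketched roughly as follows. This is Python, not the PHP used for the original test, and everything in it is an assumption for illustration: the snippet list, the filler character, and the helper names `build_subject` and `time_batch`. Python's regex engine is not PHP's, so the timings it produces may not line up with the figures in this post.

```python
import re
import timeit

# Hypothetical snippets, not the list from the original test.
SNIPPETS = ["iphone", "blackberry", "nokia", "windows ce", "palm"]
COMBINED = re.compile("|".join(re.escape(s) for s in SNIPPETS))


def match_each(subject):
    """One small regex per snippet."""
    return any(re.search(re.escape(s), subject) for s in SNIPPETS)


def match_combined(subject):
    """One big alternation."""
    return COMBINED.search(subject) is not None


def build_subject(length, needle="", position=0, filler="x"):
    """Build a filler string of `length` chars with `needle` overwritten at `position`."""
    body = filler * length
    if not needle:
        return body
    return body[:position] + needle + body[position + len(needle):]


def time_batch(func, subject, iterations=10_000):
    """Seconds taken to run func(subject) `iterations` times."""
    return timeit.timeit(lambda: func(subject), number=iterations)


if __name__ == "__main__":
    # Move the matching text around a 190-character string.
    for position in (5, 30, 150):
        subject = build_subject(190, "nokia", position)
        print(position,
              time_batch(match_each, subject),
              time_batch(match_combined, subject))
```

Calling `build_subject(190)` with no needle gives the no-match case.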
I got approximately the same execution time for the foreach method no matter where the matching text was in the string. But with the giant regex, the execution times varied by huge amounts. The farther the matching text was from the beginning of the string, the slower the giant regex worked. When the matching text was in the first 10 characters, the giant regex was 6-7 times faster than the foreach method. If it was around the 30th character, around 4 times faster.
But when the matching text was somewhere around the 150th character, the giant regex took 50-60% longer than the foreach method. When there was no match whatsoever, the giant regex took over twice as long.
I then tried increasing the length of the string (with no matches in it). The performance gap grew wider and wider. At 760 characters, the giant regex was taking around 4.25 times longer than the foreach method. At 1520 characters (around 250 words), the giant regex had slipped to 5.4 times longer. Yet if there was a match in the first 10 characters of the 1520-character string, the giant regex took no longer than if the match was in the first 10 characters of a 190-character string.
Now, because User Agent strings generally aren't too long and you don't have to go too deep to get a match, in this specific case, using the giant regex would probably be more computationally efficient. And if you were running this snippet millions of times a day, you could see tangible gains from optimizing it. But as a rule of thumb, the foreach method is going to be the better general-purpose choice.
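The length experiment can be sketched the same way: time both methods against no-match strings of growing length and look at the ratio. Again, this is Python rather than the original PHP, and the snippet list and the function name `ratio_by_length` are hypothetical. It makes no promise of reproducing the ratios above, since a different regex engine can behave quite differently.

```python
import re
import timeit

# Hypothetical snippets, not the list from the original test.
SNIPPETS = ["iphone", "blackberry", "nokia", "windows ce", "palm"]
COMBINED = re.compile("|".join(re.escape(s) for s in SNIPPETS))


def match_each(subject):
    """One small regex per snippet."""
    return any(re.search(re.escape(s), subject) for s in SNIPPETS)


def match_combined(subject):
    """One big alternation."""
    return COMBINED.search(subject) is not None


def ratio_by_length(lengths=(190, 760, 1520), iterations=10_000):
    """Map each length to (giant-regex time / loop time) on a no-match string."""
    ratios = {}
    for length in lengths:
        subject = "x" * length  # filler that contains none of the snippets
        loop_time = timeit.timeit(lambda: match_each(subject), number=iterations)
        big_time = timeit.timeit(lambda: match_combined(subject), number=iterations)
        ratios[length] = big_time / loop_time
    return ratios


if __name__ == "__main__":
    for length, ratio in ratio_by_length().items():
        print(f"{length} chars: giant regex / foreach = {ratio:.2f}")
```

A ratio above 1 means the giant regex was slower for that length; below 1 means it was faster.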
So, Jane, I hope that answered your question.