Posts Tagged “regular expressions”

This evening I got a comment on my post about how to detect when cell phones access your web site. It was a question about the way I went about the PHP code for matching a bunch of small text snippets against the User Agent string.

Basically, I put all the text snippets in a large array and then used a foreach loop to match each one individually against the User Agent string. A visitor to the page asked why I hadn't done a very large regular expression, essentially including all those snippets with "or" symbols between them. So instead of "match 1, match 2, match 3," I'd get "match 1 or 2 or 3". She asked: "Is this more efficient processing, or more organized for you, or avoids some possible error getting thrown?"
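The two approaches can be sketched side by side. This is only an illustration: the snippet list and the function names matchesWithForeach and matchesWithGiantRegex are hypothetical, not the code from the original post.

```php
<?php
// Hypothetical snippet list; the real list from the earlier post is not shown here.
$snippets = array('iphone', 'blackberry', 'nokia', 'windows ce');

// Approach 1: loop over the array and test each snippet on its own.
function matchesWithForeach(array $snippets, $userAgent) {
    foreach ($snippets as $snippet) {
        if (preg_match('/' . preg_quote($snippet, '/') . '/i', $userAgent)) {
            return true;
        }
    }
    return false;
}

// Approach 2: join every snippet with "|" (or) into one large pattern.
function matchesWithGiantRegex(array $snippets, $userAgent) {
    $quoted = array();
    foreach ($snippets as $snippet) {
        $quoted[] = preg_quote($snippet, '/');
    }
    $pattern = '/' . implode('|', $quoted) . '/i';
    return preg_match($pattern, $userAgent) === 1;
}
```

Both functions answer the same yes/no question; the only difference is whether the regex engine is called once per snippet or once in total.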

Well, the initial answer was: "I just didn't think of doing a giant regex. I like arrays and my mind went in that direction."

But that only answers why I did it, not whether it's better. This is such a small piece of code that you can execute it 10,000 times in under 1.7 seconds, so it's not going to increase your server load significantly. Still, I realized that on a very high traffic web site, cutting execution time even on such a small piece of code could make a difference. So I tested it.

I ran both methods (the foreach and the array as a giant regex) against a string of 190-195 characters in batches of 10,000 iterations, and moved the bit of matching text around the string.
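A test along these lines could look like the sketch below. It is not the script I actually ran; the snippet list, the filler character, and the helper names buildSubject and timeMatcher are all assumptions made for illustration. The snippets are plain letters, so they are dropped into the patterns without escaping.

```php
<?php
// Hypothetical snippet list, for illustration only.
$snippets = array('iphone', 'blackberry', 'nokia');

// Build a filler string of $length characters with $needle placed at $offset.
function buildSubject($length, $needle, $offset) {
    $filler = str_repeat('-', $length);
    return substr_replace($filler, $needle, $offset, strlen($needle));
}

// Run $matcher against $subject $iterations times and return elapsed seconds.
function timeMatcher($matcher, $subject, $iterations = 10000) {
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        call_user_func($matcher, $subject);
    }
    return microtime(true) - $start;
}

// The foreach method: one preg_match call per snippet.
$foreachMatcher = function ($subject) use ($snippets) {
    foreach ($snippets as $snippet) {
        if (preg_match('/' . $snippet . '/i', $subject)) {
            return true;
        }
    }
    return false;
};

// The giant regex method: one preg_match call with every snippet or'ed together.
$giantPattern = '/' . implode('|', $snippets) . '/i';
$regexMatcher = function ($subject) use ($giantPattern) {
    return preg_match($giantPattern, $subject) === 1;
};

// Move the match deeper into a 190-character string and compare timings.
foreach (array(5, 30, 150) as $offset) {
    $subject = buildSubject(190, 'nokia', $offset);
    printf("offset %3d: foreach %.4fs, giant regex %.4fs\n",
        $offset,
        timeMatcher($foreachMatcher, $subject),
        timeMatcher($regexMatcher, $subject));
}
```

The absolute numbers will depend on the machine and PHP version, so the thing to look at is how each method's time changes as the offset grows.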

I got approximately the same execution time for the foreach method no matter where the matching text was in the string. But with the giant regex, the execution times varied by huge amounts. The farther the matching text was from the beginning of the string, the slower the giant regex worked. When the matching text was in the first 10 characters, the giant regex was 6-7 times faster than the foreach method. If it was around the 30th character, around 4 times faster.

But when the matching text was somewhere around the 150th character, the giant regex took 50-60% longer than the foreach method. When there was no match whatsoever, the giant regex took over twice as long.

I then tried increasing the length of the string (with no matches in it). The performance disparity grew deeper and deeper. At 760 characters, the giant regex was taking around 4.25 times longer than the foreach method. At 1520 characters (around 250 words), the giant regex had slipped to 5.4 times longer than the foreach method. Yet if there was a match in the first 10 characters of the 1520-character string, the giant regex took no longer than if the match was in the first 10 characters of a 190-character string.

Now because User Agent strings generally aren't too long and you don't have to go too deep to get a match, in this specific case, using the giant regex would probably be more computationally efficient. And if you were running this snippet millions of times a day, you could see tangible gains from optimizing it. But as a rule of thumb, the foreach method is going to be the better general purpose choice.

So, Jane, I hope that answered your question.


Replacing a short bit of text in PHP is easy. You just use str_replace, as in $bobo = str_replace("foo","bar",$bobo). That will replace all instances of "foo" with "bar" in the string $bobo.

But on a recent contract, I had a client whose site only worked properly in Microsoft Internet Explorer, and part of my task was to get it to work similarly in all browsers. One of the things they had done was use alt attributes (often called "ALT tags"), so that when you rolled your mouse over a graphic, the alt text would pop up as a tool tip. The problem is that this is a non-standard behavior. Only Microsoft Internet Explorer pops up alt text. So how to get the alt text to pop up in all browsers? JavaScript, of course.

I'm not going to go into the Javascript. There are many different ways to do a pop-up tooltip. The trick, though, was to get all those alt texts fed to the tooltip script. Now, there may have been a way in the DOM to get the element's alt text and feed it to the script, but I wasn't sure and didn't have a whole lot of time to hunt it down. Instead I wrote up a quickie PHP script to handle it. This is not a CGI, but a script I ran from the command line. And you don't need to be running from the Linux or Mac command line. You can install PHP on Windows XP or Vista and run it from the DOS command line.

The first part of the program recursed the directory with all of the HTML files, and each time it found an HTML file, it fed it to a function in the script that read it, altered it, and saved it. I'll write a post on how to do that soon. For now, I'm just going to get into the alteration.

Let's assume I had a line of HTML like this...

<img src="fred.jpg" alt="This is a picture of Republican presidential candidate, Fred Thompson" height=300 width=190 border=0>

As I read through the file, line by line (each line being $line), I applied the following bit of code...

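// Only rewrite lines that actually contain a non-empty alt attribute.
if (preg_match("/alt=\"[^\"]+\"/i", $line)) {
   // \\1 is the text captured by the parentheses, i.e. the alt text itself.
   $newline = preg_replace("/alt=\"([^\"]+)\"/i", "alt=\"\\1\" onMouseOver='toolTip(\"\\1\")'", $line);
} else {
   $newline = $line;
}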

That will turn...

<img src="fred.jpg" alt="This is a picture of Republican presidential candidate, Fred Thompson" height=300 width=190 border=0>

... into ...

<img src="fred.jpg" alt="This is a picture of Republican presidential candidate, Fred Thompson" onMouseOver='toolTip("This is a picture of Republican presidential candidate, Fred Thompson")' height=300 width=190 border=0>

Now, how did that work? What preg_replace does is store all the parts of your regular expression that are within parentheses in a series of numbered strings. So alt=\"([^\"]+)\" set everything inside the quotes after alt= as a string which could be re-used. So we replaced alt=\"([^\"]+)\" with itself (alt=\"\\1\") plus a call to the JavaScript tooltip function using the same text as in the ALT tag (onMouseOver='toolTip(\"\\1\")').

You could also do parentheses within parentheses, capturing a larger piece of text and then capturing a smaller subset of that piece of text. The system numbers the groups by counting the opening parentheses from left to right. For simplification, I'll demonstrate this with a sentence and how the pieces would break out.

Fred (Thompson is a (Republican candidate for (President)) of the) United (States)

That would break out as...
1: Thompson is a Republican candidate for President of the
2: Republican candidate for President
3: President
4: States
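The numbering can be checked with a short sketch. Here the parentheses from the sentence above are used as capturing groups in a pattern, and the pattern is matched against the plain sentence; the variable names are just for illustration.

```php
<?php
// The parentheses from the example sentence, used as capturing groups.
$pattern = '/Fred (Thompson is a (Republican candidate for (President)) of the) United (States)/';
$subject = 'Fred Thompson is a Republican candidate for President of the United States';

// $matches[0] holds the whole match; $matches[1] through $matches[4] hold
// the groups, numbered by where each opening parenthesis appears.
preg_match($pattern, $subject, $matches);

for ($i = 1; $i <= 4; $i++) {
    echo $i . ': ' . $matches[$i] . "\n";
}
```

In a replacement string, those same groups would be available as \\1 through \\4.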

And that's how that works. I'll do a post soon on how to go through a directory and use it on all files.


Validation is the art of checking and modifying input from users so that it's all of what your web site needs and none of what it doesn't want. Validation can be as complex as anti-hacking and anti-spam measures or as simple as making sure someone's not putting their phone number in the box meant for their birthday.

The recent MySpace hack, where band pages were hacked to cover them up with an invisible link that sent you to a malware infection site, showed that it can be dangerous to let users put their HTML in your pages. To allow users to customize their pages with everything from images to new backgrounds, MySpace allowed some pretty complex user-generated HTML. Hackers took advantage of that to use some CSS tricks to create those page overlays.

But let's say you're building a site where you want your users to be able to have some formatting control over content they submit. How do you allow some HTML but not other HTML? There are three ways to do this... exclusion, inclusion, and substitution.

Validating Through Exclusion

Validating through exclusion may seem simple at first. You just look for the tags you don't want and kill them. You can write rules that eliminate <SCRIPT> tags, "onClick" and "onMouseOver", etc. But you'll find that this list grows and grows. It's a cat and mouse game as hackers try to discover exploits you haven't thought to guard against yet.

Exclusion is nice in that it gives your users a much broader range of options. They can do everything that isn't explicitly prohibited. But if your list of exclusions isn't spot on, you'll be playing catch-up.
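As a rough sketch of what exclusion looks like in PHP, here is a deliberately short blacklist. The function name stripExcluded and the three rules are assumptions for illustration, and the list is nowhere near complete, which is exactly the weakness described above.

```php
<?php
// Hypothetical, incomplete blacklist: remove what we know to be bad.
function stripExcluded($html) {
    $rules = array(
        '/<script\b[^>]*>.*?<\/script\s*>/is',             // whole <script> blocks
        '/<\/?script\b[^>]*>/i',                           // stray <script> or </script> tags
        '/\s+on\w+\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/i',  // onClick, onMouseOver, etc.
    );
    // Each rule is applied in turn; whatever it matches is deleted.
    return preg_replace($rules, '', $html);
}

echo stripExcluded('<b onClick="evil()">hi</b><script>alert(1)</script>') . "\n";

// Something the list didn't anticipate passes through untouched.
echo stripExcluded('<a href="javascript:alert(1)">click</a>') . "\n";
```

The second call shows the catch-up problem: a javascript: link isn't covered by any rule, so it survives until someone thinks to add one. The third rule is also crude enough to strip ordinary text that happens to look like an attribute.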

Validating Through Inclusion

An inclusion style of validation comes at the problem from the opposite direction. Instead of trying to catch the stuff you know to be bad, you allow the stuff you believe to be good. It's a sort of HTML guest list; the tags on the list get into the party, but the tags not on the list don't. You can even enforce a dress code.

To do this, you run a global search of the input to find all the HTML tags in it, then compare each one against a set of rules. If it matches one of the rules, it's good and gets through. If not, then it's bad and you can handle it by deleting it, showing the user an error message, banning the user from your site forever, whatever you want.

For example, let's say you're going to allow them to use a FONT tag and they can specify three characteristics: size, color, and the font face. Here's a regular expression you can use that will check to see if a tag is a conforming FONT tag.

/<\/?font(( face=\"[^\">]*\")?|( size=\"\d\d?\")?|( color=\"[^\">]*\")?)* *?\/?>/i

Now we'll break it down...

/<\/?: The first forward slash opens the regular expression like a < opens an HTML tag. So the expression starts by looking for the opening <. The backslash (\) tells the interpreter not to see the forward slash (/) as the end of the regular expression, but as just a normal forward slash. Then the question mark after it means it can occur zero or one times, meaning, this first bit will match "<" or "</".

font: The simplest part, following the "<" or "</", there must be "font".

(( face=\"[^\">]*\")?|( size=\"\d\d?\")?|( color=\"[^\">]*\")?)*: This is the part that looks for the font tag modifiers. We'll break it down from the inside out. Inside the parentheses, we have three sub-expressions in parentheses, each followed by a question mark.

( face=\"[^\">]*\")?|: This looks for a space, the word "face" followed by an equals sign, followed by a quotation mark, followed by zero or more characters that aren't a quotation mark or a >, followed by another quotation mark. The question mark outside those parentheses says it should occur zero or one times. That's followed by a pipe symbol (|) which means "or".

( size=\"\d\d?\")?|: This is a more specific check. Since you only want a number in there, we have "\d\d?" The "\d" stands for any digit from 0 to 9, so "\d\d?" means the pattern requires one digit from 0-9 followed by zero or one more digits, allowing any value from 0-99. If you wanted the value to just be 0-9 (which is actually fairly reasonable for font size values), you'd just take out the "\d?". Like the pattern before it, it's followed by a question mark, meaning it should occur zero or one times, and a pipe (|) which means "or".

( color=\"[^\">]*\")?: This final pattern looks for a color value which can be anything that doesn't contain a quotation mark or a > and occurs zero or one times.

(...)*: All of that is enclosed in a set of parentheses followed by a star. The parentheses make them into a set, basically "(1 or 2 or 3)" and the star means "zero or more times". So the whole pattern allows any number of instances of those three attributes, but only those three attributes. You could specify the font face twice or not at all, but you could only specify the face, the color, or the size.

 *?\/?>/i: The space-star-question combo matches zero or more spaces, taking as few as possible (a question mark directly after a star makes the star lazy; it doesn't mean "zero or one times" here). Next comes the possibility of a closing forward slash within the tag for someone who is XHTML crazy, then the closing >, and then the forward slash to close the pattern. Finally, it's followed by an "i" which is a modifier that tells it the matching should be case insensitive. Without that, a tag that started with <FONT instead of <font would get tossed as a non-match.
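Putting the pieces together, the "find every tag, then check it against the guest list" step could be sketched like this. The function name filterTags and the second rule (plain b and i tags) are assumptions for illustration. The FONT pattern is the one broken down above, with two small changes: it's written in a single-quoted string, so the quotation marks don't need backslashes, and it's wrapped in ^ and $ so the whole tag has to conform rather than just part of it.

```php
<?php
// Keep only tags that match one of the allowed patterns; delete the rest.
function filterTags($html) {
    $allowed = array(
        // The FONT pattern from this post, anchored with ^ and $.
        '/^<\/?font(( face="[^">]*")?|( size="\d\d?")?|( color="[^">]*")?)* *?\/?>$/i',
        // Hypothetical second rule: bare <b>, </b>, <i>, </i>.
        '/^<\/?(b|i)>$/i',
    );
    // Global search for anything that looks like a tag, one callback per tag.
    return preg_replace_callback('/<[^>]*>/', function ($m) use ($allowed) {
        foreach ($allowed as $rule) {
            if (preg_match($rule, $m[0])) {
                return $m[0]; // on the guest list: let it through
            }
        }
        return ''; // not on the list: delete the tag
    }, $html);
}

echo filterTags('<font color="red">Hi</font> <b>ok</b> <script>x</script>') . "\n";
echo filterTags('<font onclick="evil()">hi</font>') . "\n";
```

Deleting the tag is only one way to handle a failure. The callback could just as easily record the offending tag so you can show the user an error message instead.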

Inclusion works very well if you know precisely what you want to allow, can write good patterns for matching it, and can communicate what's allowed to your users. It's also good if you're using a JavaScript-based WYSIWYG text editor like TinyMCE that lets your users format their text as if they were working in a word processor. Generally, across most browsers (except Safari), you'll be able to get consistent, predictable HTML submitted to the form from WYSIWYG editors, making it possible to develop an inclusion list because you know exactly which tags the editor will return.

The advantage of using an inclusion validation method combined with a WYSIWYG editor is that you don't have to tell your users what's acceptable HTML and what isn't. They're one step removed from the HTML, and as long as the editor's controls are pretty intuitive or self-explanatory, your users have little or no learning curve for formatting their text in guestbooks, blog comments, or bulletin boards, but you have the ability to lock down that formatting with fairly solid precision.

Validating Through Substitution

Validating through substitution is a form of inclusion, but it exerts even tighter control. Instead of using HTML, users are required to use a different form of mark-up which a browser would never recognize. Then, at some point, your scripts parse the mark-up into HTML, so the final product is formatted the way the user wanted.

An example of this is BBCode, which is used by a number of bulletin board and forum systems. Instead of <b> to start bolding text, you use [b]. Somewhere in the processing engine (usually during the code that retrieves your post from the database and displays it to a visitor) a <b> is substituted for the [b] you used.
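A minimal sketch of that substitution step might look like the following. It is not how any particular forum package does it; the function name bbcodeToHtml and the two supported codes are assumptions for illustration. Escaping the input first is also my own assumption, made so that any real HTML the user typed is displayed as text instead of being passed through.

```php
<?php
// Convert a small, assumed set of BBCode-style codes into HTML.
function bbcodeToHtml($text) {
    // Escape first so real HTML in the input is shown as text, not rendered.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Then substitute the codes we recognize; everything else stays inert.
    $rules = array(
        '/\[b\](.*?)\[\/b\]/is' => '<b>$1</b>',
        '/\[i\](.*?)\[\/i\]/is' => '<i>$1</i>',
    );
    return preg_replace(array_keys($rules), array_values($rules), $safe);
}

echo bbcodeToHtml('[b]Hi[/b] <script>') . "\n";
```

Because only complete pairs like [b]...[/b] are substituted, a stray [b] with no closing code is left as plain text.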

The problem with a substitution method is that it requires your users to learn a new form of mark-up and is a lot harder to get working with a WYSIWYG editor, so it requires users who are a bit more savvy and either know the substitution's mark-up or are willing to spend the time to learn.

On the other hand, substitution can be done on a smaller scale, such as simply substituting smiley-face graphics for emoticons... :-)

Summing It All Up

Basically, the types of validation can be summed up like this...

Exclusion: Toss everything I know to be bad and let everything else through.
Inclusion: Allow everything I know to be good through and toss everything else.
Substitution: Look for special non-functional codes and make them functional, but toss all other codes.

What you choose should be determined by how much freedom you want to offer your users, how much of a learning curve you want them to go through, and how much time you want to spend maintaining your validation methods.

Best of luck!
