Validation is the art of checking and modifying input from users so that it's all of what your web site needs and none of what it doesn't want. Validation can be as complex as anti-hacking and anti-spam measures or as simple as making sure someone's not putting their phone number in the box meant for their birthday.
The recent MySpace hack, where band pages were hacked to cover them up with an invisible link that sent you to a malware infection site, showed that it can be dangerous to let users put their HTML in your pages. To allow users to customize their pages with everything from images to new backgrounds, MySpace allowed some pretty complex user-generated HTML. Hackers took advantage of that to use some CSS tricks to create those page overlays.
But let's say you're building a site where you want your users to be able to have some formatting control over content they submit. How do you allow some HTML but not other HTML? There are three ways to do this... exclusion, inclusion, and substitution.
Validating Through Exclusion
Validating through exclusion may seem simple at first. You just look for the tags you don't want and kill them. You can write rules that eliminate <SCRIPT> tags, "onClick" and "onMouseOver", etc. But you'll find that this list grows and grows. It's a cat and mouse game as hackers try to discover exploits you haven't thought to guard against yet.
Exclusion is nice in that it gives your users a much broader range of options. They can do everything that isn't explicitly prohibited. But if your list of exclusions isn't spot on, you'll be playing catch-up.
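Here's a rough sketch of what exclusion looks like in practice, written in Python. The two rules and the names BLOCKED_PATTERNS and strip_blocked are made up for illustration, not a recommended list. The point is to show how easily something slips past a list like this.

```python
import re

# Hypothetical blocklist: only the things we have thought of so far.
BLOCKED_PATTERNS = [
    # <script> and </script> tags
    re.compile(r"<\s*/?\s*script[^>]*>", re.IGNORECASE),
    # Quoted event handlers such as onclick="..." or onmouseover="..."
    re.compile(r"\son\w+\s*=\s*\"[^\"]*\"", re.IGNORECASE),
]

def strip_blocked(html_text: str) -> str:
    """Delete every match of every blocked pattern from the input."""
    for pattern in BLOCKED_PATTERNS:
        html_text = pattern.sub("", html_text)
    return html_text

# The rules catch the cases they were written for...
cleaned = strip_blocked('<b onclick="steal()">hi</b><script>alert(1)</script>')

# ...but this link passes through untouched, because nobody wrote a rule
# for javascript: URLs yet.
missed = strip_blocked('<a href="javascript:steal()">click</a>')
```

Every gap like the second one means another rule, which is exactly the catch-up game described above.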
Validating Through Inclusion
An inclusion style of validation comes at the problem from the opposite direction. Instead of trying to catch the stuff you know to be bad, you allow the stuff you believe to be good. It's a sort of HTML guest list; the tags on the list get into the party, but the tags not on the list don't. You can even enforce a dress code.
To do this, you run a global search of the input to find all the HTML tags in it, then compare each one against a set of rules. If it matches one of the rules, it's good and gets through. If not, then it's bad and you can handle it by deleting it, showing the user an error message, banning the user from your site forever, whatever you want.
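The find-and-compare loop described above might be sketched like this in Python. The allow-list here (bold, italic, underline, paragraph, and line break tags) and the names ALLOWED_TAG_RULES and filter_tags are assumptions for the example, and this version handles rejects by deleting them.

```python
import re

# Assumed guest list: each rule has to match a whole tag, start to finish.
ALLOWED_TAG_RULES = [
    re.compile(r"</?(b|i|u|p)>", re.IGNORECASE),
    re.compile(r"<br ?/?>", re.IGNORECASE),
]

# Anything that looks like a tag: "<", any run of non-">" characters, ">".
ANY_TAG = re.compile(r"<[^>]*>")

def filter_tags(text: str) -> str:
    """Keep tags that match a rule; delete every other tag."""
    def keep_or_drop(match):
        tag = match.group(0)
        if any(rule.fullmatch(tag) for rule in ALLOWED_TAG_RULES):
            return tag
        return ""  # not on the list, so it doesn't get in
    return ANY_TAG.sub(keep_or_drop, text)

result = filter_tags("<p>Hello <b>world</b><script>alert(1)</script><br/></p>")
```

Note that this sketch only looks at complete tags. A stray "<" with no closing ">" is left alone, so a real filter would probably want to escape leftover angle brackets as well.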
For example, let's say you're going to allow them to use a FONT tag and they can specify three characteristics: size, color, and the font face. Here's a regular expression you can use that will check to see if a tag is a conforming FONT tag.
/<\/?font(( face=\"[^\">]*\")?|( size=\"\d\d?\")?|( color=\"[^\">]*\")?)* *?\/?>/i
Now we'll break it down...
/<\/?: The first forward slash opens the regular expression like a < opens an HTML tag. So the expression starts by looking for the opening <. The backslash (\) tells the interpreter not to see the forward slash (/) as the end of the regular expression, but as just a normal forward slash. Then the question mark after it means it can occur zero or one times, meaning, this first bit will match "<" or "</".
font: The simplest part, following the "<" or "</", there must be "font".
(( face=\"[^\">]*\")?|( size=\"\d\d?\")?|( color=\"[^\">]*\")?)*: This is the part that looks for the font tag modifiers. We'll break it down from the inside out. Inside the parentheses, we have three sub-expressions in parentheses, each followed by a question mark.
( face=\"[^\">]*\")?|: This looks for a space, the word "face" followed by an equals sign, followed by a quotation mark, followed by zero or more characters that aren't a quotation mark or a >, followed by another quotation mark. The question mark outside those parentheses says it should occur zero or one times. That's followed by a pipe symbol (|) which means "or".
( size=\"\d\d?\")?|: This is a more specific check. Since you only want a number in there, we have "\d\d?" The "\d" stands for any digit from 0 to 9, so "\d\d?" means the pattern requires one digit from 0-9 followed by zero or one more digits, allowing any value from 0-99. If you wanted the value to just be 0-9 (which is actually fairly reasonable for font size values), you'd just take out the "\d?". Like the pattern before it, it's followed by a question mark, meaning it should occur zero or one times, and a pipe (|) which means "or".
( color=\"[^\">]*\")?: This final pattern looks for a color value which can be anything that doesn't contain a quotation mark or a > and occurs zero or one times.
(...)*: All of that is enclosed in a set of parentheses followed by a star. The parentheses make them into a set, basically "(1 or 2 or 3)" and the star means "zero or more times". So the whole pattern is any number of instances of those three attributes, in any order, but only those three attributes. So you could specify the font face twice or not at all, but you could only specify the face, the color, or the size.
*?\/?>/i: The space-star combo matches zero or more spaces. The question mark after the star doesn't mean "zero or one times" here; it makes the star lazy, so it grabs as few spaces as it can, which doesn't change which tags are accepted. Then comes the possibility of a closing forward slash within the tag for someone who is XHTML crazy, the closing > and then the forward slash to close the pattern. Then it's followed by an "i" which is a modifier that tells it the matching should be case insensitive. Without that, a tag that started with <FONT instead of <font would get tossed as a non-match.
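To try the pattern out, here's a sketch of the same expression in Python. Python has no /.../i literal syntax, so the opening and closing slashes go away, the "i" becomes re.IGNORECASE, and the backslashes in front of the quotation marks and forward slashes aren't needed. The function name is_conforming_font_tag is just a placeholder, and fullmatch is used on the assumption that you're testing one tag at a time and want the whole tag to conform.

```python
import re

# The FONT pattern from above, rewritten as a Python raw string.
FONT_TAG = re.compile(
    r'</?font(( face="[^">]*")?|( size="\d\d?")?|( color="[^">]*")?)* *?/?>',
    re.IGNORECASE,
)

def is_conforming_font_tag(tag: str) -> bool:
    """True only if the entire string is an acceptable FONT tag."""
    return FONT_TAG.fullmatch(tag) is not None

print(is_conforming_font_tag('<font face="Arial" size="3">'))      # accepted
print(is_conforming_font_tag("</FONT>"))                           # accepted
print(is_conforming_font_tag('<font color="red" onclick="x()">'))  # rejected
print(is_conforming_font_tag('<font size="123">'))                 # rejected
```

The third tag fails because onclick isn't one of the three listed attributes, and the fourth fails because the size value has three digits.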
The advantage of using an inclusion validation method combined with a WYSIWYG editor is that you don't have to tell your users what's acceptable HTML and what isn't. They're one step removed from the HTML, and as long as the editor's controls are pretty intuitive or self-explanatory, your users have little or no learning curve for formatting their text in guestbooks, blog comments, or bulletin boards, but you have the ability to lock down that formatting with fairly solid precision.
Validating Through Substitution
Validating through substitution is a form of inclusion, but it exerts even tighter control. Instead of using HTML, users are required to use a different form of mark-up which a browser would never recognize. Then, at some point, your scripts parse the mark-up into HTML, so the final product is formatted the way the user wanted.
An example of this is BBCode, which is used by a number of bulletin board and forum systems. Instead of <b> to start bolding text, you use [b]. Somewhere in the processing engine (usually during the code that retrieves your post from the database and displays it to a visitor) a <b> is substituted for the [b] you used.
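A bare-bones version of that substitution step could look like the Python sketch below. It only knows three codes ([b], [i], and [u]), which is an assumption for the example; real BBCode engines support many more. The function name bbcode_to_html is a placeholder too.

```python
import html
import re

# Assumed set of supported codes for this example.
SIMPLE_TAGS = ("b", "i", "u")

def bbcode_to_html(text: str) -> str:
    """Turn [b]...[/b] style codes into HTML, neutralizing raw HTML first."""
    # Escape first, so any HTML the user typed shows up as plain text.
    out = html.escape(text)
    for tag in SIMPLE_TAGS:
        # Only matched pairs are converted; anything else stays as typed.
        pattern = re.compile(
            r"\[%s\](.*?)\[/%s\]" % (tag, tag), re.IGNORECASE | re.DOTALL
        )
        out = pattern.sub(r"<%s>\1</%s>" % (tag, tag), out)
    return out

rendered = bbcode_to_html("[b]Hello[/b] <script>alert(1)</script>")
```

Because the escaping happens before the substitution, the only real HTML in the output is the HTML your own script put there.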
The problem with a substitution method is that it requires your users to learn a new form of mark-up and is a lot harder to get working with a WYSIWYG editor, so it requires users who are a bit more savvy and either know the substitution's mark-up or are willing to spend the time to learn.
On the other hand, substitution can be done on a smaller scale, such as simply substituting smiley-face graphics for emoticons...
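Here's a small sketch of that kind of emoticon swap in Python. The image paths and the names EMOTICON_IMAGES and replace_emoticons are placeholders, not part of any real system.

```python
import html
import re

# Hypothetical mapping from emoticon text to image files.
EMOTICON_IMAGES = {
    ":-)": "/images/smile.gif",
    ":-(": "/images/frown.gif",
}

# One pattern that matches any of the emoticons above.
EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(code) for code in EMOTICON_IMAGES)
)

def replace_emoticons(text: str) -> str:
    """Escape the text, then swap each emoticon for an image tag."""
    def to_image(match):
        code = match.group(0)
        return '<img src="%s" alt="%s">' % (EMOTICON_IMAGES[code], code)
    return EMOTICON_PATTERN.sub(to_image, html.escape(text))

smiley = replace_emoticons("Nice post :-)")
```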
Summing It All Up
Basically, the types of validation can be summed up like this...
Exclusion: Toss everything I know to be bad and let everything else through.
Inclusion: Allow everything I know to be good through and toss everything else.
Substitution: Look for special non-functional codes and make them functional, but toss all other codes.
What you choose should be determined by how much freedom you want to offer your users, how much of a learning curve you want them to go through, and how much time you want to spend maintaining your validation methods.
Best of luck!