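Regular Expressions, also called Regex, are a way of matching, changing, or replacing patterns within a string or array. PHP has supported two kinds of Regex: Perl Compatible Regular Expressions (PCRE) and POSIX. The syntax of POSIX Regex is simpler, but you can do more with PCRE. Note that the POSIX functions such as ereg() and eregi() were deprecated in PHP 5.3 and removed in PHP 7, so only the PCRE functions work in current versions of PHP.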
Regex is a language within PHP and has a different syntax. Some characters that mean one thing in PHP mean something else in a Regex. The syntax is actually fairly simple, but because it is so compact, it isn't easy to grasp at first. Instead of functions with descriptive names, a character or two is used to do things. There may be many ways to write a Regex that will do what you want, and there will surely be many ways to write a Regex that you think will do what you want but will not. Even if your Regex doesn't cause PHP errors, you need to do some testing to make sure it does what you think it does.
To show the difference in syntax between POSIX and PCRE Regular Expressions, I will compare the POSIX function ereg() to the PCRE function preg_match(). Both can be used in conditional tests, but the syntax is different.
<?php
$string = "tryeyeqwertjujuyu";
// ereg() returns a true value when the pattern is found in $string
if ( ereg("qwert", $string) == true )
{ echo "qwert found"; }
else
{ echo "not found"; }
?>
The example above uses ereg() to check whether the string "qwert" is in the larger string, $string, and if so prints "qwert found". ereg() is case sensitive; if you want a case-insensitive match, you can use eregi(). The next example does the same case-sensitive match with preg_match():
<?php
$string = "tryeyeqwertjujuyu";
// The pattern is wrapped in # delimiters; preg_match() returns 1 on a match
if ( preg_match("#qwert#", $string) == true )
{ echo "qwert found"; }
else
{ echo "not found"; }
?>
The only difference between the two examples is that the search string, "qwert", is enclosed in #'s. For PCRE, your search string must be enclosed by delimiters. By convention they are forward slashes, but they can be any non-alphanumeric, non-whitespace character except the backslash. If the delimiter appears in the search string, it must be escaped. I prefer # or @ because they are more visible than slashes, which makes a script easier to debug. Between the second delimiter and the closing quote, you can add PCRE modifiers to change the way the Regex is interpreted. For example, if you wanted to match the links in a web page, the match has to be case insensitive. To do that, you use the "i" modifier:
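| Modifier | Meaning |
|---|---|
| i | Case insensitive |
| e | Replacement string in preg_replace() treated as PHP code (deprecated in PHP 5.5 and removed in PHP 7; use preg_replace_callback() instead) |
| m | ^ and $ anchors match at the start and end of each line as well as the beginning and end of the string. |
| s | . matches newlines (newlines are not normally matched by .) |
| x | Whitespace in the pattern is ignored and # starts a comment, so a Regex can be spread out and annotated. |
| A | Matches pattern only at start of string. |
| D | $ matches only at the very end of the string, not before a final newline. |
| U | Makes Regex ungreedy |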
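As a rough illustration of delimiters and the "i" modifier, the sketch below runs the same pattern with and without the modifier. It uses preg_match() only; the sample HTML and variable names are made up for the example.

```php
<?php
// Hypothetical sample text; the variable names are illustrative only.
$html = "<A HREF=\"page.html\">Home</A>";

// Without the "i" modifier the match is case sensitive,
// so the pattern "href" does not match "HREF".
$sensitive = preg_match("#href#", $html);

// With "i" placed after the closing delimiter, case is ignored.
$insensitive = preg_match("#href#i", $html);

// A slash inside a slash-delimited pattern must be escaped
// (the backslash is doubled because of the PHP double-quoted string)...
$escaped = preg_match("/<\\/a>/i", $html);

// ...while a different delimiter such as # avoids the escaping.
$unescaped = preg_match("#</a>#i", $html);

var_dump($sensitive, $insensitive, $escaped, $unescaped);
```

preg_match() returns 1 when the pattern matches and 0 when it does not, so only the first call here should report no match.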
Suppose you have a form for users to add comments to a page and it is getting spammed by viagra spammers. You have added strip_tags() to keep them from adding URLs, but they keep adding text with the word "viagra" in it. To block it, you can use ereg() or eregi(). ereg() and eregi() are the POSIX kind of Regex and have a simpler syntax. eregi() is a case-insensitive version of ereg(), so we will use that. If the text input is named "in", the string will be named "$in" (this assumes register_globals is enabled; otherwise read the value from $_POST['in'] or $_GET['in']). To do a case-insensitive search for the word viagra:
if ( isset($in) )
{
if ( eregi("viagra", $in) )
{
die("<font size=\"7\"><b>Spam not allowed</b></font>");
}
}
eregi() is used in an if statement like isset(). If "viagra" is found, eregi() returns True and the die() statement is executed.
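Because eregi() is not available in PHP 7 and later, the same check can be sketched with preg_match() and the "i" modifier. The function name contains_spam() and the sample input are hypothetical, chosen only for this example.

```php
<?php
// Hypothetical helper: returns true when the comment contains the blocked word.
function contains_spam($comment)
{
    // The "i" modifier makes the match case insensitive, like eregi().
    return preg_match("#viagra#i", $comment) === 1;
}

$in = "Buy ViAgRa now";   // sample input for illustration
if (contains_spam($in)) {
    echo "Spam not allowed";
}
```

Wrapping the test in a function keeps the pattern in one place, which helps once the filter has to grow.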
Meta characters are characters that have special meaning to Regex. The following characters have meaning in Regex:
. \ + * ? [ ^ ] $ ( ) { } |
To use any of them, including the backslash itself, as an ordinary character in a Regex, you must escape it by putting a backslash ( \ ) in front of it.
You can use a "." (period) as a wildcard for any single character, including alphanumeric characters, punctuation and spaces. It does not match newlines.
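The difference between the wildcard dot and an escaped dot can be seen in a small sketch like the one below. It also shows preg_quote(), a built-in PHP function that escapes the meta characters in a literal string; the sample strings are invented for the example.

```php
<?php
// The dot is a wildcard: it matches the "x" in "3x50" as well as a real period.
$loose = preg_match("#3.50#", "3x50");

// Escaped with a backslash, the dot only matches a literal period.
// (The backslash is doubled because it sits inside a PHP double-quoted string.)
$strict = preg_match("#3\\.50#", "3x50");

// preg_quote() escapes every meta character in a literal string;
// its second argument names the delimiter so that it is escaped too.
$literal = 'cost: $3.50 (approx.)';   // hypothetical sample text
$pattern = "#" . preg_quote($literal, "#") . "#";
$quoted  = preg_match($pattern, 'Total cost: $3.50 (approx.) today');

var_dump($loose, $strict, $quoted);
```

Using preg_quote() is usually safer than escaping by hand when the text to match comes from somewhere else, such as user input.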
After a while, your spammer starts substituting characters to get past your filters. For example, he uses a "@" for an "a", and a "|" (pipe, shift-\), a "1" or an "l" (lower case "L") for an "i". To match one of a group of characters, enclose them in square brackets:
if ( eregi("v[il1|][a@]gr[a@]", $in) )
The first group, [il1|], matches i or l or 1 or |. Although | (pipe) has meaning elsewhere in a Regex, inside square brackets it is treated as an ordinary character and does not need to be escaped.
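For current PHP versions, the bracketed groups can be tried out with preg_match() as in this sketch. The function name looks_like_spam() is hypothetical; inside square brackets the pipe is written without a backslash because it is literal there.

```php
<?php
// Hypothetical helper built on the bracket groups described above.
function looks_like_spam($comment)
{
    // [il1|] matches one of i, l, 1 or |; [a@] matches a or @.
    // The "i" modifier replaces the case-insensitive behaviour of eregi().
    return preg_match("#v[il1|][a@]gr[a@]#i", $comment) === 1;
}

// A few sample inputs to check the pattern against.
foreach (array("viagra", "V1@GRA", "v|agr@", "vagra") as $sample) {
    echo $sample, ": ", looks_like_spam($sample) ? "blocked" : "allowed", "\n";
}
```

Testing a handful of known-good and known-bad strings like this is a cheap way to confirm the Regex does what you think it does.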
To match a range of characters, separate the first and last characters with a - and enclose them in brackets. For example, to match a single hex digit, you match two ranges, 0 through 9 and a through f, like this:
$in = strtolower($in);
eregi("[0-9a-f]", $in)
One way to match a hex color value is to repeat the above Regex six times:
eregi("[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]", $in)
but you have a problem. What happens if someone enters "123456789"? The Regex will return True because the string contains "123456", even though the string is too long to be a hex color. One way to test that the string is exactly what you want is to use strlen() to make sure that it is only six characters long.
Since the string you are testing for is composed of the sub-string [0-9a-f] repeating six times, you can use {6} to make the Regex shorter.
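Putting the pieces together, a hex color check along these lines might look like the sketch below. It combines the range group, a repeat count of six, and the strlen() test; the function name is_hex_color() is hypothetical, and preg_match() stands in for eregi() so that the code runs on current PHP.

```php
<?php
// Hypothetical helper: true only for a string of exactly six hex digits.
function is_hex_color($in)
{
    // Lower-case the input so the a-f range covers A-F as well.
    $in = strtolower($in);

    // {6} repeats the preceding bracket group exactly six times.
    // The strlen() check rejects longer strings such as "123456789",
    // which would otherwise pass because they contain six hex digits.
    return strlen($in) === 6 && preg_match("#[0-9a-f]{6}#", $in) === 1;
}

var_dump(is_hex_color("FF00cc"));
var_dump(is_hex_color("123456789"));
```

The length check does the job that the Regex alone cannot do here, since the pattern by itself only asks whether six hex digits appear somewhere in the string.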