The Dot Matches (Almost) Any Character (Page 8)
The dot, or period, is one of the most versatile and commonly used metacharacters in regular expressions. However, it is also one of the most misused.
The dot matches any single character except newline characters. Most regex flavors discussed in this tutorial behave this way by default. The exception dates back to the early days of regex, when tools were line-based: they read a file line by line and applied the regex to each line separately. The text being searched therefore never contained a newline, so the dot never needed to match one.
In modern tools, you can enable an option to make the dot match newline characters as well. For example, in tools like RegexBuddy, EditPad Pro, or PowerGREP, you can check a box labeled "dot matches newline."
Single-Line Mode
In Perl, the mode that makes the dot match newline characters is called single-line mode. You can activate this mode by adding the s flag to the regex, like this:
m/^regex$/s;
Other languages and regex libraries, such as the .NET framework, have adopted this terminology. In .NET, you can enable single-line mode by using the RegexOptions.Singleline option:
Regex.Match("string", "regex", RegexOptions.Singleline);
In most programming languages and libraries, enabling single-line mode only affects the behavior of the dot. It has no impact on other aspects of the regex.
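As a rough illustration, here is the same idea in Python, which is not one of the flavors named above and calls this option re.DOTALL (or re.S). The sample text is made up for the example; treat this as a sketch of the behavior rather than a reference.

```python
import re

text = "first line\nsecond line"

# Default: the dot stops at the newline, so the match covers only line one.
default_match = re.search(r"first.*", text)

# re.DOTALL is Python's name for single-line mode: the dot now matches "\n" too.
dotall_match = re.search(r"first.*", text, re.DOTALL)

# The flag changes only the dot. The $ anchor still refuses to match in the
# middle of the string; that is controlled by the separate re.MULTILINE option.
anchored = re.search(r"^first line$", text, re.DOTALL)

print(repr(default_match.group()))
print(repr(dotall_match.group()))
print(anchored)
```

The first search stops at the end of the first line, the second runs to the end of the text, and the third finds nothing because the anchors are unaffected by the flag.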
However, some languages, such as VBScript and JavaScript before ECMAScript 2018 (which added the s flag), have no built-in option to make the dot match newlines. In such cases, you can use a character class like [\s\S] to achieve the same effect. This class matches any character that is either whitespace or non-whitespace, which is to say any character at all.
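The workaround is not tied to any one language. A minimal sketch in Python, used here only for illustration and deliberately without any flags, compares the dot with the [\s\S] class on a made-up multi-line string:

```python
import re

text = "<p>\nsome text\n</p>"

# No flags: the dot cannot cross the line breaks, so the search fails.
with_dot = re.search(r"<p>.*</p>", text)

# [\s\S] means "whitespace or non-whitespace", so it also matches "\n".
with_class = re.search(r"<p>[\s\S]*</p>", text)

print(with_dot)
print(repr(with_class.group()))
```

Because the class spells out "any character" explicitly, the pattern behaves the same whether or not a dot-matches-newline option is available or enabled.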
Use The Dot Sparingly
The dot is a powerful metacharacter that can make your regex very flexible. However, it can also lead to unintended matches if not used carefully. It is easy to write a regex with a dot and find that it matches more than you intended.
Consider the following example:
If you want to match a date in mm/dd/yy format, you might start with the regex:
\d\d.\d\d.\d\d
This regex appears to work at first glance, as it matches "02/12/03". However, it also matches "02512703", where the dots match digits instead of separators.
A better solution is to use a character class to specify valid date separators:
\d\d[- /.]\d\d[- /.]\d\d
This regex matches dates with dashes, spaces, dots, or slashes as separators. Note that the dot inside a character class is treated as a literal character, so it does not need to be escaped.
This regex is still not perfect, as it will match "99/99/99". To improve it further, you can use:
[0-1]\d[- /.][0-3]\d[- /.]\d\d
This regex restricts the first digit of the month to 0 or 1 and the first digit of the day to 0 through 3. That rules out "99/99/99", but it still accepts impossible dates such as "19/39/99". How perfect your regex needs to be depends on your use case. If you are validating user input, the regex must be precise. If you are parsing data files from a known source, a less strict regex might be sufficient.
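To see how the three date patterns differ in practice, the following Python sketch runs each of them against a few sample strings. The variable names and the use of fullmatch (which requires the whole string to match) are choices made for this illustration, not part of the tutorial's regexes.

```python
import re

# The three patterns from the text, from loosest to strictest.
patterns = {
    "dot": re.compile(r"\d\d.\d\d.\d\d"),
    "separators": re.compile(r"\d\d[- /.]\d\d[- /.]\d\d"),
    "ranged": re.compile(r"[0-1]\d[- /.][0-3]\d[- /.]\d\d"),
}

samples = ["02/12/03", "02512703", "99/99/99", "19/39/99"]

# results[sample][name] is True when the whole sample matches that pattern.
results = {
    s: {name: bool(p.fullmatch(s)) for name, p in patterns.items()}
    for s in samples
}

for s, outcome in results.items():
    print(s, outcome)
```

Only the dot version accepts "02512703". The separator version still accepts "99/99/99", and even the ranged version lets "19/39/99" through, since it checks only the first digit of each part.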
Use Negated Character Sets Instead of the Dot
Using the dot can sometimes result in overly broad matches. Instead, consider using negated character sets to specify what characters you do not want to match.
For example, to match a double-quoted string, you might be tempted to use:
".*"
At first, this regex seems to work well, matching "string" in:
Put a "string" between double quotes.
However, if you apply it to:
Houston, we have a problem with "string one" and "string two". Please respond.
The regex will match:
"string one" and "string two"
This is not what you intended. The dot matches any character, including a double quote, and the star (*) quantifier is greedy: it repeats the dot as many times as possible. The match therefore runs from the first double quote to the last one on the line.
To fix this, use a negated character set instead of the dot:
"[^"]*"
This regex matches an opening double quote, any run of characters that are not double quotes, and a closing double quote. Because a negated character set also matches newline characters, the match can span several lines. To prevent that, use:
"[^"\r\n]*"
This regex ensures that the match does not include newline characters.
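The difference between the three quoted-string patterns can be checked with a short Python sketch. The first sample sentence is the one from the text. The second, multi-line sample is invented here to show what excluding newlines changes.

```python
import re

text = 'Houston, we have a problem with "string one" and "string two". Please respond.'

# Greedy dot: one match from the first quote to the last quote on the line.
greedy = re.findall(r'".*"', text)

# Negated set: each match stops at the next double quote.
negated = re.findall(r'"[^"]*"', text)

# Hypothetical input with an unclosed quote on the first line.
multiline = 'an "unclosed quote\nand a "closed one" here'

# [^"] also matches "\n", so this match spills onto the second line.
crosses_lines = re.findall(r'"[^"]*"', multiline)

# Excluding \r and \n keeps every match on a single line.
same_line = re.findall(r'"[^"\r\n]*"', multiline)

print(greedy)
print(negated)
print(crosses_lines)
print(same_line)
```

With the newline-excluding version, the unclosed quote on the first line is skipped and only "closed one" is found.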
By using negated character sets instead of the dot, you can make your regex patterns more precise and avoid unintended matches.