Repetition with Star and Plus
In addition to the question mark, regex provides two more repetition operators: the asterisk (*) and the plus (+).
Basic Usage
The * (star) matches the preceding token zero or more times. The + (plus) matches the preceding token one or more times.
For example:
<[A-Za-z][A-Za-z0-9]*>
This pattern matches HTML tags without attributes:
<[A-Za-z] matches the opening bracket and the first letter.
[A-Za-z0-9]* matches zero or more alphanumeric characters after the first letter.
This regex will match tags like:
<B>
<HTML>
If you used + instead of *, the regex would require at least one alphanumeric character after the first letter, so it would match <HTML> but not <B>. Neither version matches <1>, because the first character class only accepts a letter.
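The difference between the two quantifiers can be checked with a short script. This is a minimal sketch using Python's re module; the sample tag list and the variable names are illustrative assumptions, not part of the original text.

```python
import re

# Same pattern twice: once with * (zero or more), once with + (one or more).
star = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")
plus = re.compile(r"<[A-Za-z][A-Za-z0-9]+>")

# Hypothetical sample tags chosen for illustration.
tags = ["<B>", "<HTML>", "<H1>", "<1>"]

# fullmatch requires the whole string to be matched by the pattern.
star_matches = [t for t in tags if star.fullmatch(t)]
plus_matches = [t for t in tags if plus.fullmatch(t)]

print(star_matches)  # ['<B>', '<HTML>', '<H1>']
print(plus_matches)  # ['<HTML>', '<H1>']
```

With the star, a single-letter tag such as <B> is accepted because the second character class may match nothing. With the plus, at least one more character is required, so <B> is rejected.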
Limiting Repetition
Modern regex flavors allow you to limit repetitions using curly braces ({}).
Syntax:
{min,max}
min: Minimum number of matches.
max: Maximum number of matches.
Examples:
{0,} is equivalent to *.
{1,} is equivalent to +.
{3} matches exactly three repetitions.
Example:
\b[1-9][0-9]{3}\b
This pattern matches numbers between 1000 and 9999.
\b[1-9][0-9]{2,4}\b
This pattern matches numbers between 100 and 99999.
The word boundaries (\b) ensure that only complete numbers are matched.
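To see the two bounded patterns in action, here is a small sketch in Python. The sample text is an assumption made up for illustration; it includes numbers just inside and just outside each range, plus one with a leading zero.

```python
import re

# Hypothetical sample text with values around the range limits.
text = "Values: 42, 100, 1000, 9999, 12345, 99999, 100000, 0123"

# Exactly four digits, first digit non-zero: 1000 to 9999.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", text)

# Three to five digits, first digit non-zero: 100 to 99999.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", text)

print(four_digits)    # ['1000', '9999']
print(three_to_five)  # ['100', '1000', '9999', '12345', '99999']
```

Note that 100000 and 0123 are rejected in both cases: the word boundaries prevent the pattern from matching only part of a longer run of digits.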
Watch Out for Greediness!
All repetition operators (*, +, and {}) are greedy by default. This means the regex engine will try to match as much text as possible.
Example:
Consider the pattern:
<.+>
When applied to the string:
This is a <EM>first</EM> test.
You might expect it to match <EM> and </EM> separately. However, it will match <EM>first</EM> instead.
This happens because the + is greedy and matches as many characters as possible.
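A quick check in Python confirms the greedy result. This is only a sketch; the variable names are illustrative.

```python
import re

text = "This is a <EM>first</EM> test."

# The greedy + makes the dot run to the last > on the line,
# so the whole span comes back as a single match.
greedy_matches = re.findall(r"<.+>", text)

print(greedy_matches)  # ['<EM>first</EM>']
```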
Looking Inside the Regex Engine
The first token in the regex is <, which matches the first < in the string.
The next token is the . (dot), which matches any character except newlines. The + causes the dot to repeat as many times as possible:
The dot matches E, then M, and so on. It continues matching until the end of the string.
At this point, the > token fails to match because there are no more characters left.
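The engine then backtracks: it makes the dot give up one character at a time, starting from the end of the string, and retries the > token after each step. The first position where > can match is the > that closes </EM>.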
The final match is <EM>first</EM>.
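The walkthrough above can be mimicked in a few lines of Python. This is a deliberately simplified model, not how any real engine is implemented: the function name greedy_angle_match is hypothetical, it only tries the first < in the text, and it assumes the text contains no newlines.

```python
import re

def greedy_angle_match(text):
    """Simplified model of a backtracking engine applying <.+> to text."""
    start = text.find("<")  # the literal < token
    if start == -1:
        return None
    # The greedy .+ first consumes everything up to the end of the string.
    # 'end' is the position where the > token is then attempted.
    # Stepping 'end' backwards models the dot giving up one character
    # at a time; the loop stops while the dot still holds one character.
    for end in range(len(text), start + 1, -1):
        if end < len(text) and text[end] == ">":
            return text[start:end + 1]
    return None

text = "This is a <EM>first</EM> test."
simulated = greedy_angle_match(text)
actual = re.search(r"<.+>", text).group()

print(simulated)            # <EM>first</EM>
print(simulated == actual)  # True
```

The first attempt at > fails at the end of the string, and each backtracking step moves one character to the left until the > of </EM> is reached.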
Laziness Instead of Greediness
To fix this issue, make the quantifier lazy by adding a question mark (?) after it:
<.+?>
This tells the engine to match as few characters as possible.
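The < matches the first <.
The . matches E.
The engine checks for >, which fails against M, so the dot expands to match M as well.
The engine checks for > again and finds a match right after EM.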
The final match is <EM>, which is what we intended.
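The lazy version can be verified the same way. A minimal sketch in Python, with illustrative variable names:

```python
import re

text = "This is a <EM>first</EM> test."

# The lazy +? expands the dot only as far as needed for > to match,
# so each tag is returned as its own match.
lazy_matches = re.findall(r"<.+?>", text)

print(lazy_matches)  # ['<EM>', '</EM>']
```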
An Alternative to Laziness
Instead of using lazy quantifiers, you can use a negated character class:
<[^>]+>
This pattern matches <, then one or more characters that are not >, then the closing >. Because the character class can never run past the closing >, the engine does not need to backtrack to find it.
Example:
Given the string:
This is a <EM>first</EM> test.
The regex <[^>]+> will match:
<EM>
</EM>
This approach is more efficient because it reduces backtracking, which can significantly improve performance in large datasets or tight loops.
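The following Python sketch checks the negated character class against the same string. It also shows one behavioral difference from the lazy dot that is worth knowing: a negated class matches newlines, while the dot by default does not. The multi-line sample string is an assumption added for illustration.

```python
import re

text = "This is a <EM>first</EM> test."

# [^>]+ stops at the first >, so each tag is matched on its own.
negated_matches = re.findall(r"<[^>]+>", text)
print(negated_matches)  # ['<EM>', '</EM>']

# Hypothetical tag split across two lines.
multiline = "<A\nHREF>"

# The negated class accepts the newline; the dot does not.
negated_multiline = re.findall(r"<[^>]+>", multiline)
lazy_multiline = re.findall(r"<.+?>", multiline)

print(negated_multiline)  # ['<A\nHREF>']
print(lazy_multiline)     # []
```

On single-line input the two approaches return the same matches here, so the choice between them mostly comes down to performance and to how tags spanning several lines should be handled.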
The *, +, and {} quantifiers control repetition in regex. They are greedy by default, but you can make them lazy by adding a question mark (?). Using negated character classes is another way to handle repetition efficiently without backtracking.