Repetition with Star and Plus (Page 13)
In addition to the question mark, regex provides two more repetition operators: the asterisk (*) and the plus (+).
Basic Usage
The * (star) matches the preceding token zero or more times. The + (plus) matches the preceding token one or more times.
For example:
<[A-Za-z][A-Za-z0-9]*>
This pattern matches HTML tags without attributes:
- <[A-Za-z] matches the opening angle bracket and the first letter.
- [A-Za-z0-9]* matches zero or more alphanumeric characters after the first letter.
- > matches the closing angle bracket.
This regex will match tags like:
- <B>
- <HTML>
If you used + instead of *, the regex would require at least one alphanumeric character after the first letter, so it would still match <HTML> but no longer match <B>. Neither version matches <1>, because the first character after < must be a letter.
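The difference between the two quantifiers can be checked with a short script. This is a minimal sketch using Python's re module; the sample list and variable names are made up for illustration and are not part of the original tutorial.

```python
import re

# Star version: a letter followed by zero or more alphanumerics.
star = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")
# Plus version: a letter followed by at least one alphanumeric.
plus = re.compile(r"<[A-Za-z][A-Za-z0-9]+>")

# Illustrative inputs (assumed for this example).
samples = ["<B>", "<HTML>", "<H1>", "<1>"]

# fullmatch requires the whole string to match the pattern.
star_hits = [s for s in samples if star.fullmatch(s)]
plus_hits = [s for s in samples if plus.fullmatch(s)]

print(star_hits)  # ['<B>', '<HTML>', '<H1>']
print(plus_hits)  # ['<HTML>', '<H1>']
```

Single-letter tags such as <B> drop out with the plus version, and <1> is rejected by both because the first character must be a letter.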
Limiting Repetition
Modern regex flavors allow you to limit repetitions using curly braces ({}).
Syntax:
{min,max}
- min: Minimum number of matches.
- max: Maximum number of matches.
Examples:
- {0,} is equivalent to *.
- {1,} is equivalent to +.
- {3} matches exactly three repetitions.
Example:
\b[1-9][0-9]{3}\b
This pattern matches numbers between 1000 and 9999.
\b[1-9][0-9]{2,4}\b
This pattern matches numbers between 100 and 99999.
The word boundaries (\b) ensure that only complete numbers are matched.
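To see the bounded quantifiers and the word boundaries at work, here is a small sketch in Python. The sample sentence is invented for illustration; only the two patterns come from the text above.

```python
import re

# Illustrative text (assumed): numbers of different lengths, one with a leading zero.
text = "Order 42 shipped 1000 units, 99999 in stock, 0123 rejected, 123456 total."

# Exactly four digits, first digit 1-9.
four_digit = re.findall(r"\b[1-9][0-9]{3}\b", text)
# Three to five digits, first digit 1-9.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", text)

print(four_digit)     # ['1000']
print(three_to_five)  # ['1000', '99999']

# {1,} behaves like + on this input.
same_as_plus = re.findall(r"[0-9]{1,}", text) == re.findall(r"[0-9]+", text)
print(same_as_plus)   # True
```

Without \b, the four-digit pattern would also pick up 1234 from inside 123456, which is why the boundaries matter.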
Watch Out for Greediness!
All repetition operators (*, +, and {}) are greedy by default. This means the regex engine will try to match as much text as possible.
Example:
Consider the pattern:
<.+>
When applied to the string:
This is a <EM>first</EM> test.
You might expect it to match <EM> and </EM> separately. However, it will match <EM>first</EM> instead.
This happens because the + is greedy and matches as many characters as possible.
Looking Inside the Regex Engine
The first token in the regex is <, which matches the first < in the string.
The next token is the . (dot), which matches any character except newlines. The + causes the dot to repeat as many times as possible:
- The dot matches E, then M, and so on.
- It continues matching until the end of the string.
- At this point, the > token fails to match because there are no more characters left.
The engine then backtracks, giving up one character at a time from the end of the match, until > can match the next character. That first happens at the > that closes </EM>.
The final match is <EM>first</EM>.
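The greedy result described in this walkthrough can be reproduced in Python. This is a minimal sketch, assuming the standard re module; the variable names are illustrative.

```python
import re

text = "This is a <EM>first</EM> test."

# Greedy: the dot runs to the end of the line, then backtracks to the last '>'.
greedy = re.search(r"<.+>", text)

print(greedy.group())  # <EM>first</EM>

# The match ends at the final '>' in the string, not the first one.
ends_at_last_bracket = greedy.end() - 1 == text.rfind(">")
print(ends_at_last_bracket)  # True
```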
Laziness Instead of Greediness
To fix this issue, make the quantifier lazy by adding a question mark (?):
<.+?>
This tells the engine to match as few characters as possible.
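- The < matches the first <.
- The . matches E, the single character that + requires as a minimum.
- The engine checks for >, which fails against M, so the dot expands by one character and matches M.
- The engine checks for > again and finds a match right after EM.
The final match is <EM>, which is what we intended.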
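A quick way to confirm the lazy behaviour is to list every match in the sample string. This sketch uses Python's re module; the variable names are chosen for illustration.

```python
import re

text = "This is a <EM>first</EM> test."

# Lazy: the dot expands one character at a time and stops at the first '>'.
lazy_matches = re.findall(r"<.+?>", text)

print(lazy_matches)  # ['<EM>', '</EM>']
```

With the lazy quantifier, each tag is found as a separate match instead of one match spanning both.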
An Alternative to Laziness
Instead of using lazy quantifiers, you can use a negated character class:
<[^>]+>
This pattern matches <, then one or more characters that are not >, followed by >. Because [^>] can never consume the closing >, the engine does not have to backtrack on input like this, which improves performance.
Example:
Given the string:
This is a <EM>first</EM> test.
The regex <[^>]+> will match:
- <EM>
- </EM>
This approach is more efficient because it reduces backtracking, which can significantly improve performance in large datasets or tight loops.
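The following sketch runs the negated character class next to the lazy pattern on the same string, using Python's re module. It only checks that the two patterns agree on this input; it does not measure speed, and the variable names are illustrative.

```python
import re

text = "This is a <EM>first</EM> test."

# Negated class: [^>] stops at the first '>' on its own.
tags = re.findall(r"<[^>]+>", text)
# Lazy dot, for comparison.
lazy_tags = re.findall(r"<.+?>", text)

print(tags)               # ['<EM>', '</EM>']
print(tags == lazy_tags)  # True
```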
Summary
The *, +, and {} quantifiers control repetition in regex. They are greedy by default, but you can make them lazy by adding a question mark (?). Using negated character classes is another way to handle repetition efficiently without backtracking.