Repetition with Star and Plus (Page 13)

https://codenamejessica.com/blogs/entry/104-repetition-with-star-and-plus-page-13/

In addition to the question mark, regex provides two more repetition operators: the asterisk (*) and the plus (+).

Basic Usage

The * (star) matches the preceding token zero or more times. The + (plus) matches the preceding token one or more times.

For example:

<[A-Za-z][A-Za-z0-9]*>

This pattern matches HTML tags without attributes:

<[A-Za-z] matches the first letter.
[A-Za-z0-9]* matches zero or more alphanumeric characters after the first letter.

This regex will match tags like <B> and <HTML>.

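If you used + instead of *, the regex would require at least one alphanumeric character after the first letter, so it would match <HTML> but not a single-letter tag such as <B>. Neither version matches <1>, because the first character inside the angle brackets must be a letter.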
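
The difference between the two quantifiers can be checked with a short Python sketch using the standard re module. The sample strings and variable names here are made up for illustration and do not come from the original text.

```python
import re

# Star: the first letter may be followed by zero or more alphanumerics.
star = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")
# Plus: at least one alphanumeric must follow the first letter.
plus = re.compile(r"<[A-Za-z][A-Za-z0-9]+>")

# Hypothetical sample tags, chosen to cover each case.
samples = ["<B>", "<HTML>", "<H1>", "<1>"]

star_matches = [s for s in samples if star.fullmatch(s)]
plus_matches = [s for s in samples if plus.fullmatch(s)]

print(star_matches)  # ['<B>', '<HTML>', '<H1>']
print(plus_matches)  # ['<HTML>', '<H1>']
```

Only the single-letter tag separates the two patterns. <1> is rejected by both because the first character class accepts letters only.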

Limiting Repetition

Modern regex flavors allow you to limit repetitions using curly braces ({}).

Syntax:

{min,max}

min: Minimum number of matches.
max: Maximum number of matches. If the comma is present but max is omitted, there is no upper limit.

Examples:

{0,} is equivalent to *.
{1,} is equivalent to +.
{3} matches exactly three repetitions.
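
As a rough check of these equivalences, the following Python sketch compares each curly-brace form with its shorthand on a few sample strings. The strings are arbitrary test inputs, and agreement on a handful of samples illustrates the equivalence rather than proving it.

```python
import re

# Arbitrary sample inputs: empty, one, three, and four repetitions of "a".
samples = ["", "a", "aaa", "aaaa"]

for s in samples:
    # {0,} accepts the same strings as *
    assert bool(re.fullmatch(r"a{0,}", s)) == bool(re.fullmatch(r"a*", s))
    # {1,} accepts the same strings as +
    assert bool(re.fullmatch(r"a{1,}", s)) == bool(re.fullmatch(r"a+", s))

# {3} accepts exactly three repetitions and nothing else.
exactly_three = [s for s in samples if re.fullmatch(r"a{3}", s)]
print(exactly_three)  # ['aaa']
```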

Example:

\b[1-9][0-9]{3}\b

This pattern matches numbers between 1000 and 9999.

\b[1-9][0-9]{2,4}\b

This pattern matches numbers between 100 and 99999.

The word boundaries (\b) ensure that only complete numbers are matched.
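
The two patterns above can be tried out in Python as follows. This is a minimal sketch; the input string is an invented test case that includes numbers just inside and just outside each range.

```python
import re

four_digits = re.compile(r"\b[1-9][0-9]{3}\b")
three_to_five = re.compile(r"\b[1-9][0-9]{2,4}\b")

# Invented test input with values around the edges of each range.
text = "42 100 999 1000 9999 10000 99999 100000 0123"

four_digit_hits = four_digits.findall(text)
three_to_five_hits = three_to_five.findall(text)

print(four_digit_hits)     # ['1000', '9999']
print(three_to_five_hits)  # ['100', '999', '1000', '9999', '10000', '99999']
```

Note that 100000 and 0123 are skipped entirely: the word boundaries prevent the patterns from matching a slice of a longer run of digits.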

Watch Out for Greediness!

All repetition operators (*, +, and {}) are greedy by default. This means the regex engine will try to match as much text as possible.

Example:

Consider the pattern:

<.+>

When applied to the string:

This is a <EM>first</EM> test.

You might expect it to match <EM> and </EM> separately. However, it will match <EM>first</EM> instead.

This happens because the + is greedy and matches as many characters as possible.

Looking Inside the Regex Engine

The first token in the regex is <, which matches the first < in the string.

The next token is the . (dot), which matches any character except newlines. The + causes the dot to repeat as many times as possible:

The dot matches E, then M, and so on.
It continues matching until the end of the string.
At this point, the > token fails to match because there are no more characters left.

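The engine then backtracks: the dot gives back one character at a time, and after each step the engine tries the > token again. This succeeds once the dot has given back the > that closes </EM>.

The final match is <EM>first</EM>.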
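
The greedy behavior described above can be reproduced with Python's re module. This is a small sketch for illustration; the variable names are not from the original text.

```python
import re

text = "This is a <EM>first</EM> test."

# Greedy plus: the dot first runs to the end of the string, then the
# engine backtracks to the last > it can find.
greedy = re.search(r"<.+>", text)
greedy_match = greedy.group()

print(greedy_match)  # <EM>first</EM>
```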

Laziness Instead of Greediness

To fix this issue, make the quantifier lazy by adding a question mark (?):

<.+?>

This tells the engine to match as few characters as possible.

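The < matches the first <.
The . matches E, the single character that + requires.
The engine checks for >, but the next character is M, so the dot expands to match M as well.
The engine checks for > again and finds a match right after EM.

The final match is <EM>, which is what we intended.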
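
A short Python sketch of the lazy version, using the same sample string as before. The variable names are illustrative only.

```python
import re

text = "This is a <EM>first</EM> test."

# Lazy plus: the dot matches as few characters as possible before >.
lazy_first = re.search(r"<.+?>", text).group()
lazy_all = re.findall(r"<.+?>", text)

print(lazy_first)  # <EM>
print(lazy_all)    # ['<EM>', '</EM>']
```

When the search continues after the first match, the closing tag is found as its own, separate match.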

An Alternative to Laziness

Instead of using lazy quantifiers, you can use a negated character class:

<[^>]+>

This pattern matches <, then one or more characters that are not >, then >. Because the character class can never run past the closing >, the engine does not need to backtrack to find it, which improves performance.

Example:

Given the string:

This is a <EM>first</EM> test.

The regex <[^>]+> will match <EM> and </EM> as two separate matches.

This approach is more efficient because it reduces backtracking, which can significantly improve performance in large datasets or tight loops.
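
The negated character class can be tried in Python in the same way. This sketch only shows that the pattern finds the two tags in the sample string; it does not measure performance.

```python
import re

text = "This is a <EM>first</EM> test."

# [^>]+ consumes everything up to, but not including, the next >.
negated_all = re.findall(r"<[^>]+>", text)

print(negated_all)  # ['<EM>', '</EM>']
```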

The *, +, and {} quantifiers control repetition in regex. They are greedy by default, but you can make them lazy by adding a question mark (?). Using negated character classes is another way to handle repetition efficiently without backtracking.
