Repetition with Star and Plus (Page 13)


In addition to the question mark, regex provides two more repetition operators: the asterisk (*) and the plus (+).


Basic Usage

The * (star) matches the preceding token zero or more times. The + (plus) matches the preceding token one or more times.

For example:

<[A-Za-z][A-Za-z0-9]*>

This pattern matches HTML tags without attributes:

  • <[A-Za-z] matches the opening < and the first letter of the tag name.
  • [A-Za-z0-9]* matches zero or more alphanumeric characters after the first letter.
  • > matches the closing bracket.

This regex will match tags like:

  • <B>
  • <HTML>

If you used + instead of *, the regex would require at least one alphanumeric character after the first letter, so it would match:

  • <HTML> but not <B>.
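A minimal sketch of the difference between the two quantifiers, using Python's re module. The sample strings and the variable names are illustrative assumptions, not part of the original text.

```python
import re

# Same pattern, differing only in the quantifier after the character class.
star = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")
plus = re.compile(r"<[A-Za-z][A-Za-z0-9]+>")

samples = ["<B>", "<HTML>", "<H1>", "<1>"]

# fullmatch() requires the whole sample to match the pattern.
star_hits = [s for s in samples if star.fullmatch(s)]
plus_hits = [s for s in samples if plus.fullmatch(s)]

print(star_hits)  # ['<B>', '<HTML>', '<H1>']
print(plus_hits)  # ['<HTML>', '<H1>']
```

With +, single-letter tags such as <B> are rejected because nothing follows the first letter. <1> is rejected by both patterns because the first character after < must be a letter.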

Limiting Repetition

Modern regex flavors allow you to limit repetitions using curly braces ({}).

Syntax:

{min,max}
  • min: Minimum number of matches.
  • max: Maximum number of matches.

Examples:

  • {0,} is equivalent to *.
  • {1,} is equivalent to +.
  • {3} matches exactly three repetitions.

Example:

\b[1-9][0-9]{3}\b

This pattern matches numbers between 1000 and 9999.

\b[1-9][0-9]{2,4}\b

This pattern matches numbers between 100 and 99999.

The word boundaries (\b) ensure that only complete numbers are matched.
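The two patterns can be tried out with a short Python sketch. The sample text is an assumption chosen to include values just inside and just outside each range.

```python
import re

text = "7 42 100 999 1000 9999 12345 99999 100000 0123"

# Exactly four digits, the first of which is not zero.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", text)

# Three to five digits, the first of which is not zero.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", text)

print(four_digits)    # ['1000', '9999']
print(three_to_five)  # ['100', '999', '1000', '9999', '12345', '99999']
```

Note that 100000 and 0123 are skipped entirely: the word boundaries prevent the engine from matching only part of a longer run of digits.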


Watch Out for Greediness!

All repetition operators (*, +, and {}) are greedy by default. This means the regex engine will try to match as much text as possible.

Example:

Consider the pattern:

<.+>

When applied to the string:

This is a <EM>first</EM> test.

You might expect it to match <EM> and </EM> separately. However, it will match <EM>first</EM> instead.

This happens because the + is greedy and matches as many characters as possible.
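A quick way to see the greedy behavior, sketched here with Python's re module (the variable names are illustrative):

```python
import re

text = "This is a <EM>first</EM> test."

# The greedy + lets the dot run as far as it can before > is tried.
greedy = re.search(r"<.+>", text)

print(greedy.group())  # <EM>first</EM>
```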


Looking Inside the Regex Engine

The first token in the regex is <, which matches the first < in the string.

The next token is the . (dot), which matches any character except newlines. The + causes the dot to repeat as many times as possible:

  1. The dot matches E, then M, and so on.
  2. It continues matching until the end of the line, which here is also the end of the string.
  3. At this point, the > token fails to match because there are no more characters left.

The engine then backtracks: the dot gives up one character at a time, and > is retried after each step. The first position where > succeeds is the > that closes </EM>.

The final match is <EM>first</EM>.
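The walkthrough above can be imitated with a deliberately simplified simulation. This is only a sketch of the idea for this one pattern, not how any real regex engine is implemented; the function name greedy_match is hypothetical.

```python
import re

def greedy_match(text):
    """Imitate a backtracking engine on the pattern <.+> (simplified sketch)."""
    start = text.find("<")
    while start != -1:
        # The greedy dot first consumes everything up to the end of the line.
        line_end = text.find("\n", start)
        if line_end == -1:
            line_end = len(text)
        end = line_end  # position where > is attempted
        # Backtrack: give up one character at a time until > matches.
        # .+ must keep at least one character, hence end > start + 1.
        while end > start + 1:
            if end < len(text) and text[end] == ">":
                return text[start:end + 1]
            end -= 1
        # No match from this <, so try the next one.
        start = text.find("<", start + 1)
    return None

text = "This is a <EM>first</EM> test."
simulated = greedy_match(text)
print(simulated)  # <EM>first</EM>
```

For this input the simulation agrees with re.search(r"<.+>", text): the > is tried first at the end of the string and then at each earlier position until it lands on the last > in the line.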


Laziness Instead of Greediness

To fix this issue, make the quantifier lazy by adding a question mark (?):

<.+?>

This tells the engine to match as few characters as possible.

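  1. The < matches the first <.
  2. The . matches E, the single character that + requires.
  3. The engine tries >, which fails against M, so the dot expands to match M.
  4. The engine tries > again, and this time it matches the > right after EM.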

The final match is <EM>, which is what we intended.
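A short Python sketch of the lazy version, using the same test string as before (variable names are illustrative):

```python
import re

text = "This is a <EM>first</EM> test."

# The lazy +? expands the dot one character at a time, stopping at the first >.
lazy = re.findall(r"<.+?>", text)

print(lazy)  # ['<EM>', '</EM>']
```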


An Alternative to Laziness

Instead of using lazy quantifiers, you can use a negated character class:

<[^>]+>

This pattern matches <, then one or more characters that are not >, then >. Because the character class cannot run past the closing >, the engine does not have to backtrack to find it.

Example:

Given the string:

This is a <EM>first</EM> test.

The regex <[^>]+> will match:

  • <EM>
  • </EM>

This approach is more efficient because it reduces backtracking, which can significantly improve performance in large datasets or tight loops.
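The negated character class can be checked the same way. The second test string below is an assumption added to show one behavioral difference from the dot.

```python
import re

text = "This is a <EM>first</EM> test."

# [^>]+ stops on its own just before the first >.
negated = re.findall(r"<[^>]+>", text)
print(negated)  # ['<EM>', '</EM>']

# Unlike the dot, a negated class also matches newlines by default.
multiline = "<EM\nid=x> text"
print(re.findall(r"<[^>]+>", multiline))  # ['<EM\nid=x>']
print(re.findall(r"<.+?>", multiline))    # []
```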


Summary

The *, +, and {} quantifiers control repetition in regex. They are greedy by default, but you can make them lazy by adding a question mark (?). Using a negated character class is another way to handle repetition, and it is usually more efficient because it needs little or no backtracking.
