Lookahead and lookbehind, often referred to collectively as "lookaround," are powerful constructs introduced in Perl 5 and supported by most modern regular expression engines. They are also known as zero-width assertions because they don’t consume characters in the input string. Instead, they simply assert whether a certain condition is true at a given position without including the matched text in the overall match result.
Lookaround constructs allow you to build more flexible and efficient regex patterns that would otherwise be lengthy or impossible to achieve using traditional methods.
What Are Zero-Width Assertions?
Zero-width assertions, like start (^) and end ($) anchors, match positions in a string rather than actual characters. The key difference is that lookaround assertions inspect the text ahead or behind a position to check if a certain pattern is possible, without moving the regex engine's position in the string.
For example, a positive lookahead ensures that a specific pattern follows a certain point, while a negative lookahead ensures that a certain pattern does not follow.
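As a rough illustration of "zero-width", the sketch below uses Python's built-in re module (an assumption; the article itself is flavor-neutral) to show that an anchor and a bare lookahead both produce empty matches at a position, while a lookahead attached to a token leaves only the token in the result. The variable names are illustrative.

```python
import re

text = "quit"

# The start anchor matches a position, not a character:
# the match is an empty span at index 0.
anchor_span = re.search(r"^", text).span()

# A bare lookahead also matches a position: the spot just before the "u".
look_span = re.search(r"(?=u)", text).span()

# Attached to a real token, only the token ends up in the match result.
q_text = re.search(r"q(?=u)", text).group()

print(anchor_span)  # (0, 0)
print(look_span)    # (1, 1)
print(q_text)       # q
```

Both spans have equal start and end indexes, which is what "does not consume characters" means in practice.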
Positive and Negative Lookahead
Lookahead assertions check what comes after a certain position in the string without including it in the match.
Positive Lookahead ((?=...))
A positive lookahead ensures that a particular sequence of characters follows the current position. For example, the regex q(?=u) matches the letter "q" only if it’s immediately followed by a "u," but it doesn’t include the "u" in the match result.
Negative Lookahead ((?!...))
A negative lookahead ensures that a specific sequence does not follow the current position. For instance, q(?!u) matches a "q" only if it’s not followed by a "u."
Here’s how the regex engine processes the negative lookahead q(?!u) when applied to different strings:
- For the string "Iraq", the regex matches the "q" because there’s no "u" immediately after it.
- For the string "quit", the regex does not match the "q" because it’s followed by a "u."
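The two lookahead forms can be checked side by side. This is a minimal sketch using Python's re module (chosen here for illustration); matched_text is a hypothetical helper, not part of any library.

```python
import re

positive = re.compile(r"q(?=u)")  # "q" only when a "u" follows
negative = re.compile(r"q(?!u)")  # "q" only when no "u" follows


def matched_text(pattern, text):
    """Return the matched substring, or None when there is no match."""
    m = pattern.search(text)
    return m.group() if m else None


for word in ("Iraq", "quit"):
    print(word, matched_text(positive, word), matched_text(negative, word))
# Iraq None q
# quit q None
```

In both successful cases the match is just "q": the lookahead only tests what follows and contributes nothing to the result.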
Positive and Negative Lookbehind
Lookbehind assertions work similarly but check what comes before the current position in the string.
Positive Lookbehind ((?<=...))
A positive lookbehind ensures that a specific pattern precedes the current position. For example, (?<=a)b matches the letter "b" only if it’s preceded by an "a."
- In the word "cab", the regex matches the "b" because it’s preceded by an "a."
- In the word "bed", the regex does not match the "b" because it is the first character of the word, so no "a" precedes it.
Negative Lookbehind ((?<!...))
A negative lookbehind ensures that a certain pattern does not precede the current position. For example, (?<!a)b matches a "b" only if it’s not preceded by an "a."
- In the word "bed", the regex matches the "b" because it’s not preceded by an "a."
- In the word "cab", the regex does not match the "b" because it is preceded by an "a."
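The same "cab" and "bed" cases can be reproduced with a short sketch in Python's re module (an illustrative choice of engine); match_span is a hypothetical helper that reports where the match was found.

```python
import re

preceded_by_a = re.compile(r"(?<=a)b")      # "b" with an "a" right before it
not_preceded_by_a = re.compile(r"(?<!a)b")  # "b" without an "a" right before it


def match_span(pattern, text):
    """Return the (start, end) span of the first match, or None."""
    m = pattern.search(text)
    return m.span() if m else None


print(match_span(preceded_by_a, "cab"))      # (2, 3)
print(match_span(preceded_by_a, "bed"))      # None
print(match_span(not_preceded_by_a, "bed"))  # (0, 1)
print(match_span(not_preceded_by_a, "cab"))  # None
```

The spans are one character wide in each successful case, confirming that the "a" checked by the lookbehind is not part of the match.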
Using Lookbehind for More Complex Patterns
Unlike lookahead, which allows any regular expression inside, lookbehind assertions are more limited in some regex flavors. Many engines require lookbehind patterns to have a fixed length because the regex engine needs to know exactly how far to step back in the string.
For example, the regex (?<=abc)d will match the "d" in the string "abcd". The lookbehind (?<=abc) is always exactly three characters long, so it satisfies the fixed-length requirement of engines like Python and Perl.
Some modern engines, such as Java and PCRE, allow lookbehind patterns of varying lengths, provided they have a finite maximum length. For example, (?<=a|ab|abc)d would be valid in these engines, as each alternative has a fixed length.
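To see the fixed-length rule in one concrete engine, the sketch below probes Python's built-in re module, which rejects lookbehind alternatives of different widths at compile time. The helper name compiles is hypothetical, and other engines will behave differently, as described above.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


fixed = r"(?<=abc)d"            # always three characters wide
varying = r"(?<=a|ab|abc)d"     # alternatives of width 1, 2 and 3
same_width = r"(?<=abc|xyz)d"   # alternatives of equal width

print(compiles(fixed))       # True
print(compiles(varying))     # False in Python's re
print(compiles(same_width))  # True

print(re.search(fixed, "abcd").group())  # d
```

Alternation itself is not the problem in this engine; only alternatives of unequal width are rejected.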
Lookaround in Practice: A Comparison
Consider the following two regex patterns for matching words that don’t end with "s":
- \b\w+(?<!s)\b
- \b\w+[^s]\b
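When applied to the word "John's", the first pattern matches "John", while the second matches "John'" (including the apostrophe). The difference is that [^s] must consume a character, and here it consumes the apostrophe, whereas the lookbehind only checks the last character that \w+ already matched. The first pattern is generally more accurate and easier to understand.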
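The comparison is easy to verify. The following is a small sketch in Python's re module (an illustrative choice of engine); the variable names are made up for the example.

```python
import re

with_lookbehind = re.compile(r"\b\w+(?<!s)\b")  # checks the last matched character
with_char_class = re.compile(r"\b\w+[^s]\b")    # consumes one extra non-"s" character

text = "John's"

lookbehind_result = with_lookbehind.search(text).group()
char_class_result = with_char_class.search(text).group()

print(repr(lookbehind_result))  # 'John'
print(repr(char_class_result))  # "John'"
```

The character class version picks up the apostrophe because [^s] accepts any character other than "s", not only word characters.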
Limitations of Lookbehind
Not all regex flavors support lookbehind. Older versions of JavaScript (before ECMAScript 2018) and Ruby (before 1.9) support lookahead but not lookbehind; current versions of both support it. Additionally, even in engines that support lookbehind, some limitations apply:
- Fixed-length requirement: Most regex flavors require lookbehind patterns to have a fixed length.
- No unbounded repetition: In flavors with a fixed-length requirement, you cannot use quantifiers like * or + inside lookbehind.
Only a few regex engines allow full regular expressions inside lookbehind, among them the JGsoft engine and the .NET framework.
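As one data point on the repetition limit, the sketch below checks how Python's re module (an illustrative engine, not one named in the list above) treats quantifiers inside lookbehind. In this engine the restriction is on variable width, so a fixed count such as {2} is accepted while * and + are rejected. The helper compile_error is hypothetical.

```python
import re


def compile_error(pattern):
    """Return None if the pattern compiles, else the error message."""
    try:
        re.compile(pattern)
        return None
    except re.error as exc:
        return str(exc)


for pattern in (r"(?<=a+)b", r"(?<=a*)b", r"(?<=a{2})b"):
    print(pattern, "->", compile_error(pattern) or "ok")

# The fixed-count version needs exactly two "a" characters before the "b".
print(re.search(r"(?<=a{2})b", "caab").group())  # b
print(re.search(r"(?<=a{2})b", "cab"))           # None
```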
The Atomic Nature of Lookaround
One important characteristic of lookaround assertions is that they are atomic. This means that once the lookaround condition is satisfied, the regex engine does not backtrack to try other possibilities inside the lookaround.
For example, consider the regex (?=(\d+))\w+\1 applied to the string "123x12":
- The lookahead (?=(\d+)) matches the digits "123" and captures them into \1.
- The \w+ token matches the entire string.
- The engine backtracks until \w+ matches only the "1" at the start of the string.
- The engine tries to match \1 but fails because it cannot find "123" again at any position.
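Since lookaround is atomic, the backtracking positions inside the lookahead were discarded as soon as it succeeded. The engine therefore cannot go back and let \d+ capture only "12", which would have allowed the rest of the pattern to succeed (\w+ matching "123x" and \1 matching the final "12").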
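However, if you apply the same regex to the string "456x56", it will match "56x56". The attempt at the first character fails as before, but the engine then retries from the second character, where the lookahead captures "56", \w+ backs off to "56x", and \1 matches the final "56".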
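Both outcomes can be reproduced in Python's re module, whose lookahead is also atomic. This is a minimal sketch with illustrative variable names, not a trace of the engine's internal steps.

```python
import re

pattern = re.compile(r"(?=(\d+))\w+\1")

# "123x12": every start position fails, because the captured digits
# ("123", "23", "3", "12", "2") never reappear after \w+.
no_match = pattern.search("123x12")

# "456x56": the attempt at index 0 fails, but at index 1 the lookahead
# captures "56", and \w+ can back off to "56x" so that \1 matches "56".
match = pattern.search("456x56")

print(no_match)        # None
print(match.group())   # 56x56
print(match.group(1))  # 56
print(match.start())   # 1
```

The start index of 1 shows that the successful match comes from a later start position, not from backtracking into the lookahead.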
Summary
Lookahead and lookbehind are essential tools for creating complex regex patterns. They allow you to assert conditions without consuming characters in the string.
Quick Reference for Lookaround Constructs:
Construct | Description | Example | Matches | Does Not Match |
---|---|---|---|---|
(?=...) | Positive Lookahead | q(?=u) | "quit" | "qit" |
(?!...) | Negative Lookahead | q(?!u) | "qit" | "quit" |
(?<=...) | Positive Lookbehind | (?<=a)b | "cab" | "bed" |
(?<!...) | Negative Lookbehind | (?<!a)b | "bed" | "cab" |
Use lookaround assertions carefully to optimize your regex patterns without accidentally excluding valid matches.