Alternation with the Vertical Bar or Pipe Symbol (Page 11)
Previously, we explored how character classes allow you to match a single character out of several possible options. Alternation, on the other hand, enables you to match one of several possible regular expressions.
The vertical bar or pipe symbol (`|`) is used for alternation. It acts as an OR operator within a regex.
Basic Syntax
To search for either "cat" or "dog," use the pattern:
cat|dog
You can add more options as needed:
cat|dog|mouse|fish
The regex engine will match any of these options. For example:
Regex | String | Matches |
---|---|---|
**`cat\|dog`** | "I have a cat and a dog" | "cat", then "dog" |
**`cat\|dog`** | "I have a fish" | No match |
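As a minimal sketch of the table above, assuming Python's built-in `re` module as the regex engine (the tutorial itself does not name one), the variable names here are illustrative only:

```python
import re

# Two alternatives, as in the table above.
two_way = re.compile(r"cat|dog")
# Four alternatives, as in the longer pattern.
four_way = re.compile(r"cat|dog|mouse|fish")

# findall returns every non-overlapping match, scanning left to right.
both_pets = two_way.findall("I have a cat and a dog")
no_pets = two_way.findall("I have a fish")
fish_only = four_way.findall("I have a fish")

print(both_pets)  # ['cat', 'dog']
print(no_pets)    # []
print(fish_only)  # ['fish']
```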
Precedence and Grouping
The alternation operator has the lowest precedence among all regex operators. This means everything to the left of the vertical bar forms one alternative, and everything to the right forms another. If you need to limit the scope of the alternation, use parentheses (`()`) to group expressions.
Example:
Without grouping:
\bcat|dog\b
This regex will match:
- A word boundary followed by "cat"
- "dog" followed by a word boundary
With grouping:
\b(cat|dog)\b
This regex will match:
- A word boundary, then either "cat" or "dog," followed by another word boundary.
Regex | String | Matches |
---|---|---|
**`\bcat\|dog\b`** | "I saw a catdog" | "cat" and "dog" |
**`\b(cat\|dog)\b`** | "I saw a catdog" | No match |
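The difference between the two patterns can be checked with a short sketch, again assuming Python's `re` module. Note that `findall` returns the contents of the capturing group when the pattern has one, which here is the same text as the whole match.

```python
import re

text = "I saw a catdog"

# Without grouping: "\bcat" OR "dog\b". Each half applies only one boundary.
ungrouped = re.findall(r"\bcat|dog\b", text)

# With grouping: both boundaries apply to whichever alternative matches.
grouped = re.findall(r"\b(cat|dog)\b", text)

# The grouped pattern still finds whole words.
whole_words = re.findall(r"\b(cat|dog)\b", "I saw a cat and a dog")

print(ungrouped)    # ['cat', 'dog']
print(grouped)      # []
print(whole_words)  # ['cat', 'dog']
```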
Understanding Regex Engine Behavior
Most regex engines are eager: at each position in the string they try the alternatives from left to right and stop as soon as one of them produces a valid match. The order of alternatives therefore matters.
Consider the pattern:
Get|GetValue|Set|SetValue
When applied to the string "SetValue," the engine will:
- Try to match `Get`, but fail.
- Try `GetValue`, but fail.
- Match `Set` and stop.
The result is that the engine matches "Set," but not "SetValue." This happens because the engine found a valid match early and stopped.
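The eager behavior described above can be reproduced with a minimal sketch, assuming Python's `re` module, which tries alternatives in the order they are written:

```python
import re

# Alternatives are tried left to right; "Set" succeeds before
# "SetValue" is ever attempted.
eager_match = re.search(r"Get|GetValue|Set|SetValue", "SetValue")

print(eager_match.group())  # Set
print(eager_match.span())   # (0, 3)
```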
Solutions to Eagerness
There are several ways to address this behavior:
1. Change the Order of Options
By changing the order of options, you can ensure longer matches are attempted first:
GetValue|Get|SetValue|Set
This way, "SetValue" will be matched before "Set."
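A short sketch of the reordering fix, assuming Python's `re` module. The second half builds the pattern from a list by sorting longer names first; that helper logic is an illustration and not part of the original tutorial.

```python
import re

# The reordered pattern, written by hand.
reordered = re.search(r"GetValue|Get|SetValue|Set", "SetValue").group()
print(reordered)  # SetValue

# Hypothetical helper logic: put the longest alternatives first so that
# a shorter prefix such as "Set" can never win over "SetValue".
names = ["Get", "GetValue", "Set", "SetValue"]
longest_first = sorted(names, key=len, reverse=True)
built = re.compile("|".join(re.escape(name) for name in longest_first))

print(built.pattern)                     # GetValue|SetValue|Get|Set
print(built.search("SetValue").group())  # SetValue
print(built.search("Set").group())       # Set
```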
2. Use Optional Groups
You can combine related options and use `?` to make parts of them optional:
Get(Value)?|Set(Value)?
Because `?` is greedy, the engine tries to match "Value" before it considers skipping it. This ensures "GetValue" is matched before "Get," and "SetValue" before "Set."
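The optional-group pattern can be tried out with a small sketch, assuming Python's `re` module. It uses `finditer` and `group(0)` so that the whole match is reported rather than the contents of the capturing groups.

```python
import re

optional = re.compile(r"Get(Value)?|Set(Value)?")

# The greedy "?" takes "Value" when it is present and skips it otherwise.
full_name = optional.search("SetValue").group(0)
short_name = optional.search("Set").group(0)
mixed = [m.group(0) for m in optional.finditer("Get SetValue")]

print(full_name)   # SetValue
print(short_name)  # Set
print(mixed)       # ['Get', 'SetValue']
```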
3. Use Word Boundaries
To ensure you match whole words only, use word boundaries:
\b(Get|GetValue|Set|SetValue)\b
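Here the engine still matches "Set" first, but the closing word boundary then fails because "V" follows. The engine backtracks and tries the remaining alternatives until "SetValue" succeeds. Alternatively, use: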
\b(Get(Value)?|Set(Value)?)\b
Or even better:
\b(Get|Set)(Value)?\b
This pattern is more efficient and concise.
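As a sketch under the assumption that Python's `re` module is the engine, the three word-boundary patterns can be compared side by side. All three accept the whole word "SetValue" and reject "SetValues", where no alternative ends at a word boundary.

```python
import re

patterns = [
    r"\b(Get|GetValue|Set|SetValue)\b",
    r"\b(Get(Value)?|Set(Value)?)\b",
    r"\b(Get|Set)(Value)?\b",
]

results = []
for pattern in patterns:
    whole = re.search(pattern, "SetValue")
    partial = re.search(pattern, "SetValues")
    # group(0) is the full match; None means the pattern found nothing.
    results.append((whole.group(0), partial))

for pattern, result in zip(patterns, results):
    print(pattern, result)
# Each result is ('SetValue', None).
```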
POSIX Regex Behavior
Unlike most regex engines, POSIX-compliant regex engines always return the longest possible match, regardless of the order of alternatives. In a POSIX engine, applying `Get|GetValue|Set|SetValue` to "SetValue" will return "SetValue," not "Set." This is because the POSIX standard requires the leftmost match and, among the matches that start at that position, the longest one.
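Python's `re` module is not a POSIX engine, so the following is only a rough emulation of the leftmost-longest rule for a flat list of alternatives. The function name `leftmost_longest` is hypothetical, and each alternative is still matched with Python's own semantics; a real POSIX engine applies the rule to the overall match.

```python
import re

def leftmost_longest(alternatives, text):
    """Return the longest alternative match at the leftmost position, or None."""
    compiled = [re.compile(alt) for alt in alternatives]
    for start in range(len(text) + 1):
        # Try every alternative at this position, not just the first that works.
        attempts = [p.match(text, start) for p in compiled]
        candidates = [m for m in attempts if m is not None]
        if candidates:
            # Among matches starting here, keep the one that ends furthest right.
            return max(candidates, key=lambda m: m.end()).group()
    return None

alternatives = ["Get", "GetValue", "Set", "SetValue"]
print(leftmost_longest(alternatives, "SetValue"))           # SetValue
print(leftmost_longest(alternatives, "call SetValue now"))  # SetValue
print(leftmost_longest(alternatives, "nothing here"))       # None
```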
Summary
Alternation is a powerful feature in regex that allows you to match one of several possible patterns. However, due to the eager behavior of most regex engines, it’s essential to order your alternatives carefully and use grouping to ensure accurate matches. By understanding how the engine processes alternation, you can write more effective and optimized regex patterns.