Start of String and End of String Anchors (Page 9)
In previous sections, we explored how literal characters and character classes operate in regular expressions. These match specific characters in a string. Anchors, however, are different. They match positions in the string rather than characters, allowing you to "anchor" your regex to the start or end of a string or line.
Using the Caret (`^`) Anchor
The caret (`^`) matches the position before the first character of the string. For example:
- `^a` applied to "abc" matches "a".
- `^b` does not match "abc" because "b" is not the first character of the string.
The caret is useful when you want to ensure that a match occurs at the very beginning of a string.
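As a rough illustration, the two caret cases can be tried in Python (a language chosen here for convenience; the article's own snippets use Perl and VB.NET). The variable names are made up for this sketch.

```python
import re

# re.search scans the whole string, so the caret is what pins the match
# to the start of the string.
caret_a = re.search(r"^a", "abc")   # "a" is the first character
caret_b = re.search(r"^b", "abc")   # "b" is present, but not at the start

print(caret_a is not None)  # True
print(caret_b is not None)  # False
```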
Example:
Regex | String | Matches |
---|---|---|
`^a` | "abc" | Yes |
`^b` | "abc" | No |
Using the Dollar Sign (`$`) Anchor
The dollar sign (`$`) matches the position after the last character of the string. For example:
- `c$` matches "c" in "abc".
- `a$` does not match "abc" because "a" is not the last character.
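The same check can be sketched for the dollar sign. This is a small Python illustration under the assumption that Python's `re` module is an acceptable stand-in for the flavors discussed; the variable names are illustrative only.

```python
import re

dollar_c = re.search(r"c$", "abc")  # "c" is the last character
dollar_a = re.search(r"a$", "abc")  # "a" is present, but not at the end

print(dollar_c.span())  # (2, 3)
print(dollar_a)         # None
```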
Example:
Regex | String | Matches |
---|---|---|
`c$` | "abc" | Yes |
`a$` | "abc" | No |
Practical Use Cases
Anchors are essential for validating user input. For instance, if you want to ensure a user inputs only an integer number, using `\d+` will accept any input containing digits, even if it includes letters (e.g., "abc123"). Instead, use `^\d+$` to enforce that the entire string consists only of digits from start to finish.
Example in Perl:
if ($input =~ /^\d+$/) {
print "Valid integer";
} else {
print "Invalid input";
}
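For comparison, here is a minimal Python sketch of the same validation. The helper name `is_integer` is hypothetical and not part of any library; it only wraps the anchored pattern from the text.

```python
import re

def is_integer(text: str) -> bool:
    """Return True if the anchored pattern ^\\d+$ matches (hypothetical helper)."""
    return re.search(r"^\d+$", text) is not None

# Without anchors, the pattern finds digits anywhere in the input.
unanchored = re.search(r"\d+", "abc123")

print(unanchored.group())    # 123
print(is_integer("abc123"))  # False
print(is_integer("123"))     # True
```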
To handle potential leading or trailing whitespace, use:
- `^\s+` to match leading whitespace.
- `\s+$` to match trailing whitespace.
In Perl, you can trim whitespace like this:
$input =~ s/^\s+|\s+$//g;
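A rough Python equivalent of this substitution is shown below. The function name `trim` is an assumption made for the example; in everyday Python code `str.strip()` would normally be used instead.

```python
import re

def trim(text: str) -> str:
    """Remove leading and trailing whitespace with the two anchored alternatives."""
    # re.sub replaces every match, which mirrors the /g flag in the Perl version.
    return re.sub(r"^\s+|\s+$", "", text)

print(repr(trim("  42 \n")))  # '42'
```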
Multi-Line Mode
If your string contains multiple lines, you might want to match the start or end of each line instead of the entire string. Multi-line mode changes the behavior of the anchors:
- `^` matches at the start of each line.
- `$` matches at the end of each line.
Example:
Given the string:
first line
second line
`^s` matches the "s" in "second line" when multi-line mode is enabled.
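The effect can be observed with a short Python sketch, where `re.MULTILINE` is the flag that plays the role of multi-line mode. The variable names are illustrative.

```python
import re

text = "first line\nsecond line"

# Without the flag, ^ matches only at the very start of the string.
without_flag = re.search(r"^s", text)
# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.search(r"^s", text, re.MULTILINE)

print(without_flag)       # None
print(with_flag.start())  # 11
```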
Activating Multi-Line Mode
In Perl, use the `m` flag:
m/^regex$/m;
In .NET, specify `RegexOptions.Multiline`:
Regex.Match("string", "regex", RegexOptions.Multiline);
In tools like EditPad Pro, GNU Emacs, and PowerGREP, multi-line mode is enabled by default.
Permanent Start and End Anchors
The anchors `\A`, `\Z`, and `\z` match the start and end of the string, regardless of multi-line mode:
- `\A`: Matches only at the start of the string.
- `\Z`: Matches at the end of the string, or just before a line break that ends the string.
- `\z`: Matches only at the very end of the string, after any trailing line break.
For example:
Regex | String | Matches |
---|---|---|
`\Aabc` | "abc" | Yes |
`abc\Z` | "abc\n" | Yes |
`abc\z` | "abc\n" | No |
Some regex flavors, like JavaScript, POSIX, and XML, do not support `\A` and `\Z`. In such cases, use the caret (`^`) and dollar sign (`$`) instead.
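To see that `\A` ignores multi-line mode, a small Python sketch can compare it with the caret. Python is used only as a convenient test bed here, and the sample text and variable names are made up.

```python
import re

text = "one\ntwo\nthree"

# In multi-line mode the caret matches at the start of every line...
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
# ...while \A still matches only at the start of the whole string.
string_start = re.findall(r"\A\w+", text, re.MULTILINE)

print(line_starts)   # ['one', 'two', 'three']
print(string_start)  # ['one']
```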
Zero-Length Matches
Anchors match positions rather than characters, resulting in zero-length matches. For example:
- `^` matches the start of a string.
- `$` matches the end of a string.
Example:
Using `^\d*$` to validate a number will accept an empty string. This happens because the star quantifier allows `\d*` to match zero digits, so `^` and `$` can both match at the same position in an empty string, producing a zero-length match.
To avoid this, require at least one digit by using the plus quantifier:
^\d+$
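The difference between the two quantifiers on an empty string can be checked with a short Python sketch (variable names are illustrative):

```python
import re

star = re.search(r"^\d*$", "")  # zero digits are allowed, so this matches
plus = re.search(r"^\d+$", "")  # at least one digit is required

print(star.span())  # (0, 0)
print(plus)         # None
```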
Adding a Prefix to Each Line
In some scenarios, you may want to add a prefix to each line of a multi-line string. For example, to prepend a "> " to each line in an email reply, use multi-line mode:
Example in VB.NET:
Dim Quoted As String = Regex.Replace(Original, "^", "> ", RegexOptions.Multiline)
This regex matches the start of each line and inserts the prefix "> " without removing any characters.
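A comparable sketch in Python, assuming Python 3.7 or later and a sample message invented for the example, looks like this:

```python
import re

original = "Hello,\nsee you soon."

# In multi-line mode, ^ matches the empty position at the start of each line,
# so the replacement text is inserted without consuming any characters.
quoted = re.sub(r"^", "> ", original, flags=re.MULTILINE)

print(quoted)
```

Note that this sample deliberately has no trailing line break; flavors differ on whether `^` matches after a final line break.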
Special Cases with Line Breaks
There is an exception to how `$` and `\Z` behave. If the string ends with a line break, `$` and `\Z` can match at the position before that line break, not only at the very end of the string.
For example:
- The string "joe\n" will match `^[a-z]+$` and `\A[a-z]+\Z`.
- However, `\A[a-z]+\z` will not match because `\z` requires the match to be at the very end of the string, including after the newline.
Use `\z` to ensure a match at the absolute end of the string.
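Flavors differ in how they spell these anchors, so it is worth testing in the language you actually use. The sketch below uses Python, where `$` follows the exception described above but `\Z` matches only at the very end of the string, which corresponds to the `\z` behavior in this article. The variable names are illustrative.

```python
import re

name = "joe\n"

# $ is allowed to match just before the final line break.
dollar = re.search(r"^[a-z]+$", name)
# In Python, \Z matches only at the very end (like \z in the text above).
strict = re.search(r"\A[a-z]+\Z", name)
# re.fullmatch requires the pattern to cover the entire string.
full = re.fullmatch(r"[a-z]+", name)

print(dollar.group())  # joe
print(strict)          # None
print(full)            # None
```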
Looking Inside the Regex Engine
Let's see what happens when we apply `^4$` to the string:
749
486
4
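In multi-line mode, the regex engine processes the string as follows:
- The engine starts at the first character, "7". The `^` matches the position before "7", but the literal `4` fails to match "7".
- The engine advances to the position before the "4" in "749". Here `^` cannot match because the position is preceded by "7" rather than a newline. The same happens at the following positions on the first line.
- After the first newline, `^` matches at the start of "486" and `4` matches "4", but `$` fails because the next character is "8" rather than a line break. The process continues until the engine reaches the final "4", which is preceded by a newline.
- The `^` matches the position before that "4", and the engine successfully matches `4`.
- The engine attempts to match `$` at the position after "4", and it succeeds because it is the end of the string.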
The regex engine reports the match as "4" at the end of the string.
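The outcome of this walkthrough can be confirmed with a brief Python sketch. It does not expose the engine's internal steps; it only shows where `^4` can start and where the full pattern finally matches. The variable names are illustrative.

```python
import re

text = "749\n486\n4"

# Offsets where ^4 matches on its own: the start of "486" and the final "4".
candidates = [m.start() for m in re.finditer(r"^4", text, re.MULTILINE)]
# Adding $ rules out "486", because "4" is followed by "8" there.
match = re.search(r"^4$", text, re.MULTILINE)

print(candidates)    # [4, 8]
print(match.span())  # (8, 9)
```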
Caution for Programmers
When working with anchors, be mindful of zero-length matches. For example, `$` can match the position after the last character of the string. Querying for `String[Regex.MatchPosition]` may result in an access violation or segmentation fault if the match position points to the void after the string. Handle these cases carefully in your code.
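As a hedged illustration in Python, where an out-of-range index raises `IndexError` rather than causing a segmentation fault, a bounds check before indexing might look like this (variable names are made up for the sketch):

```python
import re

text = "abc"

match = re.search(r"$", text)  # zero-length match at the end of the string
pos = match.start()            # equal to len(text), so text[pos] is out of range

# Guard the lookup instead of indexing blindly.
char_at_match = text[pos] if pos < len(text) else None

print(pos)            # 3
print(char_at_match)  # None
```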