-
Blog Entries
-
By Jessica Brown in Jessica Brown
Most regular expression engines discussed in this tutorial support the following four matching modes:
Modifier    Description
/i          Makes the regex case-insensitive.
/s          Enables "single-line mode," making the dot (.) match newlines.
/m          Enables "multi-line mode," allowing caret (^) and dollar ($) to match at the start and end of each line.
/x          Enables "free-spacing mode," where whitespace is ignored, and # can be used for comments.
Specifying Modes Inside The Regular Expression
You can specify these modes within a regex using mode modifiers. For example:
(?i) turns on case-insensitive matching.
(?s) enables single-line mode.
(?m) enables multi-line mode.
(?x) enables free-spacing mode.
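As a minimal sketch of the four mode modifiers listed above, the snippet below uses Python's standard re module. The sample text and variable names are illustrative assumptions, and other flavors may differ in which modifiers they accept and where.

```python
import re

text = "first line\nSECOND line"

# (?i): case-insensitive matching
case_insensitive = re.search(r"(?i)second", text)

# (?s): the dot also matches the newline between the two lines
dot_all = re.search(r"(?s)first.*SECOND", text)
dot_default = re.search(r"first.*SECOND", text)  # None: the dot stops at "\n"

# (?m): ^ matches at the start of every line, not only at the start of the text
line_starts = re.findall(r"(?m)^\w+", text)

# (?x): whitespace and # comments inside the pattern are ignored
date = re.search(
    r"""(?x)
    \d{4}   # year
    -
    \d{2}   # month
    """,
    "released 2025-01",
)

print(case_insensitive.group())  # SECOND
print(dot_all is not None)       # True
print(dot_default)               # None
print(line_starts)               # ['first', 'SECOND']
print(date.group())              # 2025-01
```

In Python, a modifier written as (?i), (?s), (?m), or (?x) has to appear at the very start of the pattern, which is why each example begins with it.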
Example:
(?i)hello matches "HELLO"
Turning Modes On and Off for Only Part of the Regex
Modern regex flavors allow you to apply modifiers to specific parts of the regex:
(?i-sm) turns on case-insensitive mode while turning off single-line and multi-line modes.
To apply a modifier to only a part of the regex, you can use the following syntax:
(?i)word(?-i)Word
This pattern makes "word" case-insensitive but "Word" case-sensitive.
Modifier Spans
Modifier spans apply modes to a specific section of the regex:
(?i:word) makes "word" case-insensitive.
(?i:case)(?-i:sensitive) applies mixed modes within the regex.
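A short sketch of modifier spans, assuming Python 3.6 or later, where the re module accepts the (?i:...) and (?-i:...) forms. The test strings are made up for illustration.

```python
import re

# First half ignores case, second half must match exactly as written
pattern = re.compile(r"(?i:ignorecase)(?-i:casesensitive)")

mixed_ok = pattern.fullmatch("IGNORECASEcasesensitive")
mixed_bad = pattern.fullmatch("ignorecaseCASESENSITIVE")

print(mixed_ok is not None)   # True
print(mixed_bad is not None)  # False
```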
Example:
(?i:ignorecase)(?-i:casesensitive)
Summary
Understanding matching modes is essential for writing efficient and accurate regex patterns. By leveraging modes like case-insensitivity, single-line, multi-line, and free-spacing, you can create more flexible and maintainable regular expressions.
-
By Jessica Brown in Jessica Brown
Unicode regular expressions are essential for working with text in multiple languages and character sets. As the world becomes more interconnected, supporting Unicode is increasingly important for ensuring that software can handle diverse text inputs.
What is Unicode?
Unicode is a standardized character set that encompasses characters and glyphs from all human languages, both living and dead. It aims to provide a consistent way to represent characters from different languages, eliminating the need for language-specific character sets.
Challenges with Unicode in Regular Expressions
Working with Unicode introduces unique challenges:
Characters, Code Points, and Graphemes:
A single character (grapheme) may be represented by multiple code points. For example, the letter "à" can be represented as:
A single code point: U+00E0
Two code points: U+0061 ("a") + U+0300 (grave accent)
Regular expressions that treat code points as characters may fail to match graphemes correctly.
Combining Marks:
Combining marks are code points that modify the preceding character. For example, U+0300 (grave accent) is a combining mark that can be applied to many base characters.
Matching Unicode Graphemes
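To make the two representations of "à" concrete, here is a small sketch using Python's standard unicodedata and re modules. Python's built-in re module is an assumption of this example and is not one of the flavors the article lists; it treats each code point as one character, so it illustrates the mismatch rather than solving it.

```python
import re
import unicodedata

precomposed = "\u00e0"   # "à" as a single code point
decomposed = "a\u0300"   # "a" followed by the combining grave accent

print(len(precomposed), len(decomposed))  # 1 2
print(precomposed == decomposed)          # False

# The dot matches one code point, so it cannot cover the two-code-point form
dot_on_decomposed = re.fullmatch(r".", decomposed)
print(dot_on_decomposed)                  # None

# NFC normalization composes base character + combining mark where possible
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == precomposed)          # True

# U+0300 is a nonspacing combining mark
print(unicodedata.category("\u0300"))     # Mn
```

Normalizing both the pattern and the input to the same form before matching is one way to sidestep the mismatch.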
To match a single Unicode grapheme (character), use:
Perl, RegexBuddy, PowerGREP: \X
Java, .NET: \P{M}\p{M}*
Example:
\X matches a grapheme
\P{M}\p{M}* matches a base character followed by zero or more combining marks
Matching Specific Code Points
To match a specific Unicode code point, use:
JavaScript, .NET, Java: \uFFFF (FFFF is the hexadecimal code point)
Perl, PCRE: \x{FFFF}
Unicode Character Properties
Unicode defines properties that categorize characters based on their type. You can match characters belonging to specific categories using:
Positive Match: \p{Property}
Negative Match: \P{Property}
Common Properties:
\p{L} - Letter
\p{Lu} - Uppercase Letter
\p{Ll} - Lowercase Letter
\p{N} - Number
\p{P} - Punctuation
\p{S} - Symbol
\p{Z} - Separator
\p{C} - Other (control characters and other invisible or unassigned code points)
Unicode Scripts and Blocks
Unicode groups characters into scripts and blocks:
Scripts: Collections of characters used by a particular language or writing system.
Blocks: Contiguous ranges of code points.
Example Scripts:
\p{Latin}
\p{Greek}
\p{Cyrillic}
Example Blocks:
\p{InBasic_Latin}
\p{InGreek_and_Coptic}
\p{InCyrillic}
Best Practices for Unicode Regex
Use \X to match graphemes when supported.
Be aware that the same character can be encoded as different code point sequences.
Normalize input to avoid mismatches between those sequences.
Use Unicode properties to match character categories.
Use scripts and blocks to match specific writing systems.
-
By Jessica Brown in Jessica Brown
Named capturing groups allow you to assign names to capturing groups, making it easier to reference them in complex regular expressions. This feature is available in most modern regular expression engines.
Why Use Named Capturing Groups?
In traditional regular expressions, capturing groups are referenced by their numbers (e.g., \1, \2). As the number of groups increases, it becomes harder to manage and understand which group corresponds to which part of the match. Named capturing groups solve this problem by allowing you to reference groups by descriptive names.
Example (Traditional):
(\d{4})-(\d{2})-(\d{2})
In this pattern, you would reference the year as \1, the month as \2, and the day as \3.
Example (Named):
(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})
Now, you can reference the year as year, the month as month, and the day as day, making the regex more readable and maintainable.
Named Capture Syntax by Flavor
Python, PCRE, and PHP
These flavors use the following syntax for named capturing groups:
(?P<name>group)
To reference the named group inside the regex, use:
(?P=name)
To reference it in replacement text, use:
\g<name>
Example:
(?P<word>\w+)\s+(?P=word)
This pattern matches doubled words like "the the".
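The Python-style syntax above can be tried directly with the standard re module. This is only a sketch; the sample strings and variable names are made up for illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Read the groups back by name instead of by number
m = date.search("Published on 2025-01-09.")
parts = (m.group("year"), m.group("month"), m.group("day"))
print(parts)      # ('2025', '01', '09')

# \g<name> refers to a named group in the replacement text
reordered = date.sub(r"\g<day>/\g<month>/\g<year>", "2025-01-09")
print(reordered)  # 09/01/2025

# (?P=word) refers to a named group inside the regex itself
doubled = re.compile(r"(?P<word>\w+)\s+(?P=word)")
found = doubled.search("Paris in the the spring").group()
print(found)      # the the
```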
.NET Framework
The .NET regex engine uses its own syntax for named capturing groups:
(?<name>group) or (?'name'group)
To reference the named group inside the regex, use:
\k<name> or \k'name'
In replacement text, use:
${name}
Example:
(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})
This pattern matches a date in YYYY-MM-DD format. You can reference the named groups in replacement text like:
${year}/${month}/${day}
Multiple Groups with the Same Name
In the .NET framework, you can have multiple capturing groups with the same name. This is useful when you have different patterns that should capture the same kind of data.
Example:
a(?<digit>[0-5])|b(?<digit>[4-7])
In this pattern, both groups are named digit. The capturing group will contain the matched digit, regardless of which alternative was matched.
Note:
Python does not allow multiple groups with the same name, and PCRE rejects them by default. Attempting to use them will result in a compilation error.
Numbering of Named Groups
The way capturing groups are numbered varies between regex flavors:
Python and PCRE
Both named and unnamed capturing groups are numbered from left to right.
(a)(?P<x>b)(c)(?P<y>d)
In this pattern:
Group 1: (a)
Group 2: (?P<x>b)
Group 3: (c)
Group 4: (?P<y>d)
In replacement text, you can reference these groups as \1, \2, \3, and \4.
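The left-to-right numbering can be checked in Python, where a compiled pattern exposes the name-to-number mapping through its groupindex attribute. A minimal sketch with illustrative variable names:

```python
import re

pattern = re.compile(r"(a)(?P<x>b)(c)(?P<y>d)")

group_count = pattern.groups
name_to_number = dict(pattern.groupindex)
print(group_count)     # 4
print(name_to_number)  # {'x': 2, 'y': 4}

# Named groups can still be referenced by number in replacement text
reversed_text = pattern.sub(r"\4\3\2\1", "abcd")
print(reversed_text)   # dcba
```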
.NET Framework
The .NET framework handles named groups differently. Named groups are numbered after all unnamed groups.
(a)(?<x>b)(c)(?<y>d)
In this pattern:
Group 1: (a)
Group 2: (c)
Group 3: (?<x>b)
Group 4: (?<y>d)
In replacement text, you would reference the groups as:
$1 for (a)
$2 for (c)
$3 for (?<x>b)
$4 for (?<y>d)
To avoid confusion, it’s best to reference named groups by their names rather than their numbers in the .NET framework.
Best Practices
To ensure compatibility across different regex flavors and avoid confusion, follow these best practices:
Do not mix named and unnamed groups. Use either all named groups or all unnamed groups.
Use non-capturing groups for parts of the regex that don’t need to be captured: (?:group)
Use descriptive names for capturing groups to make your regex more readable.
JGsoft Engine
The JGsoft regex engine (used in tools like EditPad Pro and PowerGREP) supports both Python-style and .NET-style named capturing groups.
Python-style named groups are numbered along with unnamed groups.
.NET-style named groups are numbered after unnamed groups.
Multiple groups with the same name are allowed.
Summary
Named capturing groups make regular expressions more readable and maintainable. Different regex flavors have varying syntaxes and behaviors for named groups. To write portable and efficient regex patterns:
Use named groups to improve readability.
Avoid mixing named and unnamed groups.
Use non-capturing groups when capturing is unnecessary.
By understanding how different regex engines handle named groups, you can write more robust and compatible regex patterns across various programming languages and tools.
-
By Jessica Brown in Jessica Brown
In regular expressions, round brackets (()) are used for grouping. Grouping allows you to apply operators to multiple tokens at once. For example, you can make an entire group optional or repeat the entire group using repetition operators.
Basic Usage
For example:
Set(Value)?
This pattern matches:
"Set"
"SetValue"
The round brackets group "Value", and the question mark makes it optional.
Note:
Square brackets ([]) define character classes.
Curly braces ({}) specify repetition counts.
Only round brackets (()) are used for grouping.
Backreferences
Round brackets not only group parts of a regex but also create backreferences. A backreference stores the text matched by the group, allowing you to reuse it later in the regex or replacement text.
Example:
Set(Value)?
If "SetValue" is matched, the backreference \1 will contain "Value". If only "Set" is matched, the backreference will be empty.
To prevent creating a backreference, use non-capturing parentheses:
Set(?:Value)?
The (?: ... ) syntax disables capturing, making the regex more efficient when backreferences are not needed.
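A brief sketch of the difference between capturing and non-capturing parentheses, using Python's re module. Note that Python reports a group that did not take part in the match as None rather than as an empty string.

```python
import re

capturing = re.compile(r"Set(Value)?")
non_capturing = re.compile(r"Set(?:Value)?")

with_value = capturing.fullmatch("SetValue").group(1)
without_value = capturing.fullmatch("Set").group(1)
print(with_value)     # Value
print(without_value)  # None

# The non-capturing version matches the same text but stores no groups
print(non_capturing.fullmatch("SetValue") is not None)  # True
print(non_capturing.groups)                             # 0
```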
Using Backreferences in Replacement Text
Backreferences are often used in search-and-replace operations. The exact syntax for using backreferences in replacement text varies between tools and programming languages.
For example, in many tools:
\1 refers to the first capturing group.
\2 refers to the second capturing group, and so on.
In replacement text, you can use these backreferences to reinsert matched text:
Find: (\w+)\s+\1
Replace: \1
This pattern finds doubled words like "the the" and replaces them with a single instance.
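The find-and-replace pair above translates to a single re.sub call in Python. The sketch below also shows why this article's later doubled-word example wraps the pattern in word boundaries: without them, the pattern can match inside words. The sample sentences are illustrative.

```python
import re

# Find: (\w+)\s+\1   Replace: \1
fixed = re.sub(r"(\w+)\s+\1", r"\1", "the the cat")
print(fixed)     # the cat

# Without word boundaries, "is is" is found inside "This is"
mangled = re.sub(r"(\w+)\s+\1", r"\1", "This is a test")
print(mangled)   # This a test

# With \b on both sides, only whole doubled words are replaced
bounded = re.sub(r"\b(\w+)\s+\1\b", r"\1", "This is a test")
print(bounded)   # This is a test
```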
Using Backreferences in the Regex
Backreferences can also be used within the regex itself to match the same text again.
Example:
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
This pattern matches an HTML tag and its corresponding closing tag. The opening tag name is captured in the first backreference, and \1 is used to ensure the closing tag matches the same name.
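As a quick check of this pattern in Python, the sketch below runs it against a nested-tag sample and against a mismatched pair. The input strings are illustrative.

```python
import re

tag_pair = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>")

m = tag_pair.search("Testing <B><I>bold italic</I></B> text")
print(m.group())    # <B><I>bold italic</I></B>
print(m.group(1))   # B

# The closing tag must repeat the captured name, so this finds nothing
mismatched = tag_pair.search("<B>bold</I>")
print(mismatched)   # None
```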
Numbering Backreferences
Backreferences are numbered based on the order of opening brackets in the regex:
The first opening bracket creates backreference \1.
The second opening bracket creates backreference \2.
Non-capturing groups do not count toward the numbering.
Example:
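([a-c])x\1x\1
This pattern matches:
"axaxa"
"bxbxb"
"cxcxc"
If a group matches an empty string, its backreference is empty as well, and the regex still works. A group that did not participate in the match at all behaves differently, as described under Backreferences to Failed Groups.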
Looking Inside the Regex Engine
Let’s see how the regex engine processes the following pattern:
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
when applied to the string:
Testing <B><I>bold italic</I></B> text
The engine matches <B> and stores "B" in the first backreference.
It skips over the text until it finds the closing </B>.
The backreference \1 ensures the closing tag matches the same name as the opening tag.
The entire match is <B><I>bold italic</I></B>.
Backreferences to Failed Groups
There’s a difference between a backreference to a group that matched nothing and one to a group that did not participate at all:
Example:
(q?)b\1
This pattern matches "b" because the optional q? matched nothing.
In contrast:
(q)?b\1
This pattern fails to match "b" because the group (q) did not participate in the match at all.
In most regex flavors, a backreference to a non-participating group causes the match to fail. However, in JavaScript, backreferences to non-participating groups match an empty string.
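Python's re module behaves like the majority of flavors described here, which makes the distinction easy to observe. A minimal sketch:

```python
import re

# Group participates but matches the empty string: \1 is empty, match succeeds
empty_group = re.fullmatch(r"(q?)b\1", "b")
print(empty_group is not None)   # True

# Group does not participate at all: the backreference makes the match fail
skipped_group = re.fullmatch(r"(q)?b\1", "b")
print(skipped_group is not None) # False

# When the group does participate, \1 must repeat what it captured
both_q = re.fullmatch(r"(q)?b\1", "qbq")
print(both_q is not None)        # True
```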
Forward References and Invalid References
Some modern regex flavors, like .NET, Java, and Perl, allow forward references. A forward reference is a backreference to a group that appears later in the regex.
Example:
(\2two|(one))+
This pattern matches "oneonetwo". The forward reference \2 fails at first but succeeds when the group is matched during repetition.
In most flavors, referencing a group that doesn’t exist results in an error. In JavaScript and Ruby, such references result in a zero-width match.
Repetition and Backreferences
The regex engine doesn’t permanently substitute backreferences in the regex. Instead, it uses the most recent value captured by the group.
Example:
([abc]+)=\1
This pattern matches "cab=cab".
In contrast:
([abc])+\1
This pattern does not match "cab" because the backreference holds only the last value captured by the group (in this case, "b").
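The two patterns can be compared in Python. The extra input "cabb" is an illustrative addition that shows the second pattern succeeding once the last captured character is repeated.

```python
import re

whole = re.fullmatch(r"([abc]+)=\1", "cab=cab")
print(whole.group(1))            # cab

# The group is overwritten on each repetition, so \1 is only the last letter
no_match = re.search(r"([abc])+\1", "cab")
print(no_match)                  # None

repeated_last = re.search(r"([abc])+\1", "cabb")
print(repeated_last.group())     # cabb
print(repeated_last.group(1))    # b
```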
Useful Example: Checking for Doubled Words
You can use the following regex to find doubled words in a text:
\b(\w+)\s+\1\b
In your text editor, replace the doubled word with \1 to remove the duplicate.
Example:
Input: "the the cat"
Output: "the cat"
Limitations
Round brackets cannot be used inside character classes. For example:
[(a)b]
This pattern matches the literal characters "a", "b", "(", and ")".
Backreferences also cannot be used inside character classes. In most flavors, \1 inside a character class is treated as an octal escape sequence. Example:
(a)[\1b]
This pattern matches "a" followed by either \x01 (an octal escape) or "b".
Summary
Grouping with round brackets allows you to:
Apply operators to entire groups of tokens.
Create backreferences for reuse in the regex or replacement text.
Use non-capturing groups (?: ... ) to avoid creating unnecessary backreferences and improve performance.
Be mindful of the limitations and differences in behavior across various regex flavors.
-
By Jessica Brown in Jessica Brown
In addition to the question mark, regex provides two more repetition operators: the asterisk (*) and the plus (+).
Basic Usage
The * (star) matches the preceding token zero or more times.
The + (plus) matches the preceding token one or more times.
For example:
<[A-Za-z][A-Za-z0-9]*>
This pattern matches HTML tags without attributes:
<[A-Za-z] matches the opening bracket and the first letter.
[A-Za-z0-9]* matches zero or more alphanumeric characters after the first letter.
This regex will match tags like:
<B>
<HTML>
If you used + instead of *, the regex would require at least one alphanumeric character after the first letter, making it match:
<HTML> but not <B>.
Limiting Repetition
Modern regex flavors allow you to limit repetitions using curly braces ({}).
Syntax:
{min,max}
min: Minimum number of matches.
max: Maximum number of matches.
Examples:
{0,} is equivalent to *.
{1,} is equivalent to +.
{3} matches exactly three repetitions.
Example:
\b[1-9][0-9]{3}\b
This pattern matches numbers between 1000 and 9999.
\b[1-9][0-9]{2,4}\b
This pattern matches numbers between 100 and 99999.
The word boundaries (\b) ensure that only complete numbers are matched.
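To see the two range patterns and the effect of the word boundaries, here is a small Python sketch. The list of sample numbers is made up for illustration.

```python
import re

text = "42 100 1000 9999 99999 123456 0123"

# Exactly four digits with no leading zero: 1000 to 9999
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", text)
print(four_digits)     # ['1000', '9999']

# Three to five digits with no leading zero: 100 to 99999
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", text)
print(three_to_five)   # ['100', '1000', '9999', '99999']
```

Neither pattern picks digits out of 123456 or 0123, because the \b anchors require the match to cover the whole number.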
Watch Out for Greediness!
All repetition operators (*, +, and {}) are greedy by default. This means the regex engine will try to match as much text as possible.
Example:
Consider the pattern:
<.+>
When applied to the string:
This is a <EM>first</EM> test.
You might expect it to match <EM> and </EM> separately. However, it will match <EM>first</EM> instead.
This happens because the + is greedy and matches as many characters as possible.
Looking Inside the Regex Engine
The first token in the regex is <, which matches the first < in the string.
The next token is the . (dot), which matches any character except newlines. The + causes the dot to repeat as many times as possible:
The dot matches E, then M, and so on.
It continues matching until the end of the string.
At this point, the > token fails to match because there are no more characters left.
The engine then backtracks and tries to reduce the match length until > matches the next character.
The final match is <EM>first</EM>.
Laziness Instead of Greediness
To fix this issue, make the quantifier lazy by adding a question mark (?):
<.+?>
This tells the engine to match as few characters as possible.
The < matches the first <.
The . matches E.
The engine checks for > and finds a match right after EM.
The final match is <EM>, which is what we intended.
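The greedy and lazy versions can be compared side by side in Python on the sample string used above. A minimal sketch:

```python
import re

text = "This is a <EM>first</EM> test."

greedy = re.findall(r"<.+>", text)
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<EM>first</EM>']
print(lazy)    # ['<EM>', '</EM>']
```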
An Alternative to Laziness
Instead of using lazy quantifiers, you can use a negated character class:
<[^>]+>
This pattern matches any sequence of characters that are not >, followed by >. It avoids backtracking and improves performance.
Example:
Given the string:
This is a <EM>first</EM> test.
The regex <[^>]+> will match:
<EM>
</EM>
This approach is more efficient because it reduces backtracking, which can significantly improve performance in large datasets or tight loops.
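A short Python sketch of the negated character class on the same sample string. The second input, a tag split across two lines, is an illustrative assumption: it shows one behavioral difference from the dot-based patterns, since a negated class also matches newlines while the dot does not by default.

```python
import re

text = "This is a <EM>first</EM> test."
tags = re.findall(r"<[^>]+>", text)
print(tags)         # ['<EM>', '</EM>']

# A tag broken across two lines (hypothetical input)
multiline = "<EM\nclass='x'>"
negated_class = re.findall(r"<[^>]+>", multiline)
lazy_dot = re.findall(r"<.+?>", multiline)

print(len(negated_class))  # 1
print(lazy_dot)            # []
```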
Summary
The *, +, and {} quantifiers control repetition in regex. They are greedy by default, but you can make them lazy by adding a question mark (?). Using negated character classes is another way to handle repetition efficiently without backtracking.