Entries in this blog

Character classes, also known as character sets, allow you to define a set of characters that a regex engine should match at a specific position in the text. To create a character class, place the desired characters between square brackets. For instance, to match either an a or an e, use the pattern [ae]. This can be particularly useful when dealing with variations in spelling, such as in the regex gr[ae]y, which will match both "gray" and "grey."

Key Points About Character Classes:

  • A character class matches only a single character.

  • The order of characters inside a character class does not affect the outcome.

For example, gr[ae]y will not match "graay" or "graey," as the class only matches one character from the set at a time.
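As a quick check of the points above, here is a minimal sketch using Python's re module (the choice of Python and the variable names are assumptions for illustration, not part of the original tutorial). It tests gr[ae]y against whole words with fullmatch, so a word passes only if the class consumed exactly one character.

```python
import re

# The class [ae] consumes exactly one character, either "a" or "e".
pattern = re.compile(r"gr[ae]y")

# fullmatch succeeds only if the entire word fits the pattern.
results = {
    word: bool(pattern.fullmatch(word))
    for word in ["gray", "grey", "graay", "graey"]
}
print(results)
```

Only "gray" and "grey" come back as True; the two longer spellings fail because the class cannot match two characters.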


Using Ranges in Character Classes

You can specify a range of characters within a character class by using a hyphen (-). For example:

  • [0-9] matches any digit from 0 to 9.

  • [a-fA-F] matches any letter from a to f, regardless of case.

You can also combine multiple ranges and individual characters within a character class:

  • [0-9a-fxA-FX] matches any hexadecimal digit or the letter X.

Again, the order of characters inside the class does not matter.


Useful Applications of Character Classes

Here are some practical use cases for character classes:

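  • sep[ae]r[ae]te: Matches "separate" and the common misspelling "seperate" (it also matches "separete" and "seperete").

  • li[cs]en[cs]e: Matches "license" and "licence" (it also matches the misspellings "lisence" and "lisense").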

  • [A-Za-z_][A-Za-z_0-9]*: Matches identifiers in programming languages.

  • 0[xX][A-Fa-f0-9]+: Matches C-style hexadecimal numbers.
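The last two patterns can be tried out with a short Python sketch. The sample string and variable names below are made up for illustration.

```python
import re

identifier = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")
hex_literal = re.compile(r"0[xX][A-Fa-f0-9]+")

# Hypothetical line of source code used as test input.
source = "total_1 = 0x1F + 0Xff"

hex_found = hex_literal.findall(source)

# An identifier may not start with a digit, so "9lives" is rejected.
valid_name = identifier.fullmatch("total_1") is not None
invalid_name = identifier.fullmatch("9lives") is not None

print(hex_found, valid_name, invalid_name)
```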


Negated Character Classes

By adding a caret (^) immediately after the opening square bracket, you create a negated character class. This instructs the regex engine to match any character not in the specified set.

For example:

  • q[^u]: Matches a q followed by any character except u.

However, it’s essential to remember that a negated character class still requires a character to follow the initial match. For instance, q[^u] will match the q and the space in "Iraq is a country," but it will not match the q in "Iraq" by itself.

To ensure that the q is not followed by a u, use negative lookahead: q(?!u). We will cover lookaheads later in this tutorial.
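The difference between the negated class and the lookahead is easy to see in a small Python sketch (Python is used here only as an example flavor that supports both constructs).

```python
import re

# The negated class must consume a character, so the match includes the space.
with_space = re.search(r"q[^u]", "Iraq is a country").group()

# At the very end of the string there is no character left for [^u].
at_end = re.search(r"q[^u]", "Iraq")

# The lookahead only inspects what follows and consumes nothing.
lookahead = re.search(r"q(?!u)", "Iraq").group()

print(repr(with_space), at_end, repr(lookahead))
```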


Metacharacters Inside Character Classes

Inside character classes, most metacharacters lose their special meaning. However, a few characters retain their special roles:

  • Closing bracket (])

  • Backslash (\)

  • Caret (^) (only if it appears immediately after the opening bracket)

  • Hyphen (-) (only if placed between characters to specify a range)

To include these characters as literals:

  • Backslash (\) must be escaped with another backslash: [\\].

  • Caret (^) can appear anywhere except right after the opening bracket.

  • Closing bracket (]) can be placed right after the opening bracket or caret.

  • Hyphen (-) can be placed at the start or end of the class.

Examples:

  • [x^] matches x or ^.

  • []x] matches ] or x.

  • [^]x] matches any character that is not ] or x.

  • [-x] matches x or -.
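The placement rules can be verified with a short Python sketch. The sample string is an assumption chosen so that it contains each of the special characters once; other flavors may treat some of these placements differently.

```python
import re

# Sample text containing x, ^, ], - and a single backslash.
samples = "x^]-\\"

caret_literal = re.findall(r"[x^]", samples)     # ^ is literal when not first
bracket_literal = re.findall(r"[]x]", samples)   # ] is literal right after [
negated = re.findall(r"[^]x]", samples)          # ] is literal right after ^
hyphen_literal = re.findall(r"[-x]", samples)    # - is literal at the start
backslash = re.findall(r"[\\]", samples)         # backslash escaped with itself

print(caret_literal, bracket_literal, negated, hyphen_literal, backslash)
```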


Shorthand Character Classes

Shorthand character classes are predefined character sets that simplify your regex patterns. Here are the most common shorthand classes:

Shorthand   Meaning                    Equivalent Character Class
\d          Any digit                  [0-9]
\w          Any word character         [A-Za-z0-9_]
\s          Any whitespace character   [ \t\r\n]

Details:

  • \d matches digits from 0 to 9.

  • \w includes letters, digits, and underscores.

  • \s matches spaces, tabs, and line breaks. In some flavors, it may also include form feeds and vertical tabs.

The characters included in these shorthand classes may vary depending on the regex flavor. For example:

  • JavaScript treats \d and \w as ASCII-only but includes Unicode characters for \s.

  • XML handles \d and \w as Unicode but limits \s to ASCII characters.

  • Python allows you to control what the shorthand classes match using specific flags.
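As an illustration of the Python case, the sketch below compares the default behavior for text (str) patterns with the re.ASCII flag. The test string is an assumption: it holds the ASCII digit 3 followed by U+0663, the Arabic-Indic digit three.

```python
import re

# "3" is an ASCII digit; "\u0663" is ARABIC-INDIC DIGIT THREE.
text = "3\u0663"

# For str patterns, \d matches any Unicode decimal digit by default.
unicode_digits = re.findall(r"\d", text)

# With re.ASCII, \d is restricted to [0-9].
ascii_digits = re.findall(r"\d", text, re.ASCII)

print(len(unicode_digits), len(ascii_digits))
```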

Shorthand character classes can be used both inside and outside of square brackets:

  • \s\d matches a whitespace character followed by a digit.

  • [\s\d] matches a single character that is either whitespace or a digit.

For instance, when applied to the string "1 + 2 = 3":

  • \s\d matches the space and the digit 2.

  • [\s\d] matches the digit 1.

The shorthand [\da-fA-F] matches a hexadecimal digit and is equivalent to [0-9a-fA-F].
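The "1 + 2 = 3" example can be reproduced with a few lines of Python. The hexadecimal test string "0x7fZ" is made up for illustration, and re.ASCII is passed there so that \d means exactly [0-9] in Python.

```python
import re

text = "1 + 2 = 3"

# Two tokens in sequence: a whitespace character, then a digit.
sequence = re.search(r"\s\d", text).group()

# One token: a single character that is whitespace or a digit.
either = re.search(r"[\s\d]", text).group()

# Shorthand combined with ranges inside one class.
hex_digits = re.findall(r"[\da-fA-F]", "0x7fZ", re.ASCII)

print(repr(sequence), repr(either), hex_digits)
```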


Negated Shorthand Character Classes

The primary shorthand classes also have negated versions:

  • \D: Matches any character that is not a digit. Equivalent to [^\d].

  • \W: Matches any character that is not a word character. Equivalent to [^\w].

  • \S: Matches any character that is not whitespace. Equivalent to [^\s].

Be careful when using negated shorthand inside square brackets. For example:

  • [\D\S] is not the same as [^\d\s].

    • [\D\S] will match any character, including digits and whitespace, because a digit is not whitespace and whitespace is not a digit.

    • [^\d\s] will match any character that is neither a digit nor whitespace.
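A small Python sketch makes the difference concrete. The three-character test string is an assumption: one letter, one space, one digit.

```python
import re

text = "a 1"

# Union of "not a digit" and "not whitespace": every character qualifies.
union = re.findall(r"[\D\S]", text)

# Negation of "digit or whitespace": only the letter qualifies.
neither = re.findall(r"[^\d\s]", text)

print(union, neither)
```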


Repeating Character Classes

You can repeat a character class using quantifiers like ?, *, or +:

  • [0-9]+: Matches one or more digits and can match "837" as well as "222".

If you want to repeat the matched character instead of the entire class, you need to use backreferences:

  • ([0-9])\1+: Matches repeated digits, like "222," but not "837."

    • Applied to the string "833337," this regex matches "3333."
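The contrast between repeating the class and repeating the matched character can be checked in Python, shown here as one possible flavor:

```python
import re

any_digits = re.compile(r"[0-9]+")       # repeats the class
same_digit = re.compile(r"([0-9])\1+")   # repeats whatever group 1 captured

class_match = any_digits.fullmatch("837") is not None
backref_on_837 = same_digit.search("837")
backref_on_833337 = same_digit.search("833337").group()

print(class_match, backref_on_837, backref_on_833337)
```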

If you want more control over repeated matches, consider using lookahead and lookbehind assertions, which we will explore later in the tutorial.


Looking Inside the Regex Engine

As previously discussed, the order of characters inside a character class does not matter. For instance, gr[ae]y can match both "gray" and "grey."

Let’s see how the regex engine processes gr[ae]y step by step:

Given the string:

"Is his hair grey or gray?"
  1. The engine starts at the first character and fails to match g until it reaches the 13th character.

  2. At the 13th character, g matches.

  3. The next token r matches the following character.

  4. The character class [ae] gives the engine two options:

    • First, it tries a, which fails.

    • Then, it tries e, which matches.

  5. The final token y matches the next character, completing the match.

The engine returns "grey" as the match result and stops searching, even though "gray" also exists in the string. This is because the regex engine is eager to report the first valid match it finds.
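The walkthrough above can be mirrored in Python. Note that Python reports zero-based offsets, so the 13th character has index 12.

```python
import re

text = "Is his hair grey or gray?"

# search() stops at the first (leftmost) match, as described above.
first = re.search(r"gr[ae]y", text)
first_text, first_index = first.group(), first.start()

# findall() keeps scanning after each match and so also finds "gray".
all_matches = re.findall(r"gr[ae]y", text)

print(first_text, first_index, all_matches)
```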

Understanding how the regex engine processes character classes helps you write more efficient patterns and predict match results more accurately.

Table of Contents

  1. Regular Expression Tutorial

  2. Different Regular Expression Engines

  3. Literal Characters

  4. Special Characters

  5. Non-Printable Characters

  6. First Look at How a Regex Engine Works Internally

  7. Character Classes or Character Sets

  8. The Dot Matches (Almost) Any Character

  9. Start of String and End of String Anchors

  10. Word Boundaries

  11. Alternation with the Vertical Bar or Pipe Symbol

  12. Optional Items

  13. Repetition with Star and Plus

  14. Grouping with Round Brackets

  15. Named Capturing Groups

  16. Unicode Regular Expressions

  17. Regex Matching Modes

  18. Possessive Quantifiers

  19. Understanding Atomic Grouping in Regular Expressions

  20. Understanding Lookahead and Lookbehind in Regular Expressions (Lookaround)

  21. Testing Multiple Conditions on the Same Part of a String with Lookaround

  22. Understanding the \G Anchor in Regular Expressions

  23. Using If-Then-Else Conditionals in Regular Expressions

  24. XML Schema Character Classes and Subtraction Explained

  25. Understanding POSIX Bracket Expressions in Regular Expressions

  26. Adding Comments to Regular Expressions: Making Your Regex More Readable

  27. Free-Spacing Mode in Regular Expressions: Improving Readability

Named capturing groups allow you to assign names to capturing groups, making it easier to reference them in complex regular expressions. This feature is available in most modern regular expression engines.


Why Use Named Capturing Groups?

In traditional regular expressions, capturing groups are referenced by their numbers (e.g., \1, \2). As the number of groups increases, it becomes harder to manage and understand which group corresponds to which part of the match. Named capturing groups solve this problem by allowing you to reference groups by descriptive names.

Example (Traditional):

(\d{4})-(\d{2})-(\d{2})

In this pattern, you would reference the year as \1, the month as \2, and the day as \3.

Example (Named):

(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})

Now, you can reference the year as year, the month as month, and the day as day, making the regex more readable and maintainable.
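Here is a minimal sketch of the named version in Python, which uses the (?P<name>...) syntax shown above. The sample sentence is an assumption for illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

m = date.search("Released on 2024-03-15.")

# Groups can be read by name...
by_name = (m.group("year"), m.group("month"), m.group("day"))

# ...and are still available by number.
by_number = (m.group(1), m.group(2), m.group(3))

print(by_name, by_number, m.groupdict())
```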


Named Capture Syntax by Flavor

Python, PCRE, and PHP

These flavors use the following syntax for named capturing groups:

(?P<name>group)

To reference the named group inside the regex, use:

(?P=name)

To reference it in replacement text (this is the form Python's re.sub accepts), use:

\g<name>

Example:

(?P<word>\w+)\s+(?P=word)

This pattern matches doubled words like "the the".
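A short Python sketch shows both the in-pattern reference (?P=word) and the replacement reference \g<word>. The sample sentence is an assumption.

```python
import re

doubled = re.compile(r"(?P<word>\w+)\s+(?P=word)")

text = "Paris in the the spring"

found = doubled.search(text).group()

# Replace each doubled word with a single copy of the captured word.
fixed = doubled.sub(r"\g<word>", text)

print(repr(found), repr(fixed))
```

Without word boundaries this pattern would also fire on input such as "the theory", so a production version would likely wrap it in \b anchors.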


.NET Framework

The .NET regex engine uses its own syntax for named capturing groups:

(?<name>group) or (?'name'group)

To reference the named group inside the regex, use:

\k<name> or \k'name'

In replacement text, use:

${name}

Example:

(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})

This pattern matches a date in YYYY-MM-DD format. You can reference the named groups in replacement text like:

${year}/${month}/${day}

Multiple Groups with the Same Name

In the .NET framework, you can have multiple capturing groups with the same name. This is useful when you have different patterns that should capture the same kind of data.

Example:

a(?<digit>[0-5])|b(?<digit>[4-7])

In this pattern, both groups are named digit. The capturing group will contain the matched digit, regardless of which alternative was matched.

Note:

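  • Python does not allow multiple groups with the same name; attempting to do so results in a compilation error. PCRE also rejects duplicate names by default, unless they are explicitly enabled (for example with the (?J) option).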


Numbering of Named Groups

The way capturing groups are numbered varies between regex flavors:

Python and PCRE

Both named and unnamed capturing groups are numbered from left to right.

(a)(?P<x>b)(c)(?P<y>d)

In this pattern:

  • Group 1: (a)

  • Group 2: (?P<x>b)

  • Group 3: (c)

  • Group 4: (?P<y>d)

In replacement text, you can reference these groups as \1, \2, \3, and \4.
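The left-to-right numbering can be confirmed in Python, where a compiled pattern exposes the name-to-number mapping through its groupindex attribute:

```python
import re

pattern = re.compile(r"(a)(?P<x>b)(c)(?P<y>d)")
m = pattern.fullmatch("abcd")

# All four groups, named or not, in left-to-right order.
all_groups = m.groups()

# Mapping from group name to group number.
name_to_number = dict(pattern.groupindex)

print(all_groups, name_to_number)
```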

.NET Framework

The .NET framework handles named groups differently. Named groups are numbered after all unnamed groups.

(a)(?<x>b)(c)(?<y>d)

In this pattern:

  • Group 1: (a)

  • Group 2: (c)

  • Group 3: (?<x>b)

  • Group 4: (?<y>d)

In replacement text, you would reference the groups as:

  • $1 for (a)

  • $2 for (c)

  • $3 for (?<x>b)

  • $4 for (?<y>d)

To avoid confusion, it’s best to reference named groups by their names rather than their numbers in the .NET framework.


Best Practices

To ensure compatibility across different regex flavors and avoid confusion, follow these best practices:

  1. Do not mix named and unnamed groups. Use either all named groups or all unnamed groups.

  2. Use non-capturing groups for parts of the regex that don’t need to be captured:

(?:group)

  3. Use descriptive names for capturing groups to make your regex more readable.


JGsoft Engine

The JGsoft regex engine (used in tools like EditPad Pro and PowerGREP) supports both Python-style and .NET-style named capturing groups.

  • Python-style named groups are numbered along with unnamed groups.

  • .NET-style named groups are numbered after unnamed groups.

  • Multiple groups with the same name are allowed.


Summary

Named capturing groups make regular expressions more readable and maintainable. Different regex flavors have varying syntaxes and behaviors for named groups. To write portable and efficient regex patterns:

  • Use named groups to improve readability.

  • Avoid mixing named and unnamed groups.

  • Use non-capturing groups when capturing is unnecessary.

By understanding how different regex engines handle named groups, you can write more robust and compatible regex patterns across various programming languages and tools.

In addition to the question mark, regex provides two more repetition operators: the asterisk (*) and the plus (+).


Basic Usage

The * (star) matches the preceding token zero or more times. The + (plus) matches the preceding token one or more times.

For example:

<[A-Za-z][A-Za-z0-9]*>

This pattern matches HTML tags without attributes:

  • <[A-Za-z] matches the first letter.

  • [A-Za-z0-9]* matches zero or more alphanumeric characters after the first letter.

This regex will match tags like:

  • <B>

  • <HTML>

If you used + instead of *, the regex would require at least one alphanumeric character after the first letter, so it would match:

  • <HTML> but not <B>.
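To compare the two quantifiers side by side, here is a minimal Python sketch. The list of candidate tags is an assumption for illustration.

```python
import re

star = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")  # zero or more after the first letter
plus = re.compile(r"<[A-Za-z][A-Za-z0-9]+>")  # one or more after the first letter

tags = ["<B>", "<HTML>", "<H1>", "<1>"]

star_hits = [t for t in tags if star.fullmatch(t)]
plus_hits = [t for t in tags if plus.fullmatch(t)]

print(star_hits, plus_hits)
```

The single-letter tag <B> is accepted only by the star version, and <1> is rejected by both because a tag name must start with a letter.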


Limiting Repetition

Modern regex flavors allow you to limit repetitions using curly braces ({}).

Syntax:

{min,max}
  • min: Minimum number of matches.

  • max: Maximum number of matches.

Examples:

  • {0,} is equivalent to *.

  • {1,} is equivalent to +.

  • {3} matches exactly three repetitions.

Example:

\b[1-9][0-9]{3}\b

This pattern matches numbers between 1000 and 9999.

\b[1-9][0-9]{2,4}\b

This pattern matches numbers between 100 and 99999.

The word boundaries (\b) ensure that only complete numbers are matched.
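Both patterns can be tried on a line of sample numbers. The input string below is made up for illustration.

```python
import re

text = "7 42 100 1000 9999 12345 123456 0123"

# Exactly four digits, the first of which is not zero.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", text)

# Three to five digits, the first of which is not zero.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", text)

print(four_digits, three_to_five)
```

Without the \b anchors, the first pattern would also pick "1234" out of "12345".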


Watch Out for Greediness!

All repetition operators (*, +, and {}) are greedy by default. This means the regex engine will try to match as much text as possible.

Example:

Consider the pattern:

<.+>

When applied to the string:

This is a <EM>first</EM> test.

You might expect it to match <EM> and </EM> separately. However, it will match <EM>first</EM> instead.

This happens because the + is greedy and matches as many characters as possible.


Looking Inside the Regex Engine

The first token in the regex is <, which matches the first < in the string.

The next token is the . (dot), which matches any character except newlines. The + causes the dot to repeat as many times as possible:

  1. The dot matches E, then M, and so on.

  2. It continues matching until the end of the string.

  3. At this point, the > token fails to match because there are no more characters left.

The engine then backtracks, giving up one character at a time from the end of the string, until > can match. The first > it reaches this way is the one that closes </EM>.

The final match is <EM>first</EM>.


Laziness Instead of Greediness

To fix this issue, make the quantifier lazy by adding a question mark (?):

<.+?>

This tells the engine to match as few characters as possible.

  1. The < matches the first <.

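  2. The . matches E, the minimum that + requires.

  3. The engine tries > against M, which fails, so the dot expands to match M as well.

  4. The engine tries > again, and this time it matches the > right after EM.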

The final match is <EM>, which is what we intended.


An Alternative to Laziness

Instead of using lazy quantifiers, you can use a negated character class:

<[^>]+>

This pattern matches a <, then one or more characters that are not >, and then the closing >. Because the class can never run past the closing >, the engine has very little backtracking to do, which improves performance.

Example:

Given the string:

This is a <EM>first</EM> test.

The regex <[^>]+> will match:

  • <EM>

  • </EM>

This approach is more efficient because it reduces backtracking, which can significantly improve performance in large datasets or tight loops.
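The three variants discussed in this entry can be compared directly in Python. This is a sketch of the matching behavior only; it does not measure performance.

```python
import re

text = "This is a <EM>first</EM> test."

greedy = re.findall(r"<.+>", text)       # runs to the last > in the string
lazy = re.findall(r"<.+?>", text)        # stops at the first > it can reach
negated = re.findall(r"<[^>]+>", text)   # can never cross a >

print(greedy, lazy, negated)
```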

The *, +, and {} quantifiers control repetition in regex. They are greedy by default, but you can make them lazy by adding a question mark (?). Using negated character classes is another way to handle repetition efficiently without backtracking.

Regular expressions can quickly become complex and difficult to understand, especially when dealing with long patterns. To make them easier to read and maintain, many modern regex engines allow you to add comments directly into your regex patterns. This makes it possible to explain what each part of the expression does, reducing confusion and improving readability.


How to Add Comments in Regular Expressions

The syntax for adding a comment inside a regex is:

(?#comment)
  • The text inside the parentheses after ?# is treated as a comment.

  • The regex engine ignores everything inside the comment until it encounters a closing parenthesis ).

  • The comment can be anything you want, as long as it does not include a closing parenthesis.

For example, here’s a regex to match a valid date in the format yyyy-mm-dd, with comments to explain each part:

(?#year)(19|20)\d\d[- /.](?#month)(0[1-9]|1[012])[- /.](?#day)(0[1-9]|[12][0-9]|3[01])

This regex is much more understandable with comments:

  • (?#year): Marks the section that matches the year.

  • (?#month): Marks the section that matches the month.

  • (?#day): Marks the section that matches the day.

Without these comments, the regex would be difficult to decipher at a glance.


Benefits of Using Comments in Regular Expressions

Adding comments to your regex patterns offers several benefits:

  1. Improves readability: Comments clarify the purpose of each section of your regex, making it easier to understand.

  2. Simplifies maintenance: If you need to update a regex later, comments make it easier to remember what each part of the pattern does.

  3. Helps collaboration: When sharing regex patterns with others, comments make it easier for them to follow your logic.


Using Free-Spacing Mode for Better Formatting

In addition to inline comments, many regex engines also support free-spacing mode, which allows you to add spaces and line breaks to your regex without affecting the match.

Free-spacing mode makes your regex more structured and readable by allowing you to organize it into logical sections. To enable free-spacing mode:

  • In Perl and Ruby, use the /x modifier. PCRE has an equivalent extended option (also available inline as (?x)), and in Python you pass the re.VERBOSE (re.X) flag.

  • In .NET, use the RegexOptions.IgnorePatternWhitespace option.

  • In Java, use the Pattern.COMMENTS flag.

Here’s an example of how free-spacing mode can improve the readability of a regex:

Without Free-Spacing Mode:

(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])

With Free-Spacing Mode and Comments:

(?#year) (19|20) \d\d        # Match years 1900 to 2099
[- /.]                       # Separator (dash, slash, or dot)
(?#month) (0[1-9] | 1[012])  # Match months 01 to 12
[- /.]                       # Separator
(?#day) (0[1-9] | [12][0-9] | 3[01])  # Match days 01 to 31

The second version is far easier to read and maintain.
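As one possible way to run the free-spacing version, the sketch below uses Python's re.VERBOSE flag. The layout follows the example above; the test dates are assumptions. In Python's verbose mode a space inside a character class is still significant, which is why [- /.] keeps matching a space.

```python
import re

date = re.compile(r"""
    (?#year)  (19|20) \d\d                   # years 1900 to 2099
    [- /.]                                   # separator
    (?#month) (0[1-9] | 1[012])              # months 01 to 12
    [- /.]                                   # separator
    (?#day)   (0[1-9] | [12][0-9] | 3[01])   # days 01 to 31
""", re.VERBOSE)

dashed = date.fullmatch("2024-03-15").groups()
spaced = date.fullmatch("1999 12 31") is not None
bad_month = date.fullmatch("2024-13-15")

print(dashed, spaced, bad_month)
```

Note that the first group captures only the century ("19" or "20"), because \d\d sits outside the parentheses.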


Which Regex Engines Support Comments?

Most modern regex engines support the (?#comment) syntax for adding comments, including:

Regex Engine   Supports Comments?   Supports Free-Spacing Mode?
JGsoft         Yes                  Yes
.NET           Yes                  Yes
Perl           Yes                  Yes
PCRE           Yes                  Yes
Python         Yes                  Yes
Ruby           Yes                  Yes
Java           No                   Yes (via Pattern.COMMENTS)


Example: Using Comments to Document a Complex Regex

Here’s an example of a more complex regex that extracts email addresses from a text file. Without comments, the regex looks like this:

\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b

Adding comments and using free-spacing mode makes it much more understandable:

\b                      # Word boundary to ensure we're at the start of a word
[A-Za-z0-9._%+-]+       # Local part of the email (before @)
@                       # At symbol
[A-Za-z0-9.-]+          # Domain name
\.                      # Dot before the top-level domain
[A-Za-z]{2,}            # Top-level domain (e.g., com, net, org)
\b                      # Word boundary to ensure we're at the end of a word
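The commented pattern can be used as is in Python with the re.VERBOSE flag. This is a minimal sketch: the constant name EMAIL and the sample text are assumptions, and the pattern is the simplified one from this entry, not a full address validator.

```python
import re

EMAIL = re.compile(r"""
    \b                    # start of a word
    [A-Za-z0-9._%+-]+     # local part (before @)
    @                     # at symbol
    [A-Za-z0-9.-]+        # domain name
    \.                    # dot before the top-level domain
    [A-Za-z]{2,}          # top-level domain
    \b                    # end of a word
""", re.VERBOSE)

text = "Contact alice@example.com or bob.smith@mail.example.org today."
addresses = EMAIL.findall(text)
print(addresses)
```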

Key Points to Remember

  • Comments in regex are added using the (?#comment) syntax.

  • Free-spacing mode makes regex patterns more readable by allowing spaces and line breaks.

  • Supported engines include JGsoft, .NET, Perl, PCRE, Python, and Ruby.

  • Java supports free-spacing mode, where # starts a comment that runs to the end of the line, but it does not support the (?#comment) syntax.


When to Use Comments and Free-Spacing Mode

Use comments and free-spacing mode when:

  1. Your regex pattern is complex and hard to read.

  2. You’re working on a team and need to make your patterns understandable to others.

  3. You need to revisit your regex after some time and want to avoid deciphering cryptic patterns.

Adding comments and using free-spacing mode can greatly enhance the readability and maintainability of your regular expressions. Complex patterns become easier to understand, update, and share with others. When working with modern regex engines, take advantage of these features to write cleaner, more maintainable regex patterns.

By making your regex more human-readable, you’ll save time and reduce frustration when dealing with intricate text-processing tasks.

The \b metacharacter is an anchor, similar to the caret (^) and dollar sign ($). It matches a zero-length position called a word boundary. Word boundaries allow you to perform “whole word” searches in a string using patterns like \bword\b.


What is a Word Boundary?

A word boundary occurs at three possible positions in a string:

  1. Before the first character if it is a word character.

  2. After the last character if it is a word character.

  3. Between two characters where one is a word character and the other is a non-word character.

A word character includes letters, digits, and the underscore ([a-zA-Z0-9_]). Non-word characters are everything else.


Example Usage

The pattern \bword\b matches the word "word" only if it appears as a standalone word in the text.

Regex    String                     Matches
\b4\b    "There are 44 sheets"      No
\b4\b    "Sheet number 4 is here"   Yes

Digits are considered word characters, so \b4\b will match a standalone "4" but not when it is part of "44."
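The two rows of the table can be reproduced with a few lines of Python:

```python
import re

whole_four = re.compile(r"\b4\b")

# Inside "44" each 4 has another word character next to it.
in_44 = whole_four.search("There are 44 sheets")

# A standalone 4 has a boundary on both sides.
standalone = whole_four.search("Sheet number 4 is here").group()

print(in_44, standalone)
```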


Negated Word Boundaries

The \B metacharacter is the negated version of \b. It matches any position that is not a word boundary.

Regex     String                 Matches
\Bis\B    "This is a test"       No
\Bis\B    "This list is short"   Yes

\Bis\B matches "is" only when word characters surround it on both sides, as in "list." It does not match "is" as a standalone word, nor at the end of "This," where a word boundary follows the "s."


Looking Inside the Regex Engine

Let’s see how the regex \bis\b works on the string "This island is beautiful":

  1. The engine starts with \b at the first character "T." Since \b is zero-width, it checks the position before "T." It matches because "T" is a word character, and the position before it is the start of the string.

  2. The engine then checks the next token, i, which does not match "T," so it moves to the next position.

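  3. At the "is" inside "This," the leading \b fails because "h" and "i" are both word characters. At the "is" that begins "island," \b, i, and s all match, but the trailing \b fails because "s" is followed by "l."

  4. At the standalone "is," the first \b matches after the space, i and s match, and the final \b matches before the space that follows, completing the match.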
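To see where the engine ends up, the sketch below lists the zero-based offsets at which the plain text "is" occurs and the offsets at which \bis\b matches in the same string.

```python
import re

text = "This island is beautiful"

# Every occurrence of the letters "is", regardless of context.
plain = [m.start() for m in re.finditer(r"is", text)]

# Only occurrences with a word boundary on both sides.
whole_word = [m.start() for m in re.finditer(r"\bis\b", text)]

print(plain, whole_word)
```

The occurrences at offsets 2 (in "This") and 5 (in "island") are rejected; only the standalone word at offset 12 survives.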


Tcl Word Boundaries

Most regex flavors use \b for word boundaries. However, Tcl uses different syntax:

  • \y matches a word boundary.

  • \Y matches a non-word boundary.

  • \m matches only the start of a word.

  • \M matches only the end of a word.

For example, in Tcl:

  • \mword\M matches "word" as a whole word.

In most flavors, you can achieve the same with \bword\b.


Emulating Tcl Word Boundaries

If your regex flavor supports lookahead and lookbehind, you can emulate Tcl’s \m and \M:

  • (?<!\w)(?=\w): Emulates \m.

  • (?<=\w)(?!\w): Emulates \M.

For flavors without lookbehind, use:

  • \b(?=\w) to emulate \m.

  • \b(?!\w) to emulate \M.
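Python has no \m or \M, so it makes a convenient test bed for these emulations. The sketch below searches for "cat" at the start of a word and at the end of a word; the sample string is an assumption, and the lists hold zero-based match offsets.

```python
import re

text = "cat concat category"

def starts(pattern):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# Start-of-word emulations (Tcl's \m) placed in front of "cat".
word_start_lookaround = starts(r"(?<!\w)(?=\w)cat")
word_start_boundary = starts(r"\b(?=\w)cat")

# End-of-word emulations (Tcl's \M) placed after "cat".
word_end_lookaround = starts(r"cat(?<=\w)(?!\w)")
word_end_boundary = starts(r"cat\b(?!\w)")

print(word_start_lookaround, word_end_lookaround)
```

Offset 0 is the standalone "cat", offset 7 is the "cat" that ends "concat", and offset 11 is the "cat" that begins "category".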


GNU Word Boundaries

GNU extensions to POSIX regular expressions support \b and \B. Additionally, GNU regex introduces:

  • \<: Matches the start of a word (like Tcl’s \m).

  • \>: Matches the end of a word (like Tcl’s \M).

These additional tokens provide flexibility when working with word boundaries in GNU-based tools.


Summary

Word boundaries are crucial for identifying standalone words in text. They prevent partial matches within larger words and ensure more precise regex patterns. Understanding how to use \b, \B, and their equivalents in various regex flavors will help you craft better, more accurate regular expressions.
