Character Classes or Character Sets (Page 7)

Character classes, also known as character sets, allow you to define a set of characters that a regex engine should match at a specific position in the text. To create a character class, place the desired characters between square brackets. For instance, to match either an a or an e, use the pattern [ae]. This can be particularly useful when dealing with variations in spelling, such as in the regex gr[ae]y, which will match both "gray" and "grey."

Key Points About Character Classes:

  • A character class matches only a single character.

  • The order of characters inside a character class does not affect the outcome.

For example, gr[ae]y will not match "graay" or "graey," as the class only matches one character from the set at a time.
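The two key points above can be checked with a minimal Python sketch. Python's standard re module is our choice for illustration, and the variable names are hypothetical. It uses fullmatch so that each candidate word has to match in its entirety.

```python
import re

words = ["gray", "grey", "graay", "graey", "gry"]

# [ae] consumes exactly one character, so only two spellings match in full.
matches = [w for w in words if re.fullmatch(r"gr[ae]y", w)]
print(matches)  # ['gray', 'grey']

# Swapping the order inside the class gives the same result.
reordered = [w for w in words if re.fullmatch(r"gr[ea]y", w)]
print(reordered == matches)  # True
```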


Using Ranges in Character Classes

You can specify a range of characters within a character class by using a hyphen (-). For example:

  • [0-9] matches any digit from 0 to 9.

  • [a-fA-F] matches any letter from a to f, regardless of case.

You can also combine multiple ranges and individual characters within a character class:

  • [0-9a-fxA-FX] matches any hexadecimal digit or the letter X.

Again, the order of characters inside the class does not matter.
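As a small illustration of combined ranges, the following Python sketch (using the standard re module; the sample text is made up) collects every character that the class accepts and confirms that rearranging the class members changes nothing.

```python
import re

text = "0x1F, g, Z"

# Digits, a-f, A-F, plus the literal letters x and X.
found = re.findall(r"[0-9a-fxA-FX]", text)
print(found)  # ['0', 'x', '1', 'F']

# The same class with its members rearranged.
rearranged = re.findall(r"[XxA-Fa-f0-9]", text)
print(rearranged == found)  # True
```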


Useful Applications of Character Classes

Here are some practical use cases for character classes:

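  • sep[ae]r[ae]te: Matches "separate" and the common misspelling "seperate." Because each class is independent, it also matches "separete" and "seperete."

  • li[cs]en[cs]e: Matches "license" and "licence," as well as the misspellings "lisence" and "lisense."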

  • [A-Za-z_][A-Za-z_0-9]*: Matches identifiers in programming languages.

  • 0[xX][A-Fa-f0-9]+: Matches C-style hexadecimal numbers.
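The patterns in this list can be tried out with a short Python sketch. The sample strings are invented for illustration, and the standard re module is assumed.

```python
import re

# Every combination of a/e is accepted, not only the two common spellings.
spellings = ["separate", "seperate", "separete", "seperete"]
all_match = all(re.fullmatch(r"sep[ae]r[ae]te", s) for s in spellings)
print(all_match)  # True

# Identifiers: a letter or underscore, then letters, digits, or underscores.
identifiers = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", "total_1 = _tmp + x2")
print(identifiers)  # ['total_1', '_tmp', 'x2']

# C-style hex: 0x or 0X followed by at least one hex digit.
hex_numbers = re.findall(r"0[xX][A-Fa-f0-9]+", "mask = 0xFF | 0X1a | 0xZZ")
print(hex_numbers)  # ['0xFF', '0X1a']
```

Note that "0xZZ" is skipped because the plus sign requires at least one character from the class after the x.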


Negated Character Classes

By adding a caret (^) immediately after the opening square bracket, you create a negated character class. This instructs the regex engine to match any character not in the specified set.

For example:

  • q[^u]: Matches a q followed by any character except u.

However, it's essential to remember that a negated character class must still match a character. It means "a character that is not u," not "no u." For instance, q[^u] will match the q and the space in "Iraq is a country," but it will not match the q in "Iraq" by itself, because nothing follows the q for the class to consume.

To ensure that the q is not followed by a u, use negative lookahead: q(?!u). We will cover lookaheads later in this tutorial.
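The difference between the negated class and the lookahead can be seen in a brief Python sketch (standard re module; variable names are hypothetical).

```python
import re

# The negated class consumes the character after q, here a space.
with_space = re.search(r"q[^u]", "Iraq is a country").group()
print(repr(with_space))  # 'q '

# At the end of the string there is nothing left for [^u] to consume.
at_end = re.search(r"q[^u]", "Iraq")
print(at_end)  # None

# Negative lookahead only asserts that no u follows; it consumes nothing.
lookahead = re.search(r"q(?!u)", "Iraq").group()
print(lookahead)  # q

followed_by_u = re.search(r"q(?!u)", "quit")
print(followed_by_u)  # None
```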


Metacharacters Inside Character Classes

Inside character classes, most metacharacters lose their special meaning. However, a few characters retain their special roles:

  • Closing bracket (])

  • Backslash (\)

  • Caret (^) (only if it appears immediately after the opening bracket)

  • Hyphen (-) (only if placed between characters to specify a range)

To include these characters as literals:

  • Backslash (\) must be escaped with another backslash: [\\].

  • Caret (^) can appear anywhere except right after the opening bracket.

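  • Closing bracket (]) can be placed right after the opening bracket or caret in most flavors. JavaScript is an exception: there, [] is an empty class, so ] must be escaped as \].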

  • Hyphen (-) can be placed at the start or end of the class.

Examples:

  • [x^] matches x or ^.

  • []x] matches ] or x.

  • [^]x] matches any character that is not ] or x.

  • [-x] matches x or -.
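The examples above can be verified in Python, as sketched below. This reflects the behavior of Python's re module only; other flavors may treat an unescaped ] at the start of a class differently. Raw strings are used so that Python itself does not interpret the backslashes.

```python
import re

caret = re.findall(r"[x^]", "x^y")
print(caret)  # ['x', '^']

bracket = re.findall(r"[]x]", "a]x")
print(bracket)  # [']', 'x']

negated = re.findall(r"[^]x]", "a]xb")
print(negated)  # ['a', 'b']

hyphen = re.findall(r"[-x]", "a-x")
print(hyphen)  # ['-', 'x']

# A literal backslash needs escaping inside the class: [\\]
backslash = re.findall(r"[\\]", r"a\b")
print(backslash)  # ['\\'] (a single backslash, shown escaped by repr)
```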


Shorthand Character Classes

Shorthand character classes are predefined character sets that simplify your regex patterns. Here are the most common shorthand classes:

Shorthand   Meaning                    Equivalent Character Class
\d          Any digit                  [0-9]
\w          Any word character         [A-Za-z0-9_]
\s          Any whitespace character   [ \t\r\n]

Details:

  • \d matches digits from 0 to 9.

  • \w includes letters, digits, and underscores.

  • \s matches spaces, tabs, and line breaks. In some flavors, it may also include form feeds and vertical tabs.

The characters included in these shorthand classes may vary depending on the regex flavor. For example:

  • JavaScript treats \d and \w as ASCII-only but includes Unicode characters for \s.

  • XML handles \d and \w as Unicode but limits \s to ASCII characters.

  • Python lets you control what the shorthand classes match with flags such as re.ASCII.

Shorthand character classes can be used both inside and outside of square brackets:

  • \s\d matches a whitespace character followed by a digit.

  • [\s\d] matches a single character that is either whitespace or a digit.

For instance, when applied to the string "1 + 2 = 3":

  • \s\d matches the space and the digit 2.

  • [\s\d] matches the digit 1.

The character class [\da-fA-F] matches a hexadecimal digit. When \d is ASCII-only, it is equivalent to [0-9a-fA-F].
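Here is a hedged Python sketch of the "1 + 2 = 3" example, followed by a demonstration of how a flag changes what \d matches. It assumes Python 3, where string patterns are Unicode-aware by default; the Arabic-Indic digit is just one convenient non-ASCII digit chosen for illustration.

```python
import re

text = "1 + 2 = 3"

# Two tokens in sequence: a whitespace character, then a digit.
sequence = re.search(r"\s\d", text).group()
print(repr(sequence))  # ' 2'

# One class: a single character that is whitespace or a digit.
single = re.search(r"[\s\d]", text).group()
print(repr(single))  # '1'

# U+0663 is ARABIC-INDIC DIGIT THREE, a digit outside the ASCII range.
arabic_indic_three = "\u0663"
unicode_digit = bool(re.fullmatch(r"\d", arabic_indic_three))
ascii_digit = bool(re.fullmatch(r"\d", arabic_indic_three, re.ASCII))
print(unicode_digit, ascii_digit)  # True False
```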


Negated Shorthand Character Classes

The primary shorthand classes also have negated versions:

  • \D: Matches any character that is not a digit. Equivalent to [^\d].

  • \W: Matches any character that is not a word character. Equivalent to [^\w].

  • \S: Matches any character that is not whitespace. Equivalent to [^\s].

Be careful when using negated shorthand inside square brackets. For example:

  • [\D\S] is not the same as [^\d\s].

    • [\D\S] will match any character, including digits and whitespace, because a digit is not whitespace and whitespace is not a digit.

    • [^\d\s] will match any character that is neither a digit nor whitespace.
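The pitfall is easy to reproduce. The following Python sketch (standard re module; the sample string is made up) runs both classes over the same mixed input.

```python
import re

sample = "a1 _\t9"

# "Not a digit OR not whitespace" is true for every character.
either = re.findall(r"[\D\S]", sample)
print(either)  # ['a', '1', ' ', '_', '\t', '9']

# "Neither a digit nor whitespace" leaves only the letter and underscore.
neither = re.findall(r"[^\d\s]", sample)
print(neither)  # ['a', '_']
```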


Repeating Character Classes

You can repeat a character class using quantifiers like ?, *, or +:

  • [0-9]+: Matches one or more digits and can match "837" as well as "222".

If you want to repeat the matched character instead of the entire class, you need to use backreferences:

  • ([0-9])\1+: Matches repeated digits, like "222," but not "837."

    • Applied to the string "833337," this regex matches "3333."

If you want more control over repeated matches, consider using lookahead and lookbehind assertions, which we will explore later in the tutorial.
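To contrast repeating the class with repeating the matched character, consider this short Python sketch (standard re module; variable names are hypothetical).

```python
import re

# Repeating the class: each position may be a different digit.
any_digits = re.search(r"[0-9]+", "837").group()
print(any_digits)  # 837

# Repeating the match: \1 must equal whatever the group captured.
no_repeat = re.search(r"([0-9])\1+", "837")
print(no_repeat)  # None

same_digit = re.search(r"([0-9])\1+", "222").group()
print(same_digit)  # 222

run_inside = re.search(r"([0-9])\1+", "833337").group()
print(run_inside)  # 3333
```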


Looking Inside the Regex Engine

As previously discussed, the order of characters inside a character class does not matter. For instance, gr[ae]y can match both "gray" and "grey."

Let's see how the regex engine processes gr[ae]y step by step:

Given the string:

"Is his hair grey or gray?"
  1. The engine starts at the first character and fails to match g until it reaches the 13th character.

  2. At the 13th character, g matches.

  3. The next token r matches the following character.

  4. The character class [ae] gives the engine two options:

    • First, it tries a, which fails.

    • Then, it tries e, which matches.

  5. The final token y matches the next character, completing the match.

The engine returns "grey" as the match result and stops searching, even though "gray" also exists in the string. This is because the regex engine is eager to report the first valid match it finds.

Understanding how the regex engine processes character classes helps you write more efficient patterns and predict match results more accurately.
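The walkthrough above can be confirmed with a minimal Python sketch. Note that Python reports zero-based positions, so the 13th character has index 12. The search function stops at the first match, while findall keeps scanning and is shown only for comparison.

```python
import re

text = "Is his hair grey or gray?"

# search returns the leftmost match and stops there.
first = re.search(r"gr[ae]y", text)
print(first.group(), first.start())  # grey 12

# findall continues after each match and reports every occurrence.
every = re.findall(r"gr[ae]y", text)
print(every)  # ['grey', 'gray']
```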
