Entries in this blog

Regular expressions can also match non-printable characters using special sequences. Here are some common examples:

  • \t: Tab character (ASCII 0x09)

  • \r: Carriage return (ASCII 0x0D)

  • \n: Line feed (ASCII 0x0A)

  • \a: Bell (ASCII 0x07)

  • \e: Escape (ASCII 0x1B)

  • \f: Form feed (ASCII 0x0C)

  • \v: Vertical tab (ASCII 0x0B)

Keep in mind that Windows text files use "\r\n" to terminate lines, while UNIX text files use "\n".
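As a rough illustration of these escapes, the sketch below uses Python's built-in re module. Python is a choice made for this example, not something the tutorial prescribes, and the sample text is made up. Python's re documentation does not list \e among its supported escapes, so the escape character is written in hexadecimal form here.

```python
import re

# Hypothetical sample: one Windows-style line followed by one UNIX-style line.
text = "name\tvalue\r\nnext line\n"

# \t matches the tab between "name" and "value".
tab_count = len(re.findall(r"\t", text))

# \r\n matches only the Windows line terminator.
crlf_count = len(re.findall(r"\r\n", text))

# \r?\n accepts either terminator, so it splits both kinds of lines.
lines = re.split(r"\r?\n", text)

# The escape character (ASCII 0x1B), written as a hexadecimal escape.
has_escape = re.search(r"\x1B", "\x1b[0m") is not None

print(tab_count, crlf_count, lines, has_escape)
```

Making the carriage return optional, as in «\r?\n», is one common way to handle files from both platforms with a single pattern.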

Hexadecimal and Unicode Characters

You can include any character in your regex using its hexadecimal or Unicode code point. For example:

  • \x09: Matches a tab character (same as \t).

  • \xA9: Matches the copyright symbol (©) in the Latin-1 character set.

  • \u20AC: Matches the euro currency sign (€) in Unicode.
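The three escapes above can be tried out with a short sketch in Python's re module (an assumption for illustration; other flavors may differ in which notations they accept). The sample strings are hypothetical.

```python
import re

# \x09 is the hexadecimal form of the tab character.
tab_match = re.fullmatch(r"\x09", "\t")

# \xA9 is the copyright sign (code point 0xA9).
copyright_match = re.search(r"\xA9", "Copyright \u00a9 Example Corp")

# \u20AC is the euro sign, written as a four-digit Unicode code point.
euro_match = re.search(r"\u20AC", "Price: \u20ac10")

print(tab_match is not None, copyright_match.group(), euro_match.group())
```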

Additionally, most regex flavors support control characters using the syntax \cA through \cZ, which correspond to Control+A through Control+Z. For example:

  • \cM: Matches a carriage return, equivalent to \r.

In XML Schema regex, the token «\c» is a shorthand for matching any character allowed in an XML name.

When working with a Unicode-aware regex engine, prefer the \uFFFF notation (four hexadecimal digits) over the two-digit \xFF form, which can only address code points up to 0xFF.

Table of Contents

  1. Regular Expression Tutorial

  2. Different Regular Expression Engines

  3. Literal Characters

  4. Special Characters

  5. Non-Printable Characters

  6. First Look at How a Regex Engine Works Internally

  7. Character Classes or Character Sets

  8. The Dot Matches (Almost) Any Character

  9. Start of String and End of String Anchors

  10. Word Boundaries

  11. Alternation with the Vertical Bar or Pipe Symbol

  12. Optional Items

  13. Repetition with Star and Plus

  14. Grouping with Round Brackets

  15. Named Capturing Groups

  16. Unicode Regular Expressions

  17. Regex Matching Modes

  18. Possessive Quantifiers

  19. Understanding Atomic Grouping in Regular Expressions

  20. Understanding Lookahead and Lookbehind in Regular Expressions (Lookaround)

  21. Testing Multiple Conditions on the Same Part of a String with Lookaround

  22. Understanding the \G Anchor in Regular Expressions

  23. Using If-Then-Else Conditionals in Regular Expressions

  24. XML Schema Character Classes and Subtraction Explained

  25. Understanding POSIX Bracket Expressions in Regular Expressions

  26. Adding Comments to Regular Expressions: Making Your Regex More Readable

  27. Free-Spacing Mode in Regular Expressions: Improving Readability

To go beyond matching literal text, regex engines reserve certain characters for special functions. These are known as metacharacters. The following characters have special meanings in most regex flavors discussed in this tutorial:

[ \ ^ $ . | ? * + ( )

If you need to use any of these characters as literals in your regex, you must escape them with a backslash (\). For instance, to match "1+1=2", you would write the regex as:

1\+1=2

Without the backslash, the plus sign is interpreted as a quantifier meaning "one or more of the preceding character," which changes what the pattern matches. For example, the regex «1+1=2» matches "111=2" in the string "123+111=234": «1+» matches "11", followed by the literal "1" and "=2".
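The difference between the escaped and unescaped plus sign can be checked with a minimal sketch. It assumes Python's re module, which is not the flavor this tutorial is written around but treats the plus sign the same way.

```python
import re

text = "123+111=234"

# Unescaped plus: "1+" means one or more "1" characters.
unescaped = re.search(r"1+1=2", text)

# Escaped plus: "\+" only matches a literal plus sign, which this text
# does not contain between two 1s.
escaped_miss = re.search(r"1\+1=2", text)

# The escaped pattern matches the text it was written for.
escaped_hit = re.search(r"1\+1=2", "1+1=2")

print(unescaped.group(), escaped_miss, escaped_hit.group())
```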

Escaping Special Characters

To escape a metacharacter, simply prepend a backslash (\). For example:

  • «\.» matches a literal dot.

  • «\*» matches a literal asterisk.

  • «\+» matches a literal plus sign.

Most regex flavors also support the \Q...\E escape sequence. This treats everything between \Q and \E as literal characters. For example:

\Q*\d+*\E

This pattern matches the literal text "*\d+*". The \E may be omitted at the end of the regex, in which case it is assumed. This syntax is supported by many engines, including Perl, PCRE, Java, and JGsoft, but it may have quirks in older Java versions.
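Not every flavor offers \Q...\E. Python's re module, used here purely as an illustration, does not, but its re.escape function serves a similar purpose by backslash-escaping the metacharacters in a string. A minimal sketch, with a made-up sample sentence:

```python
import re

literal = r"*\d+*"            # the text to be matched literally
pattern = re.escape(literal)  # backslash-escapes the metacharacters

# The escaped pattern only matches the exact characters of `literal`.
match = re.search(pattern, r"see *\d+* here")

# Digits between asterisks do not match, because \d is no longer a shorthand.
no_match = re.search(pattern, "*123*")

print(match.group(), no_match)
```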


Special Characters in Programming Languages

If you're a programmer, you might expect characters like single and double quotes to be special characters in regex. However, in most regex engines, they are treated as literal characters.

In programming, you must be mindful of characters that your language treats specially within strings. These characters will be processed by the compiler before being passed to the regex engine. For instance:

  • To use the regex «1\+1=2» in C++ code, you would write it as "1\\+1=2". The compiler converts the double backslash in the string literal into a single backslash for the regex engine.

  • To match a Windows file path like "c:\temp", the regex would be «c:\\temp», and in C++ code, it would be written as "c:\\\\temp".

Refer to the specific language documentation to understand how to handle regex patterns within your code.
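The same two layers of escaping appear in other languages. The sketch below uses Python rather than C++ (an assumption made to keep the example short and runnable) and compares a regular string literal with a raw string literal, which leaves backslashes untouched.

```python
import re

# The actual text to match: c:\temp (7 characters).
path = "c:\\temp"

# Regular string literal: each backslash is doubled once for Python
# and once more for the regex engine, giving four in the source code.
pattern_plain = "c:\\\\temp"

# Raw string literal: Python leaves backslashes alone, so only the
# regex-level escaping remains.
pattern_raw = r"c:\\temp"

print(pattern_plain == pattern_raw)
print(re.fullmatch(pattern_raw, path) is not None)
```

Languages with raw or verbatim string literals let you write the regex almost exactly as it appears in this tutorial, which tends to reduce escaping mistakes.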

The simplest regular expressions consist of literal characters. A literal character is a character that matches itself. For example, the regex «a» will match the first occurrence of the character "a" in a string. Consider the string "Jack is a boy": this pattern will match the "a" after the "J".

It’s important to note that the regex engine doesn’t care where the match occurs within a word unless instructed otherwise. If you want to match entire words, you’ll need to use word boundaries, a concept we’ll cover later.

Similarly, the regex «cat» will match "cat" in the string "About cats and dogs," even though it is only part of the word "cats." This pattern consists of three literal characters in sequence: c, a, and t. The regex engine looks for these characters in the specified order.

Case Sensitivity

By default, most regex engines are case-sensitive. This means that the pattern «cat» will not match "Cat" unless you explicitly configure the engine to perform a case-insensitive search.
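The literal-matching behavior described in this entry can be sketched in a few lines. Python's re module is assumed here for illustration; its re.IGNORECASE flag is one way of asking an engine for a case-insensitive search.

```python
import re

# «a» matches the first "a" it finds, which is the one right after "J".
first_a = re.search(r"a", "Jack is a boy")

# «cat» matches inside the longer word "cats".
cat = re.search(r"cat", "About cats and dogs.")

# Matching is case-sensitive unless told otherwise.
sensitive = re.search(r"cat", "Cat")
insensitive = re.search(r"cat", "Cat", re.IGNORECASE)

print(first_a.start(), cat.span(), sensitive, insensitive.group())
```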

A regular expression engine is a software component that processes regex patterns, attempting to match them against a given string. Typically, you won’t interact directly with the engine. Instead, it operates behind the scenes within applications and programming languages, which invoke the engine as needed to apply the appropriate regex patterns to your data or files.

Variations Across Regex Engines

As is often the case in software development, not all regex engines are created equal. Different engines support different regex syntaxes, often referred to as regex flavors. This tutorial focuses on the Perl 5 regex flavor, widely considered the most popular and influential. Many modern engines, including the open-source PCRE (Perl-Compatible Regular Expressions) engine, closely mimic Perl 5’s syntax but may introduce slight variations. Other notable engines include:

  • .NET Regular Expression Library

  • Java’s Regular Expression Package (included from JDK 1.4 onwards)

Whenever significant differences arise between flavors, this guide will highlight them, ensuring you understand which features are specific to Perl-derived engines.


Getting Hands-On with Regex

You can start experimenting with regular expressions in any text editor that supports regex functionality. One recommended option is EditPad Pro, which offers a robust regex engine in its evaluation version.

To try it out:

  1. Copy and paste the text from this page into EditPad Pro.

  2. From the menu, select Search > Show Search Panel to open the search pane at the bottom.

  3. In the Search Text box, type «regex».

  4. Check the Regular expression option.

  5. Click Find First to locate the first match. Use Find Next to jump to subsequent matches. When there are no more matches, the Find Next button will briefly flash.


A More Advanced Example

Let’s take it a step further. Try searching for the following regex pattern:

«reg(ular expressions?|ex(p|es)?)»

This pattern matches all variations of the term "regex" used on this page, whether singular or plural. Without regex, you’d need to perform five separate searches to achieve the same result. With regex, one pattern does the job, saving you significant time and effort.
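For readers without EditPad Pro, the same pattern can be tried in code. The following is a minimal sketch assuming Python's re module and a made-up sample sentence containing all five variations.

```python
import re

pattern = r"reg(ular expressions?|ex(p|es)?)"

# Hypothetical sample text containing each variation once.
text = "regex, regexp, regexes, regular expression, regular expressions"

# finditer is used instead of findall because the pattern contains
# groups, and findall would return the group contents rather than
# the full matches.
matches = [m.group(0) for m in re.finditer(pattern, text)]

print(matches)
```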

For instance, in EditPad Pro, select Search > Count Matches to see how many times the regex matches the text. This feature showcases the power of regex for efficient text processing.


Why Use Regex in Programming?

For programmers, regexes offer both performance and productivity benefits:

  • Efficiency: Even a basic regex engine can outperform state-of-the-art plain text search algorithms by applying a pattern once instead of running multiple searches.

  • Reduced Development Time: Checking if a user’s input resembles a valid email address can be accomplished with a single line of code in languages like Perl, PHP, Java, or .NET, or with just a few lines when using libraries like PCRE in C.

By incorporating regex into your workflows and applications, you can achieve faster, more efficient text processing and validation tasks.

Welcome to this comprehensive guide on Regular Expressions (Regex). This tutorial is designed to equip you with the skills to craft powerful, time-saving regular expressions from scratch. We'll begin with foundational concepts, ensuring you can follow along even if you're new to the world of regex. However, this isn't just a basic guide; we'll delve deeper into how regex engines operate internally, giving you insights that will help you troubleshoot and optimize your patterns effectively.

What Are Regular Expressions? — Understanding the Basics

At its core, a regular expression is a pattern used to match sequences of text. The term originates from formal language theory, but for practical purposes, it refers to text-matching rules you can use across various applications and programming languages.

You'll often encounter abbreviations like regex or regexp. In this guide, we'll use "regex" as it flows naturally when pluralized as "regexes." Throughout this manual, regex patterns will be displayed within guillemets: «pattern». This notation clearly differentiates the regex from surrounding text or punctuation.

For example, the simple pattern «regex» is a valid regex that matches the literal text "regex." The term match refers to the segment of text that the regex engine identifies as conforming to the specified pattern. Matches will be highlighted using double quotation marks, such as "match."

A First Look at a Practical Regex Example

Let's consider a more complex pattern:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

This regex describes an email address pattern. Breaking it down:

  • \b: Denotes a word boundary, so the match cannot start in the middle of a word.

  • [A-Z0-9._%+-]+: Matches one or more letters, digits, dots, underscores, percentage signs, plus signs, or hyphens.

  • @: The literal at-sign.

  • [A-Z0-9.-]+: Matches the domain name.

  • \.: A literal dot.

  • [A-Z]{2,4}: Matches the top-level domain (TLD) consisting of 2 to 4 letters.

  • \b: Ensures the match ends at a word boundary.

With this pattern, you can:

  • Search text files to identify email addresses.

  • Validate whether a given string resembles a legitimate email address format.
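Both uses can be sketched in Python (an assumption for illustration; the sample addresses are made up). Because the character classes in the pattern list only uppercase letters, the sketch turns on case-insensitive matching with re.IGNORECASE.

```python
import re

EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

# Use 1: search a piece of text for email addresses.
text = "Contact john.doe@example.com or support@mail.example.org today."
found = EMAIL.findall(text)

# Use 2: check whether a whole string resembles an email address.
looks_valid = EMAIL.fullmatch("john.doe@example.com") is not None
looks_invalid = EMAIL.fullmatch("not-an-email") is not None

# Limitation of this pattern: top-level domains longer than four
# letters are rejected.
long_tld = EMAIL.fullmatch("user@example.museum") is not None

print(found, looks_valid, looks_invalid, long_tld)
```

The pattern only checks that a string resembles an email address. It does not confirm that the address exists, and the 2-to-4-letter limit excludes longer top-level domains.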

In this tutorial, we'll refer to the text being processed as a string. This term is commonly used by programmers to describe a sequence of characters. Strings will be denoted using regular double quotes, such as "example string."

Regex patterns can be applied to any data that a programming language or software application can access, making them an incredibly versatile tool in text processing and data validation tasks.

Next, we'll explore how to construct regex patterns step by step, starting from simple character matches to more advanced techniques like capturing groups and lookaheads. Let's dive in!
