Jessica Brown

Regular Expression Tutorial (TOC)

By Jessica Brown
January 10
Tutorials

Regular Expressions Tutorial Table of Contents

Regular Expression Tutorial pg 1	Word Boundaries pg 10	Understanding Atomic Grouping in Regular Expressions pg 19
Different Regular Expression Engines pg 2	Alternation with the Vertical Bar or Pipeline Symbol pg 11	Understanding Lookahead and Lookbehind in Regular Expressions (Lookaround) pg 20
Literal Characters pg 3	Optional Items pg 12	Testing Multiple Conditions on the Same Part of a String with Lookaround pg 21
Special Characters pg 4	Repetition with Star and Plus pg 13	Understanding the \G Anchor in Regular Expressions pg 22
Non-Printable Characters pg 5	Grouping with Round Brackets pg 14	Using If-Then-Else Conditionals in Regular Expressions pg 23
First Look at How a Regex Engine Works Internally pg 6	Named Capturing Groups pg 15	XML Schema Character Classes and Subtraction Explained pg 24
Character Classes or Character Sets pg 7	Unicode Regular Expressions pg 16	Understanding POSIX Bracket Expressions in Regular Expressions pg 25
The Dot Matches (Almost) Any Character pg 8	Regex Matching Modes pg 17	Adding Comments to Regular Expressions: Making Your Regex More Readable pg 26
Start of String and End of String Anchors pg 9	Possessive Quantifiers pg 18	Free-Spacing Mode in Regular Expressions: Improving Readability pg 27

JavaScript Skill Progression: The Path from Beginner to Extreme

By Jessica Brown
January 23
Tutorials

Level 1 - The Foundations: Understanding JavaScript Basics

Introduction to JavaScript: What it is, how it works, and where it runs (browsers, Node.js). (part 1)
JavaScript Variables & Data Types: var, let, const, and primitive types (String, Number, Boolean, Undefined, Null, Symbol, BigInt). (part 2)
JavaScript Operators & Expressions: Arithmetic, comparison, logical, and assignment operators. (part 3)
JavaScript Conditional Statements: if, else, switch. (part 4)
JavaScript Loops & Iteration: for, while, do-while. (part 5)
JavaScript Functions: Function declarations, expressions, arrow functions, parameters, and return values. (part 6)
JavaScript Basic Debugging: console.log(), alert(), and browser developer tools. (part 7)

Level 2 - Building Blocks: DOM Manipulation & Event Handling

Introduction to the DOM (Document Object Model): Understanding the structure of an HTML document. (part 8)
Selecting Elements in JavaScript: document.getElementById(), document.querySelector(), document.querySelectorAll(). (part 9)
Modifying Elements in JavaScript: Changing text, attributes, classes, and styles dynamically. (part 10)
JavaScript Event Handling: addEventListener(), event types (click, mouseover, keypress, etc.). (part 11)
JavaScript Forms & User Input: Handling form submissions, input validation, and preventing default behavior. (part 12)
JavaScript Timers & Intervals: setTimeout(), setInterval(). (part 13)
Intro to JavaScript Browser Storage: Local Storage, Session Storage, and Cookies. (part 14)

Level 3 - Advancing Forward: Asynchronous JavaScript & APIs

Synchronous vs Asynchronous in JavaScript Programming: Understanding blocking vs non-blocking operations. (part 15)
JavaScript Callbacks & Callback Hell: Handling asynchronous execution with callback functions. (part 16)
Promises & .then() Chaining in JavaScript: Writing cleaner async code with Promise objects. (part 17)
JavaScript Async/Await: Modern async handling, try-catch for error handling. (part 18)
Working with APIs in JavaScript: Fetching data using fetch() and handling JSON responses. (part 19)
AJAX & HTTP Requests, the JavaScript Way: Understanding HTTP methods (GET, POST, PUT, DELETE). (part 20)
JavaScript Error Handling & Debugging: try, catch, finally, throw. (part 21)

Level 4 - Professional Development: Object-Oriented & Functional Programming

JavaScript Object Basics: Object literals, properties, methods. (part 22)
JavaScript Prototypes & Inheritance: Prototype chaining, Object.create(), and classes in ES6. (part 23)
Encapsulation & Private Methods in JavaScript: Using closures and ES6 classes to protect data. (part 24)
Functional Programming Principles Using JavaScript: Higher-Order Functions, Immutability, Closures & Lexical Scope, Array Methods, and Recursion. (part 25)

Level 5 - Expert Craftsmanship: Performance Optimization & Design Patterns

Code Performance & Optimization with JavaScript: Minimizing memory usage, reducing reflows, Event Loops, avoiding memory leaks (part 26)
Using JavaScript with Event Loop & Concurrency Model: How JavaScript handles tasks asynchronously. (part 27)
JavaScript’s Web Workers & Multithreading: Running JavaScript in parallel threads. (part 28)
Debouncing & Throttling with JavaScript: Optimizing performance-heavy event listeners. (part 29)
Design Patterns in JavaScript: Singleton, Factory, Observer, Module, and Proxy patterns (part 30)
JavaScript’s Best Security Practices: Avoiding XSS, CSRF, and SQL Injection, Sanitizing user inputs (part 31)

Level 6 - The Extreme Zone: Meta-Programming & JavaScript Internals

Understanding the JavaScript Engine: How V8 and SpiderMonkey parse and execute JavaScript. (part 32)
Execution Context & Call Stack: Understanding how JavaScript executes code. (part 33)
Memory Management & Garbage Collection: How JavaScript handles memory allocation. (part 34)
Proxies & Reflect API: Intercepting and customizing fundamental operations. (part 35)
Symbol & WeakMap Usage: Advanced ways to manage object properties. (part 36)
WebAssembly (WASM): Running low-level compiled code in the browser. (part 37)
Building Your Own Framework: Understanding how libraries like React, Vue, or Angular work under the hood. (part 38)
Node.js & Backend JavaScript: Running JavaScript outside the browser. (part 39)

Harden and Secure Linux Servers by Level (1 - 6)

By Jessica Brown
January 17
Tutorials

Securing a Linux server is an ongoing challenge. Every day, bad actors attempt to penetrate systems worldwide, using VPNs, IP spoofing, and other evasion tactics to obscure their origins. The source of an attack is often the least of your concerns; what matters most is implementing strong security measures to deter threats and protect your infrastructure. Hardening your servers not only makes them more resilient but also forces attackers to either move on or, ideally, abandon their efforts altogether.

This list of security recommendations is based on current best practices but should be implemented with caution. Always test configurations in a controlled environment before applying them to production servers. The examples and settings provided in each article are meant as guidelines and should be tailored to suit your specific setup. If you have any questions, sign up for an account and post them within the relevant article's discussion.

Build a Hardened and Secure Linux Server (Level 1)

Protecting a Linux server involves more than just installing and configuring it. Servers are constantly at risk from threats like brute-force attacks, malware, and misconfigurations. This guide outlines crucial steps to enhance your server’s security, providing clear instructions and explanations for each measure. By following these steps, you can significantly improve your server’s resilience against potential threats!

Strengthen and Secure Your Linux Server (Level 2)

Securing a Linux server goes beyond basic installation and configuration; it requires proactive measures to mitigate risks such as brute-force attacks, malware infiltration, and system misconfigurations. This guide provides a structured approach to hardening your server, detailing essential security best practices with step-by-step instructions. By implementing these measures, you can fortify your server against vulnerabilities, ensuring a more robust and resilient security posture.

Comprehensive Linux Server Hardening and Security Implementation (Level 3)

Achieving a truly secure Linux server requires a systematic and multi-layered approach, addressing both external threats and internal vulnerabilities. This guide delves into advanced security strategies, covering proactive defense mechanisms against brute-force attacks, malware infiltration, privilege escalation, and misconfigurations. It includes in-depth explanations of key hardening techniques, such as secure authentication methods, firewall optimization, intrusion detection systems, and least privilege enforcement. By following this guide, you will establish a fortified Linux environment with enhanced resilience against evolving cyber threats.

Advanced Linux Server Security and Threat Mitigation (Level 4)

At this level, securing a Linux server involves proactive measures that go beyond traditional hardening techniques. Advanced security configurations focus on mitigating sophisticated cyber threats, ensuring continuous monitoring, and implementing preventive controls. This guide explores methods such as sandboxing applications, enhancing authentication security, and conducting in-depth vulnerability assessments to fortify your server against emerging risks.

Enterprise-Grade Linux Security and Defense Mechanisms (Level 5)

As security threats become more sophisticated, enterprise-level hardening techniques ensure that a Linux server remains resilient against persistent and targeted attacks. This level focuses on securing sensitive data, enforcing strict access controls, and implementing deception technologies like honeypots to detect and analyze potential intrusions. By incorporating Zero-Trust principles and using Just-In-Time (JIT) access controls, organizations can minimize the risk of privilege escalation and unauthorized access.

Maximum Security and Compliance-Driven Hardening (Level 6)

At the highest level, Linux server security must meet stringent regulatory compliance requirements while maintaining peak resilience against cyber threats. This guide covers advanced measures such as kernel hardening with Grsecurity, comprehensive security event management, and role-based access control (RBAC) enforcement for applications. Additionally, it emphasizes data retention policies and deception techniques such as honeytokens to detect unauthorized access. These measures ensure long-term security, forensic readiness, and strict compliance with industry standards.

Free-Spacing Mode in Regular Expressions: Improving Readability (Page 27)

By Jessica Brown
January 10
Tutorials

Free-spacing mode, also known as whitespace-insensitive mode, allows you to write regular expressions with added spaces, tabs, and line breaks to make them more readable. This mode is supported by many popular regex engines, including JGsoft, .NET, Java, Perl, PCRE, Python, Ruby, and XPath.

How to Enable Free-Spacing Mode

To activate free-spacing mode, you can use the mode modifier (?x) within your regex. Alternatively, many programming languages and applications offer options to enable free-spacing mode when constructing regex patterns.

Here’s an example of how to enable free-spacing mode in a regex pattern:

(?x) (19|20) \d\d [- /.] (0[1-9]|1[012]) [- /.] (0[1-9]|[12][0-9]|3[01])
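
As a minimal sketch of the inline modifier in practice, the snippet below feeds the pattern above to Python's re module, one of the engines listed as supporting free-spacing mode. The helper name is_date and the sample strings are made up for illustration.

```python
import re

# The pattern from above; (?x) at the very start switches on free-spacing mode.
DATE_FREE_SPACING = r"(?x) (19|20) \d\d [- /.] (0[1-9]|1[012]) [- /.] (0[1-9]|[12][0-9]|3[01])"

# The same pattern written compactly, without free-spacing mode.
DATE_COMPACT = r"(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])"


def is_date(text, pattern=DATE_FREE_SPACING):
    """Return True if the whole string matches the date pattern."""
    return re.fullmatch(pattern, text) is not None


# Hypothetical sample inputs: both spellings of the pattern should agree.
for sample in ["2024-01-15", "1999/12/31", "2024-13-01"]:
    print(sample, is_date(sample), is_date(sample, DATE_COMPACT))
```

The spaces between tokens are dropped by the engine, while the space inside [- /.] is kept because it sits inside a character class.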

What Does Free-Spacing Mode Do?

In free-spacing mode, whitespace between regex tokens is ignored, allowing you to organize your regex pattern with spaces and line breaks for better readability.

For example, these two regex patterns are treated the same in free-spacing mode:

abc
a b c

However, whitespace within tokens is not ignored. Breaking up a token with spaces can change its meaning or cause syntax errors.

For instance:

Pattern	Explanation
`\d`	Matches a digit (0-9).
`\ d`	Matches a literal space followed by the letter "d".

The token \d must remain intact. Adding a space between the backslash and the letter changes its meaning.
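
The difference between whitespace between tokens and whitespace inside a token can be checked with a short sketch. It assumes Python's re module, where the re.VERBOSE flag turns on free-spacing mode; the helper name matches is hypothetical.

```python
import re


def matches(pattern, text):
    """Full-string match with free-spacing mode (re.VERBOSE) enabled."""
    return re.fullmatch(pattern, text, re.VERBOSE) is not None


# Whitespace between tokens is ignored, so "a b c" behaves like "abc".
print(matches(r"a b c", "abc"))    # the spaces in the pattern are dropped
print(matches(r"a b c", "a b c"))  # so the subject may not contain them

# \d is one token: it matches a single digit.
print(matches(r"\d", "7"))

# "\ d" is two tokens: an escaped (literal) space, then the letter d.
print(matches(r"\ d", " d"))
print(matches(r"\ d", "7"))
```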

Grouping Modifiers and Special Constructs

In free-spacing mode, special constructs like atomic groups, lookaround assertions, and named groups must remain intact. Splitting them with spaces will cause syntax errors.

Here are a few examples:

Correct	Incorrect	Explanation
`(?>atomic)`	`(? >atomic)`	The atomic group modifier `?>` must remain together.
`(?=condition)`	`(? =condition)`	The lookahead assertion `?=` cannot be split.
`(?P<name>group)`	`(?P <name>group)`	Named groups must be written as a single token.

Character Classes in Free-Spacing Mode

In most regex engines, character classes (enclosed in square brackets) are treated as single tokens, meaning free-spacing mode does not affect the whitespace inside them.

For example:

[abc]
[ a b c ]

In most regex engines, these two patterns are not the same:

[abc] matches any of the characters a, b, or c.
[ a b c ] matches a, b, c, or a space.
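
A small sketch of this behavior, assuming Python's re module with re.VERBOSE as the free-spacing flag (the helper name matches is hypothetical):

```python
import re


def matches(pattern, text):
    """Full-string match with free-spacing mode (re.VERBOSE) enabled."""
    return re.fullmatch(pattern, text, re.VERBOSE) is not None


# Inside a character class the spaces stay significant in Python,
# so the second class also accepts a space character.
print(matches(r"[abc]", " "))
print(matches(r"[ a b c ]", " "))
print(matches(r"[ a b c ]", "b"))
```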

However, Java’s free-spacing mode is an exception. In Java, whitespace inside character classes is ignored, so:

[abc]
[ a b c ]

Both patterns are treated the same in Java.

Important Notes for Java

In Java’s free-spacing mode:

The negating caret (^) must appear immediately after the opening bracket, with no whitespace in between.
- Correct: [^ a b c ] (Matches any character except a, b, or c).
- Incorrect: [ ^ a b c ] (The caret is no longer in first position, so this matches a caret, a, b, or c).

Adding Comments in Free-Spacing Mode

One of the most useful features of free-spacing mode is the ability to add comments to your regex patterns using the # symbol.

The # symbol starts a comment that runs until the end of the line.
Everything after the # is ignored by the regex engine.

Here’s an example of how comments can improve the readability of a complex regex pattern:

# Match a date in yyyy-mm-dd format
(19|20)\d\d      # Year (1900-2099)
[- /.]           # Separator (dash, space, slash, or dot)
(0[1-9]|1[012])  # Month (01 to 12)
[- /.]           # Separator
(0[1-9]|[12][0-9]|3[01])  # Day (01 to 31)

With comments and line breaks, this regex becomes much easier to understand and maintain.
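
As a rough illustration, the commented pattern above can be used almost verbatim in Python by passing the re.VERBOSE flag and a triple-quoted raw string. The variable names and the sample date are assumptions for the example.

```python
import re

# Free-spacing mode is enabled through the re.VERBOSE flag instead of (?x).
DATE = re.compile(
    r"""
    # Match a date in yyyy-mm-dd format
    (19|20)\d\d               # Year (1900-2099)
    [- /.]                    # Separator (dash, space, slash, or dot)
    (0[1-9]|1[012])           # Month (01 to 12)
    [- /.]                    # Separator
    (0[1-9]|[12][0-9]|3[01])  # Day (01 to 31)
    """,
    re.VERBOSE,
)

match = DATE.fullmatch("2024-01-15")
# The first group captures only the century prefix (19 or 20).
century, month, day = match.groups()
print(century, month, day)
```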

Which Regex Engines Support Free-Spacing Mode?

Here’s a quick overview of regex engines that support free-spacing mode and comments:

Regex Engine	Supports Free-Spacing Mode?	Supports Comments?
JGsoft	✅ Yes	✅ Yes
.NET	✅ Yes	✅ Yes
Java	✅ Yes	✅ Yes
Perl	✅ Yes	✅ Yes
PCRE	✅ Yes	✅ Yes
Python	✅ Yes	✅ Yes
Ruby	✅ Yes	✅ Yes
XPath	✅ Yes	❌ No

Summary of Key Rules for Free-Spacing Mode

Whitespace between tokens is ignored, making your regex more readable.
Whitespace within tokens is not ignored. Tokens like \d, (?=), and (?>) must remain intact.
Character classes are treated as single tokens in most engines, except for Java.
Comments can be added using the # symbol, except in XPath, where # is always treated as a literal character.

Putting It All Together: A Date Matching Example

Here’s how you can write a date-matching regex using free-spacing mode and comments for clarity:

(?x)             # Enable free-spacing mode (must come first)
# Match a date in yyyy-mm-dd format
(19|20)\d\d      # Year (1900-2099)
[- /.]           # Separator
(0[1-9]|1[012])  # Month (01 to 12)
[- /.]           # Separator
(0[1-9]|[12][0-9]|3[01])  # Day (01 to 31)

Without free-spacing mode, this same regex would look like this:

(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])

The difference in readability is clear.

Free-spacing mode is a valuable tool for improving the readability and maintainability of regular expressions. It allows you to format your patterns with spaces, line breaks, and comments, making complex regex easier to understand.

By taking advantage of free-spacing mode and comments, you can write cleaner regular expressions that are easier to debug, share, and update.

Adding Comments to Regular Expressions: Making Your Regex More Readable (Page 26)

By Jessica Brown
January 10
Tutorials

Regular expressions can quickly become complex and difficult to understand, especially when dealing with long patterns. To make them easier to read and maintain, many modern regex engines allow you to add comments directly into your regex patterns. This makes it possible to explain what each part of the expression does, reducing confusion and improving readability.

How to Add Comments in Regular Expressions

The syntax for adding a comment inside a regex is:

(?#comment)

The text inside the parentheses after ?# is treated as a comment.
The regex engine ignores everything inside the comment until it encounters a closing parenthesis ).
The comment can be anything you want, as long as it does not include a closing parenthesis.

For example, here’s a regex to match a valid date in the format yyyy-mm-dd, with comments to explain each part:

(?#year)(19|20)\d\d[- /.](?#month)(0[1-9]|1[012])[- /.](?#day)(0[1-9]|[12][0-9]|3[01])

This regex is much more understandable with comments:

(?#year): Marks the section that matches the year.
(?#month): Marks the section that matches the month.
(?#day): Marks the section that matches the day.

Without these comments, the regex would be difficult to decipher at a glance.
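
To check that inline comments really are ignored, here is a minimal sketch using Python's re module, which accepts the (?#comment) syntax. The constant names and the sample date are illustrative only.

```python
import re

COMMENTED = (
    r"(?#year)(19|20)\d\d[- /.]"
    r"(?#month)(0[1-9]|1[012])[- /.]"
    r"(?#day)(0[1-9]|[12][0-9]|3[01])"
)
PLAIN = r"(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])"

commented_match = re.fullmatch(COMMENTED, "2024-01-15")
plain_match = re.fullmatch(PLAIN, "2024-01-15")

# Comments do not create capturing groups, so both patterns
# produce the same groups for the same input.
print(commented_match.groups() == plain_match.groups())
```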

Benefits of Using Comments in Regular Expressions

Adding comments to your regex patterns offers several benefits:

Improves readability: Comments clarify the purpose of each section of your regex, making it easier to understand.
Simplifies maintenance: If you need to update a regex later, comments make it easier to remember what each part of the pattern does.
Helps collaboration: When sharing regex patterns with others, comments make it easier for them to follow your logic.

Using Free-Spacing Mode for Better Formatting

In addition to inline comments, many regex engines also support free-spacing mode, which allows you to add spaces and line breaks to your regex without affecting the match.

Free-spacing mode makes your regex more structured and readable by allowing you to organize it into logical sections. To enable free-spacing mode:

In Perl and Ruby, use the /x modifier. In Python, pass the re.VERBOSE (or re.X) flag. In PCRE-based tools, use the engine's extended option or the inline (?x) modifier.
In .NET, use the RegexOptions.IgnorePatternWhitespace option.
In Java, use the Pattern.COMMENTS flag.

Here’s an example of how free-spacing mode can improve the readability of a regex:

Without Free-Spacing Mode:

(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])

With Free-Spacing Mode and Comments:

(?#year) (19|20) \d\d        # Match years 1900 to 2099
[- /.]                       # Separator (dash, space, slash, or dot)
(?#month) (0[1-9] | 1[012])  # Match months 01 to 12
[- /.]                       # Separator
(?#day) (0[1-9] | [12][0-9] | 3[01])  # Match days 01 to 31

The second version is far easier to read and maintain.

Which Regex Engines Support Comments?

Most modern regex engines support the (?#comment) syntax for adding comments, including:

Regex Engine	Supports Comments?	Supports Free-Spacing Mode?
JGsoft	✅ Yes	✅ Yes
.NET	✅ Yes	✅ Yes
Perl	✅ Yes	✅ Yes
PCRE	✅ Yes	✅ Yes
Python	✅ Yes	✅ Yes
Ruby	✅ Yes	✅ Yes
Java	❌ No	✅ Yes (via `Pattern.COMMENTS`)

Example: Using Comments to Document a Complex Regex

Here’s an example of a more complex regex that extracts email addresses from a text file. Without comments, the regex looks like this:

\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b

Adding comments and using free-spacing mode makes it much more understandable:

\b                      # Word boundary to ensure we're at the start of a word
[A-Za-z0-9._%+-]+       # Local part of the email (before @)
@                       # At symbol
[A-Za-z0-9.-]+          # Domain name
\.                      # Dot before the top-level domain
[A-Za-z]{2,}            # Top-level domain (e.g., com, net, org)
\b                      # Word boundary to ensure we're at the end of a word
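
A short sketch of how the commented version might be used to pull addresses out of text, assuming Python's re module with the re.VERBOSE flag. The function name extract_emails and the sample sentence are hypothetical.

```python
import re

EMAIL = re.compile(
    r"""
    \b                 # Word boundary at the start of the address
    [A-Za-z0-9._%+-]+  # Local part of the email (before @)
    @                  # At symbol
    [A-Za-z0-9.-]+     # Domain name
    \.                 # Dot before the top-level domain
    [A-Za-z]{2,}       # Top-level domain (e.g., com, net, org)
    \b                 # Word boundary at the end of the address
    """,
    re.VERBOSE,
)


def extract_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL.findall(text)


sample = "Contact alice@example.com or bob.smith@mail.example.org."
print(extract_emails(sample))
```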

Key Points to Remember

Comments in regex are added using the (?#comment) syntax.
Free-spacing mode makes regex patterns more readable by allowing spaces and line breaks.
Supported engines include JGsoft, .NET, Perl, PCRE, Python, and Ruby.
Java supports free-spacing mode but does not support inline comments.

When to Use Comments and Free-Spacing Mode

Use comments and free-spacing mode when:

Your regex pattern is complex and hard to read.
You’re working on a team and need to make your patterns understandable to others.
You need to revisit your regex after some time and want to avoid deciphering cryptic patterns.

Adding comments and using free-spacing mode can greatly enhance the readability and maintainability of your regular expressions. Complex patterns become easier to understand, update, and share with others. When working with modern regex engines, take advantage of these features to write cleaner, more maintainable regex patterns.

By making your regex more human-readable, you’ll save time and reduce frustration when dealing with intricate text-processing tasks.

Understanding POSIX Bracket Expressions in Regular Expressions (Page 25)

By Jessica Brown
January 10
Tutorials

POSIX bracket expressions are a specialized type of character class used in regular expressions. Like standard character classes, they match a single character from a specified set of characters. However, they offer additional features such as locale support and unique character classes that aren't found in other regex flavors.

Key Differences Between POSIX Bracket Expressions and Standard Character Classes

POSIX bracket expressions are enclosed in square brackets ([]), just like regular character classes. However, there are some important differences:

No Escape Sequences: In POSIX bracket expressions, the backslash (\) is not treated as a metacharacter. This means that characters like \d or \w are interpreted as literal characters rather than shorthand classes.
For example:
- [\d] in a POSIX bracket expression matches either a backslash (\) or the letter d.
- In most other regex flavors, [\d] matches a digit.
Special Characters:
- To match a closing bracket (]), place it immediately after the opening bracket or negating caret (^).
- To match a hyphen (-), place it at the beginning or end of the expression.
- To match a caret (^), place it anywhere except immediately after the opening bracket.

Here’s an example of a POSIX bracket expression that matches various special characters:

[]\d^-]

This expression matches any of the following characters: ], \, d, ^, or -.
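
For contrast, the sketch below shows how a non-POSIX engine treats the backslash inside a class. It uses Python's re module, which does not follow POSIX bracket rules; the second pattern is an escaped rewrite that should accept the same five characters as the POSIX expression above. The names are illustrative.

```python
import re

# In Python, \d inside a character class is still the digit shorthand.
DIGIT_CLASS = re.compile(r"[\d]")

# Escaped rewrite listing the same characters as the POSIX []\d^-]:
# a closing bracket, a backslash, the letter d, a caret, and a hyphen.
POSIX_LIKE = re.compile(r"[\]\\d^-]")


def one_char(pattern, ch):
    """Return True if the single character ch is matched by pattern."""
    return pattern.fullmatch(ch) is not None


print(one_char(DIGIT_CLASS, "5"), one_char(DIGIT_CLASS, "d"))
print([one_char(POSIX_LIKE, ch) for ch in "]\\d^-"])
```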

POSIX Character Classes

POSIX defines a set of character classes that represent specific groups of characters. These classes adapt to the locale settings of the user or application, making them useful for handling different languages and cultural conventions.

Common POSIX Character Classes and Their Equivalents

POSIX Class	Description	ASCII Equivalent	Unicode Equivalent	Shorthand (if any)	Java Equivalent
`[:alnum:]`	Alphanumeric characters	`[a-zA-Z0-9]`	`[\p{L&}\p{Nd}]`		`\p{Alnum}`
`[:alpha:]`	Alphabetic characters	`[a-zA-Z]`	`\p{L&}`		`\p{Alpha}`
`[:ascii:]`	ASCII characters	`[\x00-\x7F]`	`\p{InBasicLatin}`		`\p{ASCII}`
`[:blank:]`	Space and tab characters	`[ \t]`	`[\p{Zs}\t]`		`\p{Blank}`
`[:cntrl:]`	Control characters	`[\x00-\x1F\x7F]`	`\p{Cc}`		`\p{Cntrl}`
`[:digit:]`	Digits	`[0-9]`	`\p{Nd}`	`\d`	`\p{Digit}`
`[:graph:]`	Visible characters	`[\x21-\x7E]`	`[^\p{Z}\p{C}]`		`\p{Graph}`
`[:lower:]`	Lowercase letters	`[a-z]`	`\p{Ll}`		`\p{Lower}`
`[:print:]`	Visible characters, including spaces	`[\x20-\x7E]`	`\P{C}`		`\p{Print}`
`[:punct:]`	Punctuation and symbols	`[!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]`	`[\p{P}\p{S}]`		`\p{Punct}`
`[:space:]`	Whitespace characters, including line breaks	`[ \t\r\n\v\f]`	`[\p{Z}\t\r\n\v\f]`	`\s`	`\p{Space}`
`[:upper:]`	Uppercase letters	`[A-Z]`	`\p{Lu}`		`\p{Upper}`
`[:word:]`	Word characters (letters, digits, underscores)	`[A-Za-z0-9_]`	`[\p{L}\p{N}\p{Pc}]`	`\w`
`[:xdigit:]`	Hexadecimal digits	`[A-Fa-f0-9]`	`[A-Fa-f0-9]`		`\p{XDigit}`

Using POSIX Bracket Expressions with Negation

You can negate POSIX bracket expressions by placing a caret (^) immediately after the opening bracket. For example:

[^x-z[:digit:]]

This pattern matches any character except x, y, z, or a digit.
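
Python's built-in re module does not recognize POSIX class names, so one way to experiment with them is to translate them first. The sketch below is a hypothetical helper (expand_posix_classes is not a standard function) that substitutes the ASCII equivalents from the table above; it ignores locale entirely and covers only a few classes.

```python
import re

# ASCII equivalents taken from the table above (locale support is NOT emulated).
POSIX_ASCII = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "space": r" \t\r\n\v\f",
    "upper": "A-Z",
    "xdigit": "A-Fa-f0-9",
}


def expand_posix_classes(pattern):
    """Replace [:name:] inside a bracket expression with its ASCII range."""
    def substitute(match):
        return POSIX_ASCII[match.group(1)]  # KeyError for unknown names
    return re.sub(r"\[:([a-z]+):\]", substitute, pattern)


translated = expand_posix_classes("[^x-z[:digit:]]")
print(translated)
print(bool(re.fullmatch(translated, "a")), bool(re.fullmatch(translated, "5")))
```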

Collating Sequences in POSIX Locales

A collating sequence defines how certain characters or character combinations should be treated as a single unit when sorting. For example, in Spanish, the sequence "ll" is treated as a single letter that falls between "l" and "m".

To use a collating sequence in a regex, enclose it in double square brackets:

[[.span-ll.]]

For example, the pattern:

torti[[.span-ll.]]a

Matches "tortilla" in a Spanish locale.

However, collating sequences are rarely supported outside of fully POSIX-compliant regex engines. Even within POSIX engines, the locale must be set correctly for the sequence to be recognized.

Character Equivalents in POSIX Locales

Character equivalents are another feature of POSIX locales that treat certain characters as interchangeable for sorting purposes. For example, in French:

é, è, and ê are treated as equivalent to e.
The word "élève" would come before "être" and "événement" in alphabetical order.

To use character equivalents in a regex, use the following syntax:

[[=e=]]

For example:

[[=e=]]xam

Matches any of "exam", "éxam", "èxam", or "êxam" in a French locale.

Best Practices for POSIX Bracket Expressions

Know your regex engine: Not all engines fully support POSIX bracket expressions, collating sequences, or character equivalents.
Be careful with negation: Make sure you understand how to negate POSIX bracket expressions to avoid unexpected matches.
Use locale settings appropriately: POSIX bracket expressions adapt to the locale, making them useful for multilingual text processing.

POSIX bracket expressions extend the functionality of traditional character classes by adding locale-specific character handling, collating sequences, and character equivalents. These features are particularly useful for handling text in different languages and cultural contexts.

However, due to limited support in many regex engines, it's important to understand your tool’s capabilities before relying on these features. If your regex engine doesn’t fully support POSIX bracket expressions, consider using Unicode properties and scripts as an alternative.

XML Schema Character Classes and Subtraction Explained (Page 24)

By Jessica Brown
January 10
Tutorials

XML Schema introduces unique character classes and features not commonly found in other regular expression flavors. These classes are particularly useful for validating XML names and values, making XML Schema regex syntax essential for working with XML data.

Special Character Classes in XML Schema

In addition to the six standard shorthand character classes (e.g., \d for digits, \w for word characters), XML Schema introduces four unique shorthand character classes designed specifically for XML name validation:

Character Class	Description	Equivalent
`\i`	Matches any valid first character of an XML name	`[_:A-Za-z]`
`\c`	Matches any valid subsequent character in an XML name	`[-._:A-Za-z0-9]`
`\I`	Negated version of `\i` (any character that cannot start an XML name)	`[^_:A-Za-z]`
`\C`	Negated version of `\c` (any character that cannot appear in an XML name)	`[^-._:A-Za-z0-9]`

These character classes simplify the creation of regex patterns for XML validation. For example, to match a valid XML name, you can use:

\i\c*

This regex matches an XML name like "xml:schema". Without these shorthand classes, the same pattern would need to be written as:

[_:A-Za-z][-._:A-Za-z0-9]*

The shorthand version is much more concise and easier to read.

Practical Examples Using XML Schema Character Classes

Here are some common use cases for these shorthand classes in XML validation:

Pattern	Description
`<\i\c*\s*>`	Matches an opening XML tag with no attributes
`</\i\c*\s*>`	Matches a closing XML tag
`<\i\c*(\s+\i\c*\s*=\s*("[^"]*"|'[^']*'))*\s*>`	Matches an opening XML tag with any number of attributes

For example, the pattern:

<(\i\c*(\s+\i\c*\s*=\s*("[^"]*"|'[^']*'))*|/\i\c*)\s*>

Matches both opening tags with attributes and closing tags.
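
Outside XML Schema, \i and \c are usually unavailable. As a sketch, the same pattern can be approximated in Python by substituting the ASCII equivalents from the table above; the variable names and sample tags are assumptions, and real XML names allow many non-ASCII characters that this ignores.

```python
import re

NAME_START = r"[_:A-Za-z]"      # ASCII stand-in for \i
NAME_CHAR = r"[-._:A-Za-z0-9]"  # ASCII stand-in for \c

NAME = f"{NAME_START}{NAME_CHAR}*"  # equivalent of \i\c*
ATTRIBUTE = rf"""\s+{NAME}\s*=\s*("[^"]*"|'[^']*')"""

# Opening tag with any number of attributes, or a closing tag.
TAG = re.compile(rf"<({NAME}({ATTRIBUTE})*|/{NAME})\s*>")

for candidate in ['<a href="x">', "</a>", "<1bad>"]:
    print(candidate, TAG.fullmatch(candidate) is not None)
```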

Character Class Subtraction in XML Schema

XML Schema introduces a powerful feature called character class subtraction, which allows you to exclude certain characters from a class. The syntax for character class subtraction is:

[class-[subtract]]

This feature simplifies regex patterns that would otherwise be lengthy or complex. For example:

[a-z-[aeiou]]

This pattern matches any lowercase letter except vowels (i.e., consonants). Without class subtraction, you’d have to list all consonants explicitly:

[b-df-hj-np-tv-z]
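
In engines without class subtraction, a negative lookahead is one common way to get the same effect. The sketch below uses Python's re module (this is an emulation, not XML Schema syntax) and checks that it agrees with the explicit consonant list; the names are illustrative.

```python
import re
import string

# Emulates [a-z-[aeiou]]: a lowercase letter that is not a vowel.
CONSONANT_LOOKAHEAD = re.compile(r"(?![aeiou])[a-z]")
# The explicit list of consonant ranges shown above.
CONSONANT_EXPLICIT = re.compile(r"[b-df-hj-np-tv-z]")


def matched_letters(pattern):
    """Return all lowercase ASCII letters the pattern accepts."""
    return "".join(ch for ch in string.ascii_lowercase if pattern.fullmatch(ch))


print(matched_letters(CONSONANT_LOOKAHEAD) == matched_letters(CONSONANT_EXPLICIT))
```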

Character class subtraction is more than just a shortcut — it allows you to use complex character class syntax within the subtracted class. For instance:

[\p{L}-[\p{IsBasicLatin}]]

This matches all Unicode letters except basic ASCII letters, effectively targeting non-English letters.

Nested Character Class Subtraction

One of the more advanced features of XML Schema regex is nested class subtraction, where you can subtract a class from another class that is already being subtracted. For example:

[0-9-[0-6-[0-3]]]

Let’s break this down:

0-6 matches digits from 0 to 6.
Subtracting 0-3 leaves 4-6.
The final class becomes 0-9-[4-6], which matches "0123789".
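
The arithmetic in this breakdown can be verified with ordinary set operations. This sketch does not use a regex engine at all; it simply models each class as a Python set of characters, with char_range as a hypothetical helper.

```python
def char_range(first, last):
    """Return the set of characters from first to last, inclusive."""
    return {chr(code) for code in range(ord(first), ord(last) + 1)}


# [0-6-[0-3]] leaves the digits 4 to 6.
inner = char_range("0", "6") - char_range("0", "3")

# [0-9-[4-6]] removes those from the full digit range.
result = "".join(sorted(char_range("0", "9") - inner))
print(result)  # 0123789
```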

Important Rules for Class Subtraction

The subtraction must always be the last element in the character class. For example:
✅ Correct: [0-9a-f-[4-6]]
❌ Incorrect: [0-9-[4-6]a-f]
Subtraction applies to the entire class, not just the last part. For example:
```
[\p{Ll}\p{Lu}-[\p{IsBasicLatin}]]
```
This pattern matches all uppercase and lowercase Unicode letters, excluding basic ASCII letters.

Notational Compatibility with Other Regex Flavors

Character class subtraction is not exclusive to XML Schema: the .NET and JGsoft regex engines support it as well. However, most other regex flavors (like Perl, JavaScript, and Python) don’t support this feature.

If you try to use a pattern like [a-z-[aeiou]] in a regex engine that doesn’t support class subtraction, it typically won’t throw an error, but it won’t behave as expected either. Instead, the engine will interpret the pattern as:

[a-z-[aeiou] followed by ]

That is, a character class followed by a literal closing bracket (]), which is not what you intended. The character class matches any one of:

Any lowercase letter (a-z)
A hyphen (-)
An opening bracket ([)
Any vowel (aeiou), which adds nothing because a-z already covers the vowels

The match must then continue with a literal ].

Because of this, be cautious when using character class subtraction in cross-platform regex patterns. Stick to traditional character classes if compatibility is a concern.
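
Python's re module is one of the flavors without class subtraction, so it makes a convenient place to observe this pitfall. The sketch below shows what the pattern accepts there; the helper name matches and the sample strings are illustrative.

```python
import re

# Parsed by Python as the class [a-z-[aeiou] followed by a literal ].
PATTERN = re.compile(r"[a-z-[aeiou]]")


def matches(text):
    """Return True if the whole string matches PATTERN."""
    return PATTERN.fullmatch(text) is not None


print(matches("b"))   # a lone consonant is not enough
print(matches("b]"))  # one class character plus the literal ]
print(matches("a]"))  # vowels are NOT excluded
print(matches("[]"))  # the opening bracket is part of the class
```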

Best Practices for XML Schema Regex

When using XML Schema regular expressions:

Leverage shorthand character classes like \i and \c to simplify patterns.
Use character class subtraction to exclude specific characters, especially when working with Unicode.
Be mindful of compatibility with other regex flavors. XML Schema regex syntax may not work in Perl, JavaScript, or Python without modification.

Summary of XML Schema Regex Features

Feature	Description	Example
`\i`	Matches valid first characters in XML names	`<\i\c*>`
`\c`	Matches valid subsequent characters in XML names	`<\i\c*(\s+\i\c*\s*=\s*"[^"]*")*\s*>`
Character Class Subtraction	Excludes characters from a class	`[a-z-[aeiou]]`
Nested Class Subtraction	Subtracts a class from an already subtracted class	`[0-9-[0-6-[0-3]]]`
Compatibility Considerations	Be cautious with subtraction in cross-platform patterns	`[a-z-[aeiou]]` in Perl behaves differently

XML Schema regular expressions introduce useful shorthand character classes and the powerful feature of character class subtraction, making them essential for validating XML documents efficiently. However, it’s important to understand the limitations and compatibility issues when using these features outside of XML Schema-specific environments.

By mastering these features, you’ll be able to write concise, effective regex patterns for parsing and validating XML content.

Using If-Then-Else Conditionals in Regular Expressions (Page 23)

By Jessica Brown
January 10
Tutorials

Conditional logic isn’t limited to programming languages — many modern regular expression engines allow if-then-else conditionals. This feature lets you apply different matching patterns based on a condition. The syntax for conditionals is:

(?(condition)then|else)

If the condition is met, the then part is attempted. If the condition is not met, the else part is applied instead. You can omit the else part if it’s not needed.

Conditional Syntax and How It Works

The syntax for if-then-else conditionals uses parentheses, starting with (?. The condition can either be:

A lookaround assertion (e.g., a lookahead or lookbehind).
A reference to a capturing group to check if it participated in the match.

Here’s how you can structure the syntax:

(?(?=regex)then|else)   # Using a lookahead as a condition  
(?(1)then|else)         # Using a capturing group as a condition

In the first example, the condition checks if a lookahead pattern is true. In the second example, it checks whether the first capturing group took part in the match.

Using Lookahead in Conditionals

Lookaround assertions (like lookahead) allow you to test if a certain pattern exists without consuming characters in the string. For example:

(?(?=\d{3})A|B)

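In this pattern, if the next three characters are digits (\d{3}), the engine attempts the then part, "A"; otherwise it attempts the else part, "B". The lookahead doesn’t consume any characters, so whichever branch is chosen is matched starting at the same position the lookahead examined. (That also means the then branch of this particular pattern can never succeed, because "A" is tested against a digit; the pattern only illustrates the syntax.)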

Using Capturing Groups in Conditionals

You can also check whether a capturing group has matched something earlier in the pattern. For example:

(a)?b(?(1)c|d)

This pattern checks if the first capturing group (containing "a") took part in the match:

If "a" was captured, the engine attempts to match "c" after "b".
If "a" wasn’t captured, it attempts to match "d" instead.

Example Walkthrough: `(a)?b(?(1)c|d)`

Let’s see how the regex (a)?b(?(1)c|d) behaves when applied to different strings:

String	Match?	Explanation
"bd"	✅ Yes	The first group doesn’t match "a", so it uses the else part and matches "d" after "b".
"abc"	✅ Yes	The first group captures "a", so the then part matches "c" after "b".
"bc"	❌ No	The first group doesn’t match "a", so the else part tries to match "d" after "b", but the next character is "c" and the match fails.
"abd"	✅ Yes	The first group captures "a", but the then part "c" fails against "d". The engine retries from the second character and matches "bd".

Optimizing the Pattern with Anchors

If you want to avoid unexpected matches like in the "abd" case, you can use anchors to ensure the pattern matches the entire string:

^(a)?b(?(1)c|d)$

This version only matches strings that fully adhere to the pattern. For example, it won’t match "abd": after the then part "c" fails against "d", the ^ anchor prevents the engine from retrying at the second character.
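The walkthrough can be reproduced in Python, whose re module supports conditionals on capturing groups. The sketch below runs the unanchored and the anchored pattern over the same four strings; the helper name matched_text is only an illustration.

```python
import re

unanchored = re.compile(r"(a)?b(?(1)c|d)")
anchored = re.compile(r"^(a)?b(?(1)c|d)$")

def matched_text(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

samples = ["bd", "abc", "bc", "abd"]
unanchored_results = {s: matched_text(unanchored, s) for s in samples}
anchored_results = {s: matched_text(anchored, s) for s in samples}

for s in samples:
    print(f"{s!r}: unanchored={unanchored_results[s]!r}, anchored={anchored_results[s]!r}")
```

Only "abd" differs between the two: the unanchored pattern finds "bd" inside it, while the anchored one finds nothing.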

Conditionals in Different Regex Engines

Not all regex engines support if-then-else conditionals. Here’s a quick overview of support across popular engines:

Regex Engine	Supports Conditionals?	Notes
Perl	✅ Yes	Offers the most flexibility with conditionals and capturing groups.
PCRE	✅ Yes	Widely used in programming languages like PHP.
.NET	✅ Yes	Supports both numbered and named capturing groups.
Python	✅ Yes	Supports conditionals with capturing groups, but not with lookaround.
JavaScript	❌ No	Does not support conditionals in regex.

In engines like .NET, you can use named capturing groups for more readable conditionals:

(?<test>a)?b(?(test)c|d)
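Python also accepts a group name as the condition, although it spells named groups (?P<name>...) rather than (?<name>...). A small sketch of the same pattern in Python syntax:

```python
import re

# Python writes named groups as (?P<name>...); the conditional refers to the name.
named = re.compile(r"(?P<test>a)?b(?(test)c|d)")

# fullmatch() requires the whole string to match, like adding ^ and $.
named_results = [bool(named.fullmatch(s)) for s in ("abc", "bd", "abd", "bc")]
print(named_results)  # [True, True, False, False]
```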

Example: Extracting Email Headers with Conditionals

Let’s apply conditionals to a practical example: extracting email headers from a message. Consider the following pattern:

^((From|To)|Subject): ((?(2)\w+@\w+\.[a-z]+|.+))

Here’s how it works:

The first part ((From|To)|Subject) captures the header name.
The conditional (?(2)...|...) checks if the second capturing group matched either "From" or "To".
- If it did, it matches an email address with \w+@\w+\.[a-z]+.
- If not, it matches any remaining text on the line with .+.
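As a quick check of this pattern, here is a Python sketch. The helper parse_header and the sample lines are illustrative assumptions; the regex itself is the one shown above.

```python
import re

header = re.compile(r"^((From|To)|Subject): ((?(2)\w+@\w+\.[a-z]+|.+))")

def parse_header(line):
    """Return (header name, value), or None if the line does not match."""
    m = header.match(line)
    return (m.group(1), m.group(3)) if m else None

parsed_from = parse_header("From: alice@example.com")
parsed_subject = parse_header("Subject: Meeting Notes")
parsed_bad = parse_header("To: not an address")  # "To" demands an email address

print(parsed_from, parsed_subject, parsed_bad)
```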

For example:

Input	Header Captured	Value Captured
"From: alice@example.com"	From	alice@example.com
"Subject: Meeting Notes"	Subject	Meeting Notes

Simplifying Complex Patterns

While conditionals can be useful, they can also make regular expressions difficult to read and maintain. In some cases, it’s better to use simpler patterns and handle the conditional logic in your code.

For example, instead of using a complex pattern like this:

^((From|To)|(Date)|Subject): ((?(2)\w+@\w+\.[a-z]+|(?(3)mm/dd/yyyy|.+)))

You could simplify it to:

^(From|To|Date|Subject): (.+)

Then, in your code, you can process each header separately based on what was captured in the first group. This approach is easier to maintain and often faster.
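One way this could look in Python is sketched below. The function classify, the EMAIL and DATE helper patterns, and the choice of \d{2}/\d{2}/\d{4} as a stand-in for mm/dd/yyyy are all assumptions made for illustration.

```python
import re

HEADER = re.compile(r"^(From|To|Date|Subject): (.+)")
EMAIL = re.compile(r"\w+@\w+\.[a-z]+")
DATE = re.compile(r"\d{2}/\d{2}/\d{4}")  # stands in for mm/dd/yyyy

def classify(line):
    """Split a header line, then validate the value according to its name."""
    m = HEADER.match(line)
    if not m:
        return None
    name, value = m.groups()
    if name in ("From", "To"):
        valid = EMAIL.fullmatch(value) is not None
    elif name == "Date":
        valid = DATE.fullmatch(value) is not None
    else:
        valid = True  # Subject accepts any non-empty text
    return name, value, valid

print(classify("Date: 01/23/2025"))
print(classify("From: bob"))
```

Each rule now lives in its own small pattern, so changing the date format does not touch the header regex.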

Summary

If-then-else conditionals in regular expressions provide a way to handle multiple match possibilities based on conditions. Whether you use capturing groups or lookaround assertions, this feature allows you to create more dynamic and flexible patterns.

However, because conditionals can make regex patterns more complex, use them carefully. In many cases, handling conditional logic in your code can be a cleaner and more efficient solution.

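Pattern	Description
`(?(1)c|d)`	Matches "c" if capturing group 1 took part in the match; otherwise matches "d".
`(?(?=\d{3})A|B)`	Attempts "A" if the next three characters are digits; otherwise attempts "B".
`(?<test>a)?b(?(test)c|d)`	Uses the named group "test" as the condition (.NET syntax).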

By understanding how to use conditionals, you can build more powerful and efficient regular expressions for various tasks like text parsing, validation, and data extraction.


Understanding the \G Anchor in Regular Expressions (Page 22)

By Jessica Brown
January 10
Tutorials

The \G anchor is a powerful tool in regular expressions, allowing matches to continue from the point where the previous match ended. It behaves similarly to the start-of-string anchor \A on the first match attempt, but its real utility shines when used in consecutive matches within the same string.

How the `\G` Anchor Works

The anchor \G matches the position immediately following the last successful match. During the initial match attempt, it behaves like \A, matching the start of the string. On subsequent attempts, it only matches at the point where the previous match ended.

For example, applying the regex \G\w to the string "test string" works as follows:

The first match finds "t" at the beginning of the string.
The second match finds "e" immediately after the first match.
The third match finds "s", and the fourth match finds the second "t".
The fifth attempt fails because the position after the second "t" is followed by a space, which is not a word character.

This behavior makes \G particularly useful for iterating through a string and applying patterns step-by-step.
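Python's re module has no \G anchor, but the same walk can be approximated with Pattern.match(text, pos), which only succeeds if the match starts exactly at pos. The sketch below is an emulation under that assumption, not \G itself, and the function name consecutive_matches is made up for the example.

```python
import re

word_char = re.compile(r"\w")

def consecutive_matches(text):
    """Collect matches that each start exactly where the previous one ended."""
    found = []
    pos = 0
    while True:
        m = word_char.match(text, pos)  # match() is anchored at pos, like \G
        if m is None:
            break  # the chain ends at the first position that does not match
        found.append(m.group(0))
        pos = m.end()
    return found

chain = consecutive_matches("test string")
print(chain)  # ['t', 'e', 's', 't']
```

The loop stops at the space, just as the fifth \G\w attempt fails in the description above.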

Key Difference: End of Previous Match vs. Start of Match Attempt

The behavior of \G can vary between different regex engines and tools.

In some environments, such as EditPad Pro, \G matches at the start of the match attempt rather than at the end of the previous match.
In EditPad Pro, the text cursor’s position determines where \G matches. After a match is found, the text cursor moves to the end of that match. As long as you don’t move the cursor between searches, \G behaves as expected and matches where the previous match left off. This behavior is logical in the context of text editors.

Using `\G` in Perl

In Perl, \G has a unique behavior due to its “magical” position tracking. The position of the last match is stored separately for each string variable, allowing one regex to pick up exactly where another left off.

This position tracking isn’t tied to any specific regex but is instead associated with the string itself. This flexibility allows developers to chain multiple regex patterns together to process a string in a step-by-step manner.

Important Tip: Using the `/c` Modifier

If a match attempt fails in Perl, the position tracked by \G resets to the start of the string. To prevent this, you can use the /c modifier, which keeps the position unchanged after a failed match.

Example: Parsing an HTML File with `\G` in Perl

Here’s a practical example of using \G in Perl to process an HTML file:

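while ($string =~ m/</g) {
    # pos($string) now sits just after the "<" that was found.
    # /c only takes effect together with /g; /gc keeps pos() after a failed match.
    if ($string =~ m/\GB>/gc) {
        # Bold tag
    } elsif ($string =~ m/\GI>/gc) {
        # Italics tag
    } else {
        # Other tags
    }
}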

In this example, the initial regex in the while loop finds the opening angle bracket (<). The subsequent regex patterns, using \G, check whether the tag is a bold (<B>) or italics (<I>) tag. This approach allows you to process the tags in the order they appear without needing a massive, complex regex to handle all possible tags at once.

`\G` in Other Programming Languages

While Perl offers extensive flexibility with \G, its behavior in other languages can be more restricted.

In Java, for example, the position tracked by \G is managed by the Matcher object, which is tied to a specific regular expression and subject string. You can manually configure a second Matcher to start at the end of the first match, allowing \G to match at that position.
Other languages and engines that support \G include .NET, Java, PCRE, and the JGsoft engine.

Summary

The \G anchor is a valuable tool for continuing regex matches from where the last match left off. While its behavior varies across different tools and languages, it provides a powerful way to process strings incrementally.

Here are a few key takeaways:

Feature	Description
`\G`	Matches at the position where the previous match ended
First Match Behavior	Acts like `\A`, matching the start of the string
Subsequent Matches	Matches immediately after the last successful match
Usage in Perl	Tracks the end of the previous match for each string variable
`/c` Modifier in Perl	Prevents the position from resetting to the start after a failed match
Supported Languages	.NET, Java, PCRE, JGsoft engine, and Perl

By understanding \G, you can write more efficient and maintainable regex patterns that process strings in a structured, step-by-step manner.


Testing Multiple Conditions on the Same Part of a String with Lookaround (Page 21)

By Jessica Brown
January 10
Tutorials

In regular expressions, it’s common to need a match that satisfies multiple conditions simultaneously. This is where lookahead and lookbehind, collectively known as lookaround assertions, come in handy. These zero-width assertions allow the regex engine to test conditions without consuming characters in the string, making it possible to apply multiple requirements to the same portion of text.

Why Lookaround Is Essential

Let’s say you want to match a six-letter word that contains the sequence “cat.” You could achieve this using multiple patterns combined with alternation, like this:

cat\w{3}|\wcat\w{2}|\w{2}cat\w|\w{3}cat

This approach works, but it becomes tedious and inefficient if you need to find words between 6 and 12 letters that contain different sequences like “cat,” “dog,” or “mouse.” In such cases, lookaround simplifies things considerably.

Using Lookahead to Match Multiple Requirements

To break down the process, let’s start with two simple conditions:

The word must be exactly six letters long.
The word must contain the sequence “cat.”

We can easily match a six-letter word using \b\w{6}\b and a word containing “cat” with \b\w*cat\w*\b. Combining both requirements with lookahead gives us:

(?=\b\w{6}\b)\b\w*cat\w*\b

Here’s how this works:

The positive lookahead (?=\b\w{6}\b) ensures the current position is at the start of a six-letter word.
Once the lookahead matches a six-letter word, the regex engine proceeds to check if the word contains “cat.”
If the word contains “cat,” the regex matches the entire word. If not, the engine moves to the next character and tries again.

Optimizing the Regex

While the above solution works, we can optimize it further for better performance. Let’s break down the optimization process:

Removing unnecessary word boundaries
The \b that follows the lookahead is guaranteed to match wherever the \b inside the lookahead did, and the final \b is already guaranteed by the \b that closes the lookahead (the greedy \w* runs to the end of the six-letter word). Both can be removed:
(?=\b\w{6}\b)\w*cat\w*
Optimizing the initial \w*
In a six-letter word containing “cat,” at most three letters can come before “cat.” So instead of using \w*, we can limit the token to at most three characters:
(?=\b\w{6}\b)\w{0,3}cat\w*
Adjusting the word boundary
The first word boundary \b doesn’t need to be inside the lookahead. We can move it outside for a cleaner expression:
\b(?=\w{6}\b)\w{0,3}cat\w*

This final regex is more efficient and easier to read. It ensures that the regex engine does minimal backtracking and quickly identifies six-letter words containing "cat."
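To confirm that the optimization does not change the results, the first and the final pattern can be run side by side. This is a small Python sketch; the sample sentence is invented for the test.

```python
import re

text = "The bobcat and the tomcat ignored the catnip, the cat, and the caterpillar."

original = re.compile(r"(?=\b\w{6}\b)\b\w*cat\w*\b")
optimized = re.compile(r"\b(?=\w{6}\b)\w{0,3}cat\w*")

original_hits = original.findall(text)
optimized_hits = optimized.findall(text)

print(optimized_hits)  # ['bobcat', 'tomcat', 'catnip']
```

"cat" (three letters) and "caterpillar" (eleven letters) are rejected by the lookahead before the second half of the pattern is even tried.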

A More Complex Example

Now, let’s say you want to find any word between 6 and 12 letters long that contains “cat,” “dog,” or “mouse.” You can use a similar approach with a lookahead to enforce the length requirement and a capturing group to match the specific sequences:

\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*

Breaking It Down:

\b(?=\w{6,12}\b) ensures the word is between 6 and 12 letters long.
\w{0,9} matches up to nine characters before one of the specified sequences.
(cat|dog|mouse) captures the sequence we’re looking for.
\w* matches the remaining characters in the word.

This pattern will successfully match any word within the specified length range that contains one of the target sequences. Additionally, the matching sequence ("cat," "dog," or "mouse") will be captured in a backreference for further use if needed.
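A short Python sketch shows both the length check and the captured sequence at work. The sample words are assumptions chosen to include one word that is too short and one that is too long.

```python
import re

pattern = re.compile(r"\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*")

# "cat" has 3 letters and "concatenating" has 13, so both fall outside 6-12.
text = "bulldog mousetrap cat category concatenating"

# group(0) is the whole word, group(1) the sequence captured by the group.
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(pairs)
```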

Lookaround assertions are powerful tools for creating efficient regular expressions that test multiple conditions on the same portion of text. By understanding how lookahead and lookbehind work and applying optimization techniques, you can create regex patterns that are both effective and efficient. Once you master lookaround, you'll find it invaluable for solving complex text-matching problems in a clean and concise way.

Optimized Example:

\b(?=\w{6}\b)\w{0,3}cat\w*

More Complex Example:

\b(?=\w{6,12}\b)\w{0,9}(cat|dog|mouse)\w*

These two patterns cover the cases discussed on this page: a fixed-length word containing one sequence, and a length range combined with several alternative sequences.


Understanding Lookahead and Lookbehind in Regular Expressions (Lookaround) (Page 20)

By Jessica Brown
January 10
Tutorials

Lookahead and lookbehind, often referred to collectively as "lookaround," are powerful constructs introduced in Perl 5 and supported by most modern regular expression engines. They are also known as zero-width assertions because they don’t consume characters in the input string. Instead, they simply assert whether a certain condition is true at a given position without including the matched text in the overall match result.

Lookaround constructs allow you to build more flexible and efficient regex patterns that would otherwise be lengthy or impossible to achieve using traditional methods.

What Are Zero-Width Assertions?

Zero-width assertions, like start (^) and end ($) anchors, match positions in a string rather than actual characters. The key difference is that lookaround assertions inspect the text ahead or behind a position to check if a certain pattern is possible, without moving the regex engine's position in the string.

For example, a positive lookahead ensures that a specific pattern follows a certain point, while a negative lookahead ensures that a certain pattern does not follow.

Positive and Negative Lookahead

Lookahead assertions check what comes after a certain position in the string without including it in the match.

Positive Lookahead (`(?=...)`)

A positive lookahead ensures that a particular sequence of characters follows the current position. For example, the regex q(?=u) matches the letter "q" only if it’s immediately followed by a "u," but it doesn’t include the "u" in the match result.

Negative Lookahead (`(?!...)`)

A negative lookahead ensures that a specific sequence does not follow the current position. For instance, q(?!u) matches a "q" only if it’s not followed by a "u."

Here’s how the regex engine processes the negative lookahead q(?!u) when applied to different strings:

For the string "Iraq", the regex matches the "q" because there’s no "u" immediately after it.
For the string "quit", the regex does not match the "q" because it’s followed by a "u."
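Both lookahead forms can be tried in Python's re module. The sketch below also checks where each match ends, which shows that the "u" is inspected but never consumed.

```python
import re

negative = re.compile(r"q(?!u)")
positive = re.compile(r"q(?=u)")

iraq = negative.search("Iraq")       # "q" at the end, nothing follows it
quit_neg = negative.search("quit")   # "q" is followed by "u", so no match
quit_pos = positive.search("quit")   # matches "q" only; "u" is not consumed

print(iraq.group(0), quit_neg, quit_pos.span())
```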

Positive and Negative Lookbehind

Lookbehind assertions work similarly but check what comes before the current position in the string.

Positive Lookbehind (`(?<=...)`)

A positive lookbehind ensures that a specific pattern precedes the current position. For example, (?<=a)b matches the letter "b" only if it’s preceded by an "a."

In the word "cab", the regex matches the "b" because it’s preceded by an "a."
In the word "bed", the regex does not match the "b" because the "b" is the first character and has no "a" before it.

Negative Lookbehind (`(?<!...)`)

A negative lookbehind ensures that a certain pattern does not precede the current position. For example, (?<!a)b matches a "b" only if it’s not preceded by an "a."

In the word "bed", the regex matches the "b" because it’s not preceded by an "a."
In the word "cab", the regex does not match the "b" because it is preceded by an "a."
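The four lookbehind cases above fit into one small Python check. Each word maps to a pair: whether (?<=a)b matched, then whether (?<!a)b matched.

```python
import re

after_a = re.compile(r"(?<=a)b")
not_after_a = re.compile(r"(?<!a)b")

lookbehind_results = {
    word: (bool(after_a.search(word)), bool(not_after_a.search(word)))
    for word in ("cab", "bed")
}
print(lookbehind_results)  # {'cab': (True, False), 'bed': (False, True)}
```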

Using Lookbehind for More Complex Patterns

Unlike lookahead, which allows any regular expression inside, lookbehind assertions are more limited in some regex flavors. Many engines require lookbehind patterns to have a fixed length because the regex engine needs to know exactly how far to step back in the string.

For example, the regex (?<=abc)d will match the "d" in the string "abcd". Its lookbehind is exactly three characters long, which satisfies engines such as Python's re module and older versions of Perl that require a fixed length.

Some modern engines, such as Java and PCRE, allow lookbehind patterns of varying lengths, provided they have a finite maximum length. For example, (?<=a|ab|abc)d would be valid in these engines, as each alternative has a fixed length.
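Python's re module is one of the stricter flavors: at the time of writing it rejects a lookbehind whose alternatives differ in length. The sketch below shows the error and one possible workaround, a separate fixed-width lookbehind per alternative; the variable names are only illustrative.

```python
import re

try:
    re.compile(r"(?<=a|ab|abc)d")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("rejected:", exc)

# Workaround: give each alternative its own fixed-width lookbehind.
workaround = re.compile(r"(?:(?<=a)|(?<=ab)|(?<=abc))d")
hit = workaround.search("abcd")
print(hit.group(0), hit.start())
```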

Lookaround in Practice: A Comparison

Consider the following two regex patterns for matching words that don’t end with "s":

\b\w+(?<!s)\b
\b\w+[^s]\b

When applied to the word "John's", the first pattern matches "John", while the second matches "John'" (including the apostrophe). The first pattern is generally more accurate and easier to understand.
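The difference is easy to verify in Python:

```python
import re

# The lookbehind only inspects the previous character; it does not consume one.
with_lookbehind = re.search(r"\b\w+(?<!s)\b", "John's").group(0)

# The negated class must consume a character, and the apostrophe qualifies.
with_negated_class = re.search(r"\b\w+[^s]\b", "John's").group(0)

print(repr(with_lookbehind), repr(with_negated_class))
```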

Limitations of Lookbehind

Not all regex flavors support lookbehind. For instance, JavaScript before ECMAScript 2018 and Ruby before 1.9 support lookahead but not lookbehind. Additionally, even in engines that support lookbehind, some limitations may apply:

Fixed-length requirement: Many regex flavors require lookbehind patterns to have a fixed length.
No unbounded repetition: In those flavors you cannot use quantifiers like * or + inside lookbehind.

Engines that allow full regular expressions inside lookbehind include the JGsoft engine, the .NET framework, and JavaScript (ECMAScript 2018 and later).

The Atomic Nature of Lookaround

One important characteristic of lookaround assertions is that they are atomic. This means that once the lookaround condition is satisfied, the regex engine does not backtrack to try other possibilities inside the lookaround.

For example, consider the regex (?=(\d+))\w+\1 applied to the string "123x12":

The lookahead (?=(\d+)) matches the digits "123" and captures them into \1.
The \w+ token matches the entire string.
The engine backtracks until \w+ matches only the "1" at the start of the string.
The engine tries to match \1 but fails because it cannot find "123" again at any position.

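Since lookaround is atomic, the engine cannot go back into the lookahead and have \d+ give up digits (capturing "12" instead of "123"), so that permutation is never tried. Retrying at later starting positions does not help either: the lookahead then captures "23" or "3", neither of which occurs again later in the string, so the overall match fails.

However, if you apply the same regex to the string "456x56", it will match "56x56": the attempt at the first character fails in the same way, but at the second character the lookahead captures "56", \w+ backtracks to "56x", and \1 matches the final "56".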
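Both outcomes can be checked with Python's re module, which treats lookahead as atomic in the same way:

```python
import re

pattern = re.compile(r"(?=(\d+))\w+\1")

no_match = pattern.search("123x12")  # \d+ cannot be shortened to "12" afterwards
match = pattern.search("456x56")     # succeeds when started at the second character

print(no_match)
print(match.group(0), match.group(1))
```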

Summary

Lookahead and lookbehind are essential tools for creating complex regex patterns. They allow you to assert conditions without consuming characters in the string.

Quick Reference for Lookaround Constructs:

Construct	Description	Example	Matches	Does Not Match
`(?=...)`	Positive Lookahead	`q(?=u)`	"quit"	"qit"
`(?!...)`	Negative Lookahead	`q(?!u)`	"qit"	"quit"
`(?<=...)`	Positive Lookbehind	`(?<=a)b`	"cab"	"bed"
`(?<!...)`	Negative Lookbehind	`(?<!a)b`	"bed"	"cab"

Use lookaround assertions carefully to optimize your regex patterns without accidentally excluding valid matches.


Understanding Atomic Grouping in Regular Expressions (Page 19)

By Jessica Brown
January 10
Tutorials

Atomic grouping is a powerful tool in regular expressions that helps optimize pattern matching by preventing unnecessary backtracking. Once the regex engine exits an atomic group, it discards all backtracking points created within that group, making it more efficient. Unlike regular groups, atomic groups are non-capturing, and their syntax is (?>group). Lookaround assertions like (?=...) and (?!...) are inherently atomic as well.

Atomic grouping is supported by many popular regex engines, including Java, .NET, Perl, Ruby, PCRE, and JGsoft. Additionally, some of these engines (such as Java and PCRE) offer possessive quantifiers, which act as shorthand for atomic groups.

How Atomic Groups Work: A Practical Example

Consider the following example:

The regular expression a(bc|b)c uses a capturing group and matches both "abcc" and "abc".
In contrast, the expression a(?>bc|b)c includes an atomic group and only matches "abcc", not "abc".

Here's what happens when the regex engine processes the string "abc":

For a(bc|b)c, the engine first matches a to "a" and bc to "bc". When the final c fails to match, the engine backtracks and tries the second option b inside the group. This results in a successful match with b followed by c.
For a(?>bc|b)c, the engine matches a to "a" and bc to "bc". However, since it's an atomic group, it discards any backtracking positions inside the group. When c fails to match, the engine has no alternatives left to try, causing the match to fail.
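Python's re module only accepts the (?>...) syntax from version 3.11 on, so the sketch below uses a well-known emulation instead: a lookahead that captures, followed by a backreference, written (?=(group))\1. It relies on lookahead being atomic and, unlike a real atomic group, it adds a capturing group. Treat it as an approximation of a(?>bc|b)c rather than the construct itself.

```python
import re

capturing = re.compile(r"a(bc|b)c")

# Emulated atomic group: the lookahead settles on the first alternative that
# fits, \1 then consumes exactly that text, and the engine cannot re-enter the
# lookahead to try the other alternative.
atomic_like = re.compile(r"a(?=(bc|b))\1c")

atomic_results = {
    s: (bool(capturing.fullmatch(s)), bool(atomic_like.fullmatch(s)))
    for s in ("abcc", "abc")
}
print(atomic_results)  # {'abcc': (True, True), 'abc': (True, False)}
```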

While this example is simple, it highlights the primary benefit of atomic groups: preventing unnecessary backtracking, which can significantly improve performance in certain situations.

Using Atomic Groups for Regex Optimization

Let’s explore a practical use case for optimizing a regular expression:

Imagine you're using the pattern \b(integer|insert|in)\b to search for specific words in a text. When this pattern is applied to the string "integers", the regex engine performs several steps before determining there’s no match.

It matches the word boundary \b at the start of the string.
It matches "integer", but the following boundary \b fails between "r" and "s".
The engine backtracks and tries the remaining alternatives: "insert" fails at its third character, and "in" matches but is followed by "t", so the word boundary \b fails again.

This process involves multiple backtracking attempts, which can be time-consuming, especially with large text files.

By converting the capturing group into an atomic group using \b(?>integer|insert|in)\b, we eliminate unnecessary backtracking. Once "integer" matches, the engine exits the atomic group and stops considering other alternatives. If \b fails, the engine moves on without trying "insert" or "in", making the process much more efficient.

This optimization is particularly valuable when your pattern includes repeated tokens or nested groups that could cause catastrophic backtracking.

A Word of Caution

While atomic grouping can improve performance, it’s essential to use it wisely. There are situations where atomic groups can inadvertently prevent valid matches.

For example:

The regex \b(?>integer|insert|in)\b will match the word "insert".
However, changing the order of the alternatives to \b(?>in|integer|insert)\b will cause the same pattern to fail to match "insert".

This happens because alternation is evaluated from left to right, and atomic groups prevent further attempts once a match is made. If the atomic group matches "in", it won’t check for "integer" or "insert".

In scenarios where all alternatives should be considered, it’s better to avoid atomic groups.
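The effect of the ordering can be demonstrated in Python. Because (?>...) requires Python 3.11 or later, this sketch again approximates the atomic group with a capturing lookahead plus backreference; the helper atomic_alternation is a hypothetical name introduced for the example.

```python
import re

def atomic_alternation(*alternatives):
    """Approximate an atomic alternation between word boundaries."""
    body = "|".join(re.escape(alt) for alt in alternatives)
    # (?=(...))\1 commits to the first alternative that matches.
    return re.compile(rf"\b(?=({body}))\1\b")

long_first = atomic_alternation("integer", "insert", "in")
short_first = atomic_alternation("in", "integer", "insert")

long_first_hit = long_first.search("insert")
short_first_hit = short_first.search("insert")  # commits to "in", then \b fails

print(long_first_hit.group(0), short_first_hit)
```

Listing longer alternatives before their own prefixes avoids the problem.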

Atomic grouping is a powerful technique to reduce backtracking in regular expressions, improving performance and preventing excessive match attempts. However, it’s crucial to understand its behavior and apply it thoughtfully to avoid unintentionally excluding valid matches. Proper use of atomic groups can make your regex patterns more efficient, especially when dealing with large datasets or complex patterns.


Possessive Quantifiers (Page 18)

By Jessica Brown
January 10
Tutorials

When working with repetition operators (also known as quantifiers) in regular expressions, it’s essential to understand the difference between greedy, lazy, and possessive quantifiers. Greedy and lazy quantifiers affect the order in which the regex engine tries to match permutations of the pattern. However, both types still allow the regex engine to backtrack through the pattern to find a match. Possessive quantifiers take a different approach—they do not allow backtracking once a match is made, which can impact performance and alter match results.

How Possessive Quantifiers Work

Possessive quantifiers are a feature of some modern regex engines, including JGsoft, Java, and PCRE. These quantifiers behave like greedy quantifiers by attempting to match as many characters as possible. However, once a match is made, possessive quantifiers lock in the match and refuse to give up characters during backtracking.

You can make a quantifier possessive by adding a + after it:

* (greedy) matches zero or more times.
*? (lazy) matches as few times as possible.
*+ (possessive) matches zero or more times but refuses to backtrack.

Other possessive quantifiers include ++, ?+, and {n,m}+.

Example of Possessive Quantifiers in Action

Consider the regex pattern "[^"]*+" applied to the string "abc":

The first " matches the opening quote.
The [^"]*+ matches the characters abc within the quotes.
The final " matches the closing quote.

In this case, the possessive quantifier behaves similarly to a greedy quantifier. However, if the string lacks a closing quote, the regex will fail faster with a possessive quantifier because there are no backtracking steps to try.

For instance, when applied to the string "abc, the possessive quantifier prevents the regex engine from backtracking to try alternate matches, immediately resulting in a failure when it encounters the missing closing quote. In contrast, a greedy quantifier would continue backtracking unnecessarily, trying to find a match.

When Possessive Quantifiers Matter

Possessive quantifiers are particularly useful for optimizing regex performance by preventing excessive backtracking. This is especially valuable in cases where:

You expect a match to fail.
The pattern includes nested quantifiers.

By using possessive quantifiers, you can reduce or eliminate catastrophic backtracking, which can slow down your regex significantly.

How Possessive Quantifiers Can Change Match Results

Possessive quantifiers can alter the outcome of a match. For example:

The pattern ".*" applied to the string "abc"x will match "abc".
The pattern ".*+" applied to the same string will fail to match because the possessive quantifier locks in the entire string, including the extra character x, preventing the second quote from matching.

This demonstrates that possessive quantifiers should be used carefully. The part of the pattern that follows the possessive quantifier must not be able to match any characters already consumed by the quantifier.
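Python's re module supports possessive quantifiers natively only from version 3.11 on. To stay version-independent, the sketch below approximates ".*+" with a capturing lookahead followed by a backreference, which behaves the same way for this example because the lookahead cannot be re-entered.

```python
import re

subject = '"abc"x'

greedy = re.search(r'".*"', subject)

# Approximation of ".*+" : (?=(.*))\1 grabs the rest of the line and never
# gives any of it back, so the closing quote has nothing left to match.
possessive_like = re.search(r'"(?=(.*))\1"', subject)

print(greedy.group(0), possessive_like)
```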

Using Atomic Grouping Instead of Possessive Quantifiers

Atomic groups offer a similar function to possessive quantifiers. They prevent backtracking within the group, making them a useful alternative for regex flavors that don’t support possessive quantifiers.

To create an atomic group, use the syntax (?>X*) instead of X*+. For example:

(?:a|b)*+ is equivalent to (?>(?:a|b)*).

The key difference is that the quantified token and the quantifier must be inside the atomic group for the effect to be the same. If the atomic group only surrounds the alternation (e.g., (?>a|b)*), the behavior will differ.

Example Comparison

Consider the following examples:

(?:a|b)*+b and (?>(?:a|b)*)b will both fail to match the string b because the possessive quantifier or atomic group prevents the pattern from backtracking.
In contrast, (?>a|b)*b will match b. The atomic group ensures that each alternation (a or b) doesn’t backtrack, but the outer greedy quantifier allows backtracking to match the final b.

Practical Tip for Conversion

When converting a regex from a flavor that supports possessive quantifiers to one that doesn’t, you can replace possessive quantifiers with atomic groups. For instance:

Replace X*+ with (?>X*).
Replace (?:a|b)*+ with (?>(?:a|b)*).

Third-party regex tools can automate this conversion and help keep patterns compatible across regex flavors.


Regex Matching Modes (Page 17)

By Jessica Brown
January 9
Tutorials

Most regular expression engines discussed in this tutorial support the following four matching modes:

Modifier	Description
/i	Makes the regex case-insensitive.
/s	Enables "single-line mode," making the dot (`.`) match newlines.
/m	Enables "multi-line mode," allowing caret (`^`) and dollar (`$`) to match at the start and end of each line.
/x	Enables "free-spacing mode," where whitespace is ignored, and `#` can be used for comments.

Specifying Modes Inside The Regular Expression

You can specify these modes within a regex using mode modifiers. For example:

(?i) turns on case-insensitive matching.
(?s) enables single-line mode.
(?m) enables multi-line mode.
(?x) enables free-spacing mode.

Example:

(?i)hello matches "HELLO"
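The four inline modifiers can be tried out in Python, which accepts them at the start of a pattern. The sample strings below are assumptions chosen to make each mode's effect visible.

```python
import re

case_insensitive = re.search(r"(?i)hello", "HELLO") is not None

# Without (?s) the dot stops at a newline.
dot_default = re.search(r"a.b", "a\nb") is not None
dot_single_line = re.search(r"(?s)a.b", "a\nb") is not None

# Without (?m) the caret only matches at the very start of the string.
first_words_default = re.findall(r"^\w+", "one\ntwo")
first_words_multiline = re.findall(r"(?m)^\w+", "one\ntwo")

# With (?x) whitespace is ignored and # starts a comment.
free_spacing = re.compile(r"""(?x)
    \d{3}   # first block of digits
    -       # separator
    \d{4}   # second block of digits
""")
free_spacing_hit = free_spacing.search("call 555-0100").group(0)

print(case_insensitive, dot_default, dot_single_line)
print(first_words_default, first_words_multiline, free_spacing_hit)
```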

Turning Modes On and Off for Only Part of the Regex

Modern regex flavors allow you to apply modifiers to specific parts of the regex:

(?i-sm) turns on case-insensitive mode while turning off single-line and multi-line modes.

To apply a modifier to only a part of the regex, you can use the following syntax:

(?i)word(?-i)Word

This pattern makes "word" case-insensitive but "Word" case-sensitive.

Modifier Spans

Modifier spans apply modes to a specific section of the regex:

(?i:word) makes "word" case-insensitive.
(?i:case)(?-i:sensitive) applies mixed modes within the regex.

Example:

(?i:ignorecase)(?-i:casesensitive)
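Support for these constructs differs between flavors. As a sketch of one case: Python's re module accepts modifier spans, but not a bare mid-pattern toggle such as (?-i), so a pattern like (?i)word(?-i)Word has to be rewritten with a span there.

```python
import re

span = re.compile(r"(?i:ignorecase)(?-i:casesensitive)")
span_results = [
    bool(span.fullmatch(s))
    for s in ("IgnoreCASEcasesensitive", "ignorecaseCASESENSITIVE")
]
print(span_results)  # [True, False]

# Global (?i) at the start, switched off again for "Word" only.
toggled = re.compile(r"(?i)word(?-i:Word)")
toggled_results = [bool(toggled.fullmatch(s)) for s in ("WORDWord", "WORDWORD")]
print(toggled_results)  # [True, False]
```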

Understanding matching modes is essential for writing efficient and accurate regex patterns. By leveraging modes like case-insensitivity, single-line, multi-line, and free-spacing, you can create more flexible and maintainable regular expressions.


Unicode Regular Expressions (Page 16)

By Jessica Brown
January 9
Tutorials

Unicode regular expressions are essential for working with text in multiple languages and character sets. As the world becomes more interconnected, supporting Unicode is increasingly important for ensuring that software can handle diverse text inputs.

What is Unicode?

Unicode is a standardized character set that encompasses characters and glyphs from all human languages, both living and dead. It aims to provide a consistent way to represent characters from different languages, eliminating the need for language-specific character sets.

Challenges with Unicode in Regular Expressions

Working with Unicode introduces unique challenges:

Characters, Code Points, and Graphemes:
- A single character (grapheme) may be represented by multiple code points. For example, the letter "à" can be represented as:
  - A single code point: U+00E0
  - Two code points: U+0061 ("a") + U+0300 (grave accent)
- Regular expressions that treat code points as characters may fail to match graphemes correctly.
Combining Marks:
- Combining marks are code points that modify the preceding character. For example, U+0300 (grave accent) is a combining mark that can be applied to many base characters.
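The code point problem can be seen in a small sketch. It assumes Python, whose built-in re module treats each code point as one character and has no \X token, so the example normalizes the input with the standard unicodedata module instead.

```python
import re
import unicodedata

precomposed = "\u00e0"   # "à" as a single code point
decomposed = "a\u0300"   # "a" followed by a combining grave accent

one_char = re.compile(r".")

# The dot consumes exactly one code point, so the two forms behave differently.
matches_precomposed = one_char.fullmatch(precomposed) is not None  # True
matches_decomposed = one_char.fullmatch(decomposed) is not None    # False

# NFC normalization composes the pair into the single code point U+00E0.
normalized = unicodedata.normalize("NFC", decomposed)
matches_normalized = one_char.fullmatch(normalized) is not None    # True
```

Normalizing both the pattern input and the subject text to the same form is one way to keep such mismatches out of the regex itself.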

Matching Unicode Graphemes

To match a single Unicode grapheme (character), use:

Perl, RegexBuddy, PowerGREP: \X
Java, .NET: \P{M}\p{M}*

Example:

\X matches a grapheme
\P{M}\p{M}* matches a base character followed by zero or more combining marks

Matching Specific Code Points

To match a specific Unicode code point, use:

JavaScript, .NET, Java: \uFFFF (FFFF is the hexadecimal code point)
Perl, PCRE: \x{FFFF}

Unicode Character Properties

Unicode defines properties that categorize characters based on their type. You can match characters belonging to specific categories using:

Positive Match: \p{Property}
Negative Match: \P{Property}

Common Properties:

\p{L} - Letter
\p{Lu} - Uppercase Letter
\p{Ll} - Lowercase Letter
\p{N} - Number
\p{P} - Punctuation
\p{S} - Symbol
\p{Z} - Separator
\p{C} - Other (control characters, format characters, unassigned code points, and similar)

Unicode Scripts and Blocks

Unicode groups characters into scripts and blocks:

Scripts: Collections of characters used by a particular language or writing system.
Blocks: Contiguous ranges of code points.

Example Scripts:

\p{Latin}
\p{Greek}
\p{Cyrillic}

Example Blocks:

\p{InBasic_Latin}
\p{InGreek_and_Coptic}
\p{InCyrillic}

Best Practices for Unicode Regex

Use \X to match graphemes when supported.
Be aware that the same character can be represented by different code point sequences.
Normalize input to avoid mismatches between those representations.
Use Unicode properties to match character categories.
Use scripts and blocks to match specific writing systems.


Named Capturing Groups (Page 15)

By Jessica Brown
January 9
Tutorials

Named capturing groups allow you to assign names to capturing groups, making it easier to reference them in complex regular expressions. This feature is available in most modern regular expression engines.

Why Use Named Capturing Groups?

In traditional regular expressions, capturing groups are referenced by their numbers (e.g., \1, \2). As the number of groups increases, it becomes harder to manage and understand which group corresponds to which part of the match. Named capturing groups solve this problem by allowing you to reference groups by descriptive names.

Example (Traditional):

(\d{4})-(\d{2})-(\d{2})

In this pattern, you would reference the year as \1, the month as \2, and the day as \3.

Example (Named):

(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})

Now, you can reference the year as year, the month as month, and the day as day, making the regex more readable and maintainable.

Named Capture Syntax by Flavor

Python, PCRE, and PHP

These flavors use the following syntax for named capturing groups:

(?P<name>group)

To reference the named group inside the regex, use:

(?P=name)

To reference it in replacement text in Python, use:

\g<name>

Example:

(?P<word>\w+)\s+(?P=word)

This pattern matches doubled words like "the the".
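A short Python sketch of this syntax follows. The sample date and sentence are invented for illustration; the group names come from the patterns above.

```python
import re

date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Named groups can be read by name after a match.
m = date_re.search("Released on 2024-01-09.")
parts = m.groupdict()  # {'year': '2024', 'month': '01', 'day': '09'}

# \g<name> refers to a named group in the replacement text.
reordered = date_re.sub(r"\g<day>/\g<month>/\g<year>", "2024-01-09")  # '09/01/2024'

# (?P=word) refers back to the named group inside the pattern.
doubled = re.search(r"(?P<word>\w+)\s+(?P=word)", "Paris in the the spring")
print(doubled.group())  # the the
```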

.NET Framework

The .NET regex engine uses its own syntax for named capturing groups:

(?<name>group) or (?'name'group)

To reference the named group inside the regex, use:

\k<name> or \k'name'

In replacement text, use:

${name}

Example:

(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})

This pattern matches a date in YYYY-MM-DD format. You can reference the named groups in replacement text like:

${year}/${month}/${day}

Multiple Groups with the Same Name

In the .NET framework, you can have multiple capturing groups with the same name. This is useful when you have different patterns that should capture the same kind of data.

Example:

a(?<digit>[0-5])|b(?<digit>[4-7])

In this pattern, both groups are named digit. The capturing group will contain the matched digit, regardless of which alternative was matched.

Note:

Python does not allow multiple groups with the same name, and PCRE rejects them unless duplicate names are explicitly enabled. Otherwise, attempting to do so results in a compilation error.
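The compilation error can be observed in Python with a small sketch; the pattern is the one above, rewritten in Python's (?P<name>...) syntax.

```python
import re

try:
    re.compile(r"a(?P<digit>[0-5])|b(?P<digit>[4-7])")
    duplicate_allowed = True
except re.error as exc:
    # Python refuses to compile a pattern that defines the same name twice.
    duplicate_allowed = False
    message = str(exc)

print(duplicate_allowed)  # False
```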

Numbering of Named Groups

The way capturing groups are numbered varies between regex flavors:

Python and PCRE

Both named and unnamed capturing groups are numbered from left to right.

(a)(?P<x>b)(c)(?P<y>d)

In this pattern:

Group 1: (a)
Group 2: (?P<x>b)
Group 3: (c)
Group 4: (?P<y>d)

In replacement text, you can reference these groups as \1, \2, \3, and \4.
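In Python, the numbering described above can be checked through the compiled pattern's groupindex mapping. This is only a sketch of the Python behavior, not of PCRE.

```python
import re

numbered = re.compile(r"(a)(?P<x>b)(c)(?P<y>d)")

# groupindex maps each group name to its left-to-right number.
index_by_name = dict(numbered.groupindex)  # {'x': 2, 'y': 4}

m = numbered.fullmatch("abcd")
same_group = m.group(2) == m.group("x")    # True: both refer to "b"

# Numbered references in replacement text follow the same order.
swapped = numbered.sub(r"\4\3\2\1", "abcd")  # 'dcba'
```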

.NET Framework

The .NET framework handles named groups differently. Named groups are numbered after all unnamed groups.

(a)(?<x>b)(c)(?<y>d)

In this pattern:

Group 1: (a)
Group 2: (c)
Group 3: (?<x>b)
Group 4: (?<y>d)

In replacement text, you would reference the groups as:

$1 for (a)
$2 for (c)
$3 for (?<x>b)
$4 for (?<y>d)

To avoid confusion, it’s best to reference named groups by their names rather than their numbers in the .NET framework.

Best Practices

To ensure compatibility across different regex flavors and avoid confusion, follow these best practices:

Do not mix named and unnamed groups. Use either all named groups or all unnamed groups.
Use non-capturing groups for parts of the regex that don’t need to be captured:

(?:group)

Use descriptive names for capturing groups to make your regex more readable.

JGsoft Engine

The JGsoft regex engine (used in tools like EditPad Pro and PowerGREP) supports both Python-style and .NET-style named capturing groups.

Python-style named groups are numbered along with unnamed groups.
.NET-style named groups are numbered after unnamed groups.
Multiple groups with the same name are allowed.

Summary

Named capturing groups make regular expressions more readable and maintainable. Different regex flavors have varying syntaxes and behaviors for named groups. To write portable and efficient regex patterns:

Use named groups to improve readability.
Avoid mixing named and unnamed groups.
Use non-capturing groups when capturing is unnecessary.

By understanding how different regex engines handle named groups, you can write more robust and compatible regex patterns across various programming languages and tools.


Grouping with Round Brackets (Page 14)

By Jessica Brown
January 9
Tutorials

In regular expressions, round brackets (()) are used for grouping. Grouping allows you to apply operators to multiple tokens at once. For example, you can make an entire group optional or repeat the entire group using repetition operators.

Basic Usage

For example:

Set(Value)?

This pattern matches:

"Set"
"SetValue"

The round brackets group "Value", and the question mark makes it optional.

Note:

Square brackets ([]) define character classes.
Curly braces ({}) specify repetition counts.
Only round brackets (()) are used for grouping.

Backreferences

Round brackets not only group parts of a regex but also create backreferences. A backreference stores the text matched by the group, allowing you to reuse it later in the regex or replacement text.

Example:

Set(Value)?

If "SetValue" is matched, the backreference \1 will contain "Value". If only "Set" is matched, the backreference will be empty.

To prevent creating a backreference, use non-capturing parentheses:

Set(?:Value)?

The (?: ... ) syntax disables capturing, making the regex more efficient when backreferences are not needed.
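The difference between the two forms can be sketched in Python. Note that Python reports a group that did not take part in the match as None rather than as an empty string.

```python
import re

capturing = re.compile(r"Set(Value)?")
non_capturing = re.compile(r"Set(?:Value)?")

with_value = capturing.fullmatch("SetValue").group(1)  # 'Value'
without_value = capturing.fullmatch("Set").group(1)    # None

# .groups is the number of capturing groups in each pattern.
group_counts = (capturing.groups, non_capturing.groups)  # (1, 0)
```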

Using Backreferences in Replacement Text

Backreferences are often used in search-and-replace operations. The exact syntax for using backreferences in replacement text varies between tools and programming languages.

For example, in many tools:

\1 refers to the first capturing group.
\2 refers to the second capturing group, and so on.

In replacement text, you can use these backreferences to reinsert matched text:

Find:  (\w+)\s+\1
Replace:  \1

This pattern finds doubled words like "the the" and replaces them with a single instance.

Using Backreferences in the Regex

Backreferences can also be used within the regex itself to match the same text again.

Example:

<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>

This pattern matches an HTML tag and its corresponding closing tag. The opening tag name is captured in the first backreference, and \1 is used to ensure the closing tag matches the same name.

Numbering Backreferences

Backreferences are numbered based on the order of opening brackets in the regex:

The first opening bracket creates backreference \1.
The second opening bracket creates backreference \2.

Non-capturing groups do not count toward the numbering.

Example:

([a-c])x\1x\1

This pattern matches:

"axaxa"
"bxbxb"
"cxcxc"

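If an optional group matches nothing, as in (q?), the backreference is empty and the regex still works. A group that does not participate in the match at all behaves differently, as described under Backreferences to Failed Groups.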

Looking Inside the Regex Engine

Let’s see how the regex engine processes the following pattern:

<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>

when applied to the string:

Testing <B><I>bold italic</I></B> text

The engine matches <B> and stores "B" in the first backreference.
It skips over the text until it finds the closing </B>.
The backreference \1 ensures the closing tag matches the same name as the opening tag.
The entire match is <B><I>bold italic</I></B>.
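The walk-through can be reproduced with a few lines of Python, using the same pattern and test string:

```python
import re

tag_re = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>")
m = tag_re.search("Testing <B><I>bold italic</I></B> text")

whole_match = m.group(0)  # '<B><I>bold italic</I></B>'
tag_name = m.group(1)     # 'B'
```

The lazy .*? passes over </I> because \1 holds "B", so the closing tag is only accepted once </B> is reached.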

Backreferences to Failed Groups

There’s a difference between a backreference to a group that matched nothing and one to a group that did not participate at all:

Example:

(q?)b\1

This pattern matches "b" because the optional q? matched nothing.

In contrast:

(q)?b\1

This pattern fails to match "b" because the group (q) did not participate in the match at all.

In most regex flavors, a backreference to a non-participating group causes the match to fail. However, in JavaScript, backreferences to non-participating groups match an empty string.
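Python follows the majority behavior here, which makes the difference easy to check in a short sketch:

```python
import re

# The group takes part in the match and captures an empty string.
empty_group = re.fullmatch(r"(q?)b\1", "b")

# The group is skipped entirely, so \1 has nothing to refer to.
missing_group = re.fullmatch(r"(q)?b\1", "b")

print(empty_group is not None)  # True
print(missing_group is None)    # True
```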

Forward References and Invalid References

Some modern regex flavors, like .NET, Java, and Perl, allow forward references. A forward reference is a backreference to a group that appears later in the regex.

Example:

(\2two|(one))+

This pattern matches "oneonetwo". The forward reference \2 fails at first but succeeds when the group is matched during repetition.

In most flavors, referencing a group that doesn’t exist results in an error. In JavaScript and Ruby, such references result in a zero-width match.

Repetition and Backreferences

The regex engine doesn’t permanently substitute backreferences in the regex. Instead, it uses the most recent value captured by the group.

Example:

([abc]+)=\1

This pattern matches "cab=cab".

In contrast:

([abc])+\1

This pattern does not match "cab" because the backreference holds only the last value captured by the group (in this case, "b").
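A small Python sketch of both patterns; the extra string "cabb" is added here only to show which value the group ends up holding.

```python
import re

# The quantifier is inside the group, so the group captures "cab" as a whole.
whole = re.fullmatch(r"([abc]+)=\1", "cab=cab")

# The quantifier is outside the group, so only the last iteration is kept.
last_only = re.search(r"([abc])+\1", "cab")         # None
last_repeated = re.fullmatch(r"([abc])+\1", "cabb")

print(whole.group(1))          # cab
print(last_repeated.group(1))  # b
```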

Useful Example: Checking for Doubled Words

You can use the following regex to find doubled words in a text:

\b(\w+)\s+\1\b

In your text editor, replace the doubled word with \1 to remove the duplicate.

Example:

Input: "the the cat"
Output: "the cat"
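The same search-and-replace can be sketched in Python with re.sub:

```python
import re

doubled_word = re.compile(r"\b(\w+)\s+\1\b")

# Replace each doubled word with a single copy of the captured word.
cleaned = doubled_word.sub(r"\1", "the the cat")
print(cleaned)  # the cat
```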

Limitations

Round brackets cannot be used inside character classes. For example:

[(a)b]

This pattern matches the literal characters "a", "b", "(", and ")".

Backreferences also cannot be used inside character classes. In most flavors, \1 inside a character class is treated as an octal escape sequence.

Example:

(a)[\1b]

This pattern matches "a" followed by either \x01 (an octal escape) or "b".
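Python is one of the flavors that reads \1 inside a character class as an octal escape, as this sketch suggests:

```python
import re

in_class = re.compile(r"(a)[\1b]")

matches_ab = in_class.fullmatch("ab") is not None         # True: "b" is in the class
matches_a_ctrl = in_class.fullmatch("a\x01") is not None  # True: \1 means character 0x01
matches_aa = in_class.fullmatch("aa") is not None         # False: \1 is not a backreference here
```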

Grouping with round brackets allows you to:

Apply operators to entire groups of tokens.
Create backreferences for reuse in the regex or replacement text.

Use non-capturing groups (?: ... ) to avoid creating unnecessary backreferences and improve performance. Be mindful of the limitations and differences in behavior across various regex flavors.


Repetition with Star and Plus (Page 13)

By Jessica Brown
January 9
Tutorials

In addition to the question mark, regex provides two more repetition operators: the asterisk (*) and the plus (+).

Basic Usage

The * (star) matches the preceding token zero or more times. The + (plus) matches the preceding token one or more times.

For example:

<[A-Za-z][A-Za-z0-9]*>

This pattern matches HTML tags without attributes:

<[A-Za-z] matches the first letter.
[A-Za-z0-9]* matches zero or more alphanumeric characters after the first letter.

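This regex will match tags like:

<B>
<HTML>

If you used + instead of *, the regex would require at least one alphanumeric character after the first letter, so it would match:

<HTML> but not a single-letter tag such as <B>.

Neither version matches <1>, because the first character after < must be a letter.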

Limiting Repetition

Modern regex flavors allow you to limit repetitions using curly braces ({}).

Syntax:

{min,max}

min: Minimum number of matches.
max: Maximum number of matches.

Examples:

{0,} is equivalent to *.
{1,} is equivalent to +.
{3} matches exactly three repetitions.

Example:

\b[1-9][0-9]{3}\b

This pattern matches numbers between 1000 and 9999.

\b[1-9][0-9]{2,4}\b

This pattern matches numbers between 100 and 99999.

The word boundaries (\b) ensure that only complete numbers are matched.
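Both patterns can be tried on a sample string in Python. The text below is invented for the example.

```python
import re

text = "Rooms 7, 42, 512, 1000, 9999 and 123456"

# Exactly four digits, the first one non-zero.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", text)      # ['1000', '9999']

# Three to five digits, the first one non-zero.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", text)  # ['512', '1000', '9999']
```

The six-digit number is skipped in both cases because the closing \b cannot match between two digits.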

Watch Out for Greediness!

All repetition operators (*, +, and {}) are greedy by default. This means the regex engine will try to match as much text as possible.

Example:

Consider the pattern:

<.+>

When applied to the string:

This is a <EM>first</EM> test.

You might expect it to match <EM> and </EM> separately. However, it will match <EM>first</EM> instead.

This happens because the + is greedy and matches as many characters as possible.

Looking Inside the Regex Engine

The first token in the regex is <, which matches the first < in the string.

The next token is the . (dot), which matches any character except newlines. The + causes the dot to repeat as many times as possible:

The dot matches E, then M, and so on.
It continues matching until the end of the string.
At this point, the > token fails to match because there are no more characters left.

The engine then backtracks and tries to reduce the match length until > matches the next character.

The final match is <EM>first</EM>.

Laziness Instead of Greediness

To fix this issue, make the quantifier lazy by adding a question mark (?):

<.+?>

This tells the engine to match as few characters as possible.

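The < matches the first <.
The . matches E.
The engine checks for >, which fails against M, so the dot expands by one character to match M.
The engine checks for > again and finds a match right after EM.

The final match is <EM>, which is what we intended.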

An Alternative to Laziness

Instead of using lazy quantifiers, you can use a negated character class:

<[^>]+>

This pattern matches any sequence of characters that are not >, followed by >. It avoids backtracking and improves performance.

Example:

Given the string:

This is a <EM>first</EM> test.

The regex <[^>]+> will match <EM> first, and </EM> if the search continues.

This approach is more efficient because it reduces backtracking, which can significantly improve performance in large datasets or tight loops.
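The three variants discussed in this section can be compared side by side in a short Python sketch:

```python
import re

text = "This is a <EM>first</EM> test."

greedy = re.findall(r"<.+>", text)      # ['<EM>first</EM>']
lazy = re.findall(r"<.+?>", text)       # ['<EM>', '</EM>']
negated = re.findall(r"<[^>]+>", text)  # ['<EM>', '</EM>']
```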

The *, +, and {} quantifiers control repetition in regex. They are greedy by default, but you can make them lazy by adding a question mark (?). Using negated character classes is another way to handle repetition efficiently without backtracking.


Optional Items (Page 12)

By Jessica Brown
January 9
Tutorials

The question mark (?) makes the preceding token in a regular expression optional. This means that the regex engine will try to match the token if it is present, but it won’t fail if the token is absent.

Basic Usage

For example:

colou?r

This pattern matches both "colour" and "color." The u is optional due to the question mark.

You can make multiple tokens optional by grouping them with round brackets and placing a question mark after the closing bracket:

Nov(ember)?

This regex matches both "Nov" and "November."

You can use multiple optional groups to match more complex patterns. For instance:

Feb(ruary)? 23(rd)?

This pattern matches:

"February 23rd"
"February 23"
"Feb 23rd"
"Feb 23"

Important Concept: Greediness

The question mark is a greedy operator. This means that the regex engine will first try to match the optional part. It will only skip the optional part if matching it causes the entire regex to fail.

For example:

Feb 23(rd)?

When applied to the string "Today is Feb 23rd, 2003," the engine will match "Feb 23rd" rather than "Feb 23" because it tries to match as much as possible.

You can make the question mark lazy by adding another question mark after it:

Feb 23(rd)??

In this case, the regex will match "Feb 23" instead of "Feb 23rd."
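A brief Python sketch of the greedy and lazy question mark, using the strings from this section:

```python
import re

text = "Today is Feb 23rd, 2003"

greedy = re.search(r"Feb 23(rd)?", text).group()  # 'Feb 23rd'
lazy = re.search(r"Feb 23(rd)??", text).group()   # 'Feb 23'

# The optional u lets one pattern cover both spellings.
spellings = re.findall(r"colou?r", "colour or color")  # ['colour', 'color']
```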

Looking Inside the Regex Engine

Let’s see how the regex engine processes the pattern:

colou?r

when applied to the string "The colonel likes the color green."

The engine starts by matching the literal c with the c in "colonel."
It continues matching o, l, and o.
It then tries to match u, but fails when it reaches n in "colonel."
The question mark makes u optional, so the engine skips it and moves to r.
r does not match n, so this attempt fails. The engine moves on and retries the pattern at each following position in the string.

The engine eventually matches color in "color green." It matches the entire word because the u was skipped, and the remaining characters matched successfully.

Summary

The question mark is a versatile operator that allows you to make parts of a regex optional. It is greedy by default, but you can make it lazy by using ??. Understanding how the regex engine processes optional items is essential for creating efficient and accurate patterns.


Alternation with the Vertical Bar or Pipe Symbol (Page 11)

By Jessica Brown
January 9
Tutorials

Previously, we explored how character classes allow you to match a single character out of several possible options. Alternation, on the other hand, enables you to match one of several possible regular expressions.

The vertical bar or pipe symbol (|) is used for alternation. It acts as an OR operator within a regex.

Basic Syntax

To search for either "cat" or "dog," use the pattern:

cat|dog

You can add more options as needed:

cat|dog|mouse|fish

The regex engine will match any of these options. For example:

Regex	String	Matches
cat|dog|mouse|fish	"I have a cat and a dog"	✅ Yes
cat|dog|mouse|fish	"I have a fish"	✅ Yes

Precedence and Grouping

The alternation operator has the lowest precedence among all regex operators. This means the regex engine will try to match everything to the left or right of the vertical bar. If you need to control the scope of the alternation, use round brackets (()) to group expressions.

Example:

Without grouping:

\bcat|dog\b

This regex will match:

A word boundary followed by "cat"
"dog" followed by a word boundary

With grouping:

\b(cat|dog)\b

This regex will match:

A word boundary, then either "cat" or "dog," followed by another word boundary.

Regex	String	Matches
\bcat|dog\b	"I saw a cat dog"	✅ Yes
\b(cat|dog)\b	"I saw a cat dog"	✅ Yes

Understanding Regex Engine Behavior

The regex engine is eager, meaning it stops searching as soon as it finds a valid match. The order of alternatives matters.

Consider the pattern:

Get|GetValue|Set|SetValue

When applied to the string "SetValue," the engine will:

Try to match Get, but fail.
Try GetValue, but fail.
Match Set and stop.

The result is that the engine matches "Set," but not "SetValue." This happens because the engine found a valid match early and stopped.

Solutions to Eagerness

There are several ways to address this behavior:

1. Change the Order of Options

By changing the order of options, you can ensure longer matches are attempted first:

GetValue|Get|SetValue|Set

This way, "SetValue" will be matched before "Set."

2. Use Optional Groups

You can combine related options and use ? to make parts of them optional:

Get(Value)?|Set(Value)?

This pattern ensures "GetValue" is matched before "Get," and "SetValue" before "Set."

3. Use Word Boundaries

To ensure you match whole words only, use word boundaries:

\b(Get|GetValue|Set|SetValue)\b

Alternatively, use:

\b(Get(Value)?|Set(Value)?)\b

Or even better:

\b(Get|Set)(Value)?\b

This pattern is more efficient and concise.
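The eager behavior and the fixes above can be checked with a short Python sketch. Python's engine is one of the eager, order-sensitive kind described in this section.

```python
import re

subject = "SetValue"

eager = re.search(r"Get|GetValue|Set|SetValue", subject).group()      # 'Set'
reordered = re.search(r"GetValue|Get|SetValue|Set", subject).group()  # 'SetValue'
optional = re.search(r"Get(Value)?|Set(Value)?", subject).group()     # 'SetValue'

# With word boundaries the original order works too: after "Set" the closing
# \b fails, so the engine backtracks into the next alternatives.
bounded = re.search(r"\b(Get|GetValue|Set|SetValue)\b", subject).group()  # 'SetValue'
concise = re.search(r"\b(Get|Set)(Value)?\b", subject).group()            # 'SetValue'
```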

POSIX Regex Behavior

Unlike most regex engines, POSIX-compliant regex engines always return the longest possible match, regardless of the order of alternatives. In a POSIX engine, applying Get|GetValue|Set|SetValue to "SetValue" will return "SetValue," not "Set." This behavior is due to the POSIX standard, which prioritizes the longest match.

Summary

Alternation is a powerful feature in regex that allows you to match one of several possible patterns. However, due to the eager behavior of most regex engines, it’s essential to order your alternatives carefully and use grouping to ensure accurate matches. By understanding how the engine processes alternation, you can write more effective and optimized regex patterns.


Word Boundaries (Page 10)

By Jessica Brown
January 9
Tutorials

The \b metacharacter is an anchor, similar to the caret (^) and dollar sign ($). It matches a zero-length position called a word boundary. Word boundaries allow you to perform “whole word” searches in a string using patterns like \bword\b.

What is a Word Boundary?

A word boundary occurs at three possible positions in a string:

Before the first character if it is a word character.
After the last character if it is a word character.
Between two characters where one is a word character and the other is a non-word character.

A word character includes letters, digits, and the underscore ([a-zA-Z0-9_]). Non-word characters are everything else.

Example Usage

The pattern \bword\b matches the word "word" only if it appears as a standalone word in the text.

Regex	String	Matches
`\b4\b`	"There are 44 sheets"	No
`\b4\b`	"Sheet number 4 is here"	Yes

Digits are considered word characters, so \b4\b will match a standalone "4" but not when it is part of "44."

Negated Word Boundaries

The \B metacharacter is the negated version of \b. It matches any position that is not a word boundary.

Regex	String	Matches
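`\Bis\B`	"This is a test"	No
`\Bis\B`	"The list is long"	Yes

\Bis\B matches "is" only when it sits strictly inside a word, with word characters on both sides, as in "list." It does not match a standalone "is," nor the "is" at the start of "island" or the end of "This," because each of those touches a word boundary.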
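Both \b and \B can be tried out in Python. The sentence "The list is long" is an extra example chosen here because "list" contains "is" with word characters on both sides.

```python
import re

standalone_4 = re.search(r"\b4\b", "Sheet number 4 is here")  # matches the lone 4
inside_44 = re.search(r"\b4\b", "There are 44 sheets")        # None

inner_is = re.search(r"\Bis\B", "The list is long")  # matches inside "list"
word_is = re.search(r"\Bis\B", "This is a test")     # None

print(inner_is.span())  # (5, 7)
```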

Looking Inside the Regex Engine

Let’s see how the regex \bis\b works on the string "This island is beautiful":

The engine starts with \b at the first character "T." Since \b is zero-width, it checks the position before "T." It matches because "T" is a word character, and the position before it is the start of the string.
The engine then checks the next token, i, which does not match "T," so it moves to the next position.
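The engine continues checking. The "is" in "This" fails on the leading \b, and the "is" in "island" fails on the trailing \b. The standalone "is" succeeds: the final \b matches at the position between "is" and the following space, confirming a complete match.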

Tcl Word Boundaries

Most regex flavors use \b for word boundaries. However, Tcl uses different syntax:

\y matches a word boundary.
\Y matches a non-word boundary.
\m matches only the start of a word.
\M matches only the end of a word.

For example, in Tcl:

\mword\M matches "word" as a whole word.

In most flavors, you can achieve the same with \bword\b.

Emulating Tcl Word Boundaries

If your regex flavor supports lookahead and lookbehind, you can emulate Tcl’s \m and \M:

(?<!\w)(?=\w): Emulates \m.
(?<=\w)(?!\w): Emulates \M.

For flavors without lookbehind, use:

\b(?=\w) to emulate \m.
\b(?!\w) to emulate \M.
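As a sanity check, the sketch below lists the positions each emulation matches in Python, which has \b and lookaround but no \m or \M. The sample text is invented.

```python
import re

text = "word sword words"

# Start-of-word positions (Tcl \m), with and without lookbehind.
starts_lookbehind = [m.start() for m in re.finditer(r"(?<!\w)(?=\w)", text)]
starts_boundary = [m.start() for m in re.finditer(r"\b(?=\w)", text)]

# End-of-word positions (Tcl \M), with and without lookbehind.
ends_lookbehind = [m.start() for m in re.finditer(r"(?<=\w)(?!\w)", text)]
ends_boundary = [m.start() for m in re.finditer(r"\b(?!\w)", text)]

print(starts_lookbehind)  # [0, 5, 11]
print(ends_lookbehind)    # [4, 10, 16]
```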

GNU Word Boundaries

GNU extensions to POSIX regular expressions support \b and \B. Additionally, GNU regex introduces:

\<: Matches the start of a word (like Tcl’s \m).
\>: Matches the end of a word (like Tcl’s \M).

These additional tokens provide flexibility when working with word boundaries in GNU-based tools.

Summary

Word boundaries are crucial for identifying standalone words in text. They prevent partial matches within larger words and ensure more precise regex patterns. Understanding how to use \b, \B, and their equivalents in various regex flavors will help you craft better, more accurate regular expressions.


Start of String and End of String Anchors (Page 9)

By Jessica Brown
January 9
Tutorials

In previous sections, we explored how literal characters and character classes operate in regular expressions. These match specific characters in a string. Anchors, however, are different. They match positions in the string rather than characters, allowing you to "anchor" your regex to the start or end of a string or line.

Using the Caret (`^`) Anchor

The caret (^) matches the position before the first character of the string. For example:

^a applied to "abc" matches "a."
^b does not match "abc" because "b" is not the first character of the string.

The caret is useful when you want to ensure that a match occurs at the very beginning of a string.

Example:

Regex	String	Matches
`^a`	"abc"	Yes
`^b`	"abc"	No

Using the Dollar Sign (`$`) Anchor

The dollar sign ($) matches the position after the last character of the string. For example:

c$ matches "c" in "abc."
a$ does not match "abc" because "a" is not the last character.

Example:

Regex	String	Matches
`c$`	"abc"	Yes
`a$`	"abc"	No

Practical Use Cases

Anchors are essential for validating user input. For instance, if you want to ensure a user inputs only an integer number, using \d+ will accept any input containing digits, even if it includes letters (e.g., "abc123").

Instead, use ^\d+$ to enforce that the entire string consists only of digits from start to finish.

Example in Perl:

if ($input =~ /^\d+$/) {
    print "Valid integer";
} else {
    print "Invalid input";
}

To handle potential leading or trailing whitespace, use:

^\s+ to match leading whitespace.
\s+$ to match trailing whitespace.

In Perl, you can trim whitespace like this:

$input =~ s/^\s+|\s+$//g;
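For comparison, here is a rough Python equivalent of the validation and trimming steps. The input strings are placeholders.

```python
import re

# Without anchors, any string that contains digits is accepted.
loose = re.search(r"\d+", "abc123") is not None     # True

# With anchors, the digits must span the whole string.
strict = re.search(r"^\d+$", "abc123") is not None  # False
valid = re.search(r"^\d+$", "123") is not None      # True

# re.sub replaces every match, so no extra "global" flag is needed.
trimmed = re.sub(r"^\s+|\s+$", "", "  42 \t")       # '42'
```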

Multi-Line Mode

If your string contains multiple lines, you might want to match the start or end of each line instead of the entire string. Multi-line mode changes the behavior of the anchors:

^ matches at the start of each line.
$ matches at the end of each line.

Example:

Given the string:

first line
second line

^s matches "s" in "second line" when multi-line mode is enabled.

Activating Multi-Line Mode

In Perl, use the m flag:

m/^regex$/m;

In .NET, specify RegexOptions.Multiline:

Regex.Match("string", "regex", RegexOptions.Multiline);

In tools like EditPad Pro, GNU Emacs, and PowerGREP, multi-line mode is enabled by default.

Permanent Start and End Anchors

The anchors \A, \Z, and \z match the start and end of the string regardless of multi-line mode:

\A: Matches only at the start of the string.
\Z: Matches only at the end of the string, or before a line break that ends the string.
\z: Matches only at the very end of the string, so never before a trailing line break.

For example:

Regex	String	Matches
`\Aabc`	"abc"	Yes
`abc\Z`	"abc\n"	Yes
`abc\z`	"abc\n"	No

Some regex flavors, like JavaScript, POSIX, and XML, do not support \A and \Z. In such cases, use the caret (^) and dollar sign ($) instead.

Zero-Length Matches

Anchors match positions rather than characters, resulting in zero-length matches. For example:

^ matches the start of a string.
$ matches the end of a string.

Example:

Using ^\d*$ to validate a number will accept an empty string. The star allows zero digits, so in an empty string ^ and $ both match at the same position and the overall match is zero-length.

To avoid this, ensure your regex accounts for actual input:

^\d+$

Adding a Prefix to Each Line

In some scenarios, you may want to add a prefix to each line of a multi-line string. For example, to prepend a "> " to each line in an email reply, use multi-line mode:

Example in VB.NET:

Dim Quoted As String = Regex.Replace(Original, "^", "> ", RegexOptions.Multiline)

This regex matches the start of each line and inserts the prefix "> " without removing any characters.
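The same idea can be sketched in Python, where multi-line mode is the re.MULTILINE flag. The two-line string is the one used earlier in this section.

```python
import re

text = "first line\nsecond line"

# Without the flag, ^ only matches at the start of the string.
without_flag = re.findall(r"^s", text)                  # []
with_flag = re.findall(r"^s", text, flags=re.MULTILINE)  # ['s']

# Replacing the zero-length match at each line start inserts the prefix.
quoted = re.sub(r"^", "> ", text, flags=re.MULTILINE)
print(quoted)
# > first line
# > second line
```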

Special Cases with Line Breaks

There is an exception to how $ and \Z behave. If the string ends with a line break, $ and \Z match before the line break, not at the very end of the string.

For example:

The string "joe\n" will match ^[a-z]+$ and \A[a-z]+\Z.
However, \A[a-z]+\z will not match because \z requires the match to be at the very end of the string, including after the newline.

Use \z to ensure a match at the absolute end of the string.
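The trailing line break case can be observed in Python, with one caveat: Python's \Z matches only at the very end of the string, so it behaves like the \z described here rather than like Perl's \Z.

```python
import re

subject = "joe\n"

# $ is allowed to match just before a final line break.
dollar = re.search(r"^[a-z]+$", subject) is not None        # True

# Python's \Z is strict, like \z in the text above.
strict_end = re.search(r"\A[a-z]+\Z", subject) is not None  # False
```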

Looking Inside the Regex Engine

Let’s see what happens when we apply ^4$ to the string:

749
486
4

In multi-line mode, the regex engine processes the string as follows:

The engine starts at the first character, "7". The ^ matches the position before "7", but the literal 4 does not match "7", so this attempt fails.
The engine advances to the "4" in "749", where ^ cannot match because the position is not preceded by a newline.
The process continues until the engine reaches the final "4", which is preceded by a newline.
The ^ matches the position before "4", and the engine successfully matches 4.
The engine attempts to match $ at the position after "4", and it succeeds because it is the end of the string.

The regex engine reports the match as "4" at the end of the string.

Caution for Programmers

When working with anchors, be mindful of zero-length matches. For example, $ can match the position after the last character of the string. Querying for String[Regex.MatchPosition] may result in an access violation or segmentation fault if the match position points to the void after the string. Handle these cases carefully in your code.


The Dot Matches (Almost) Any Character (Page 8)

By Jessica Brown
January 9
Tutorials

The dot, or period, is one of the most versatile and commonly used metacharacters in regular expressions. However, it is also one of the most misused.

The dot matches any single character except for newline characters. In most regex flavors discussed in this tutorial, the dot does not match newlines by default. This behavior stems from the early days of regex when tools were line-based and processed text line by line. In such cases, the text would not contain newline characters, so the dot could safely match any character.

In modern tools, you can enable an option to make the dot match newline characters as well. For example, in tools like RegexBuddy, EditPad Pro, or PowerGREP, you can check a box labeled "dot matches newline."

Single-Line Mode

In Perl, the mode that makes the dot match newline characters is called single-line mode. You can activate this mode by adding the s flag to the regex, like this:

m/^regex$/s;

Other languages and regex libraries, such as the .NET framework, have adopted this terminology. In .NET, you can enable single-line mode by using the RegexOptions.Singleline option:

Regex.Match("string", "regex", RegexOptions.Singleline);

In most programming languages and libraries, enabling single-line mode only affects the behavior of the dot. It has no impact on other aspects of the regex.

However, some languages, such as VBScript and JavaScript before ECMAScript 2018 added the s flag, do not have a built-in option to make the dot match newlines. In such cases, you can use a character class like [\s\S] to achieve the same effect. This class matches any character that is either whitespace or non-whitespace, effectively matching any character.
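In Python, single-line mode is the re.DOTALL flag. The sketch below compares the default dot, the flag, and the [\s\S] workaround on a made-up two-line string.

```python
import re

text = "first\nsecond"

default_dot = re.search(r"first.second", text) is not None                # False
dotall = re.search(r"first.second", text, flags=re.DOTALL) is not None    # True
any_char_class = re.search(r"first[\s\S]second", text) is not None        # True
```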

Use The Dot Sparingly

The dot is a powerful metacharacter that can make your regex very flexible. However, it can also lead to unintended matches if not used carefully. It is easy to write a regex with a dot and find that it matches more than you intended.

Consider the following example:

If you want to match a date in mm/dd/yy format, you might start with the regex:

\d\d.\d\d.\d\d

This regex appears to work at first glance, as it matches "02/12/03". However, it also matches "02512703", where the dots match digits instead of separators.

A better solution is to use a character class to specify valid date separators:

\d\d[- /.]\d\d[- /.]\d\d

This regex matches dates with dashes, spaces, dots, or slashes as separators. Note that the dot inside a character class is treated as a literal character, so it does not need to be escaped.

This regex is still not perfect, as it will match "99/99/99". To improve it further, you can use:

[0-1]\d[- /.][0-3]\d[- /.]\d\d

This regex restricts the first digit of the month to 0 or 1 and the first digit of the day to 0 through 3. That rules out "99/99/99", but it still accepts impossible dates such as "19/39/99". How perfect your regex needs to be depends on your use case. If you are validating user input, the regex must be precise. If you are parsing data files from a known source, a less strict regex might be sufficient.
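To compare the three date patterns side by side, here is a minimal sketch in Python (used only for illustration). The variable names are invented for this example, and "19/39/99" is an extra sample added to show what the strictest pattern still lets through.

```python
import re

# The three patterns from the text, from loosest to strictest.
loose = re.compile(r"\d\d.\d\d.\d\d")
separators = re.compile(r"\d\d[- /.]\d\d[- /.]\d\d")
ranged = re.compile(r"[0-1]\d[- /.][0-3]\d[- /.]\d\d")

samples = ["02/12/03", "02512703", "99/99/99", "19/39/99"]

results = {}
for sample in samples:
    # fullmatch requires the whole sample to match the pattern.
    results[sample] = [
        bool(pattern.fullmatch(sample))
        for pattern in (loose, separators, ranged)
    ]
    print(sample, results[sample])
```

Each row shows whether the loose, separator-aware, and range-limited patterns accept the sample, in that order.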

Use Negated Character Sets Instead of the Dot

Using the dot can sometimes result in overly broad matches. Instead, consider using negated character sets to specify what characters you do not want to match.

For example, to match a double-quoted string, you might be tempted to use:

".*"

At first, this regex seems to work well, matching "string" in:

Put a "string" between double quotes.

However, if you apply it to:

Houston, we have a problem with "string one" and "string two". Please respond.

The regex will match:

"string one" and "string two"

This is not what you intended. The dot matches any character, including double quotes, and the star (*) quantifier is greedy, so the match runs from the first double quote all the way to the last one on the line.

To fix this, use a negated character set instead of the dot:

"[^"]*"

This regex matches any sequence of characters that are not double quotes, enclosed within double quotes. If you also want to prevent matching across multiple lines, use:

"[^"\r\n]*"

This regex ensures that the match does not include newline characters.

By using negated character sets instead of the dot, you can make your regex patterns more precise and avoid unintended matches.
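The difference between the greedy dot and the negated character set can be checked with a short Python sketch (Python is used here only for illustration; the two-line sample string is made up).

```python
import re

text = 'Houston, we have a problem with "string one" and "string two". Please respond.'

# The greedy dot runs from the first quote to the last quote.
greedy = re.findall(r'".*"', text)

# The negated set stops at the next double quote.
negated = re.findall(r'"[^"]*"', text)

# A quoted string that spans a line break (made-up sample).
two_lines = '"one\ntwo"'
across_lines = re.findall(r'"[^"]*"', two_lines)
within_line = re.findall(r'"[^"\r\n]*"', two_lines)

print(greedy)        # ['"string one" and "string two"']
print(negated)       # ['"string one"', '"string two"']
print(across_lines)  # ['"one\ntwo"']
print(within_line)   # []
```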

Character Classes or Character Sets (Page 7)

By Jessica Brown
January 9
Tutorials

Character classes, also known as character sets, allow you to define a set of characters that a regex engine should match at a specific position in the text. To create a character class, place the desired characters between square brackets. For instance, to match either an a or an e, use the pattern [ae]. This can be particularly useful when dealing with variations in spelling, such as in the regex gr[ae]y, which will match both "gray" and "grey."

Key Points About Character Classes:

A character class matches only a single character.
The order of characters inside a character class does not affect the outcome.

For example, gr[ae]y will not match "graay" or "graey," as the class only matches one character from the set at a time.

Using Ranges in Character Classes

You can specify a range of characters within a character class by using a hyphen (-). For example:

[0-9] matches any digit from 0 to 9.
[a-fA-F] matches any letter from a to f, regardless of case.

You can also combine multiple ranges and individual characters within a character class:

[0-9a-fxA-FX] matches any hexadecimal digit or the letter X.

Again, the order of characters inside the class does not matter.

Useful Applications of Character Classes

Here are some practical use cases for character classes:

sep[ae]r[ae]te: Matches "separate" as well as common misspellings such as "seperate."
li[cs]en[cs]e: Matches "license" or "licence."
[A-Za-z_][A-Za-z_0-9]*: Matches identifiers in programming languages.
0[xX][A-Fa-f0-9]+: Matches C-style hexadecimal numbers.
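A few of these patterns can be tried out with the Python sketch below. Python is used only for illustration, and the sample strings are made up.

```python
import re

# gr[ae]y matches exactly one character from the class.
spellings = re.findall(r"gr[ae]y", "gray grey graay graey")

# C-style hexadecimal numbers; a bare "0x" has no digits and is skipped.
hex_numbers = re.findall(r"0[xX][A-Fa-f0-9]+", "mask = 0xFF & 0X1a; n = 0x")

# Identifiers: a letter or underscore, then letters, digits, or underscores.
identifiers = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", "_tmp1 = count2 + 3x")

print(spellings)    # ['gray', 'grey']
print(hex_numbers)  # ['0xFF', '0X1a']
print(identifiers)  # ['_tmp1', 'count2', 'x']
```

Note that the identifier pattern picks up the "x" in "3x" because nothing in the pattern forbids a preceding digit. Word boundaries, covered later in the tutorial, address that.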

Negated Character Classes

By adding a caret (^) immediately after the opening square bracket, you create a negated character class. This instructs the regex engine to match any character not in the specified set.

For example:

q[^u]: Matches a q followed by any character except u.

However, it’s essential to remember that a negated character class still requires a character to follow the initial match. For instance, q[^u] will match the q and the space in "Iraq is a country," but it will not match the q in "Iraq" by itself.

To ensure that the q is not followed by a u, use negative lookahead: q(?!u). We will cover lookaheads later in this tutorial.
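The difference between the negated class and the lookahead shows up in a small Python sketch (Python chosen only for illustration):

```python
import re

# The negated class consumes the character after q, so the space is part of the match.
with_space = re.search(r"q[^u]", "Iraq is a country").group()

# At the end of the string there is no character left for [^u] to match.
at_end = re.search(r"q[^u]", "Iraq")

# Negative lookahead only checks what follows; it consumes nothing.
lookahead = re.search(r"q(?!u)", "Iraq").group()

print(repr(with_space))  # 'q '
print(at_end)            # None
print(repr(lookahead))   # 'q'
```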

Metacharacters Inside Character Classes

Inside character classes, most metacharacters lose their special meaning. However, a few characters retain their special roles:

Closing bracket (])
Backslash (\)
Caret (^) (only if it appears immediately after the opening bracket)
Hyphen (-) (only if placed between characters to specify a range)

To include these characters as literals:

Backslash (\) must be escaped with another backslash, as in [\\].
Caret (^) can appear anywhere except right after the opening bracket.
Closing bracket (]) can be placed right after the opening bracket or caret in most flavors. JavaScript is an exception: there, escape it as \] instead.
Hyphen (-) can be placed at the start or end of the class.

Examples:

[x^] matches x or ^.
[]x] matches ] or x.
[^]x] matches any character that is not ] or x.
[-x] matches x or -.
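These placement rules can be verified with the Python sketch below. It assumes Python's re module, which accepts an unescaped ] directly after the opening bracket; other flavors may differ. The sample strings are made up.

```python
import re

caret_literal = re.findall(r"[x^]", "x^y")      # caret not in first position
bracket_literal = re.findall(r"[]x]", "a]x")    # ] right after the opening bracket
negated_bracket = re.findall(r"[^]x]", "a]x")   # ] right after the caret
hyphen_literal = re.findall(r"[-x]", "a-x")     # hyphen at the start of the class
backslash_literal = re.findall(r"[\\]", r"a\b")  # backslash escaped with a backslash

print(caret_literal)      # ['x', '^']
print(bracket_literal)    # [']', 'x']
print(negated_bracket)    # ['a']
print(hyphen_literal)     # ['-', 'x']
print(backslash_literal)  # ['\\']
```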

Shorthand Character Classes

Shorthand character classes are predefined character sets that simplify your regex patterns. Here are the most common shorthand classes:

Shorthand	Meaning	Equivalent Character Class
\d	Any digit	[0-9]
\w	Any word character	[A-Za-z0-9_]
\s	Any whitespace character	[ \t\r\n]

Details:

\d matches digits from 0 to 9.
\w includes letters, digits, and underscores.
\s matches spaces, tabs, and line breaks. In some flavors, it may also include form feeds and vertical tabs.

The characters included in these shorthand classes may vary depending on the regex flavor. For example:

JavaScript treats \d and \w as ASCII-only but includes Unicode characters for \s.
XML handles \d and \w as Unicode but limits \s to ASCII characters.
Python allows you to control what the shorthand classes match using specific flags.

Shorthand character classes can be used both inside and outside of square brackets:

\s\d matches a whitespace character followed by a digit.
[\s\d] matches a single character that is either whitespace or a digit.

For instance, when applied to the string "1 + 2 = 3":

\s\d matches the space and the digit 2.
[\s\d] matches the digit 1.

The shorthand [\da-fA-F] matches a hexadecimal digit and is equivalent to [0-9a-fA-F].
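The "1 + 2 = 3" example, and the point about flags changing what the shorthands match, can be sketched in Python as follows. Python is used only for illustration; re.ASCII is its flag for restricting \d, \w, and \s to ASCII, and the Arabic-Indic digit is a sample chosen for this sketch.

```python
import re

text = "1 + 2 = 3"

# Two tokens in sequence: a whitespace character, then a digit.
sequence = re.search(r"\s\d", text).group()

# One token: a single character that is whitespace or a digit.
either = re.search(r"[\s\d]", text).group()

# In Python 3, \d on str patterns is Unicode-aware unless re.ASCII is set.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_digit = re.fullmatch(r"\d", arabic_three) is not None
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None

print(repr(sequence), repr(either))  # ' 2' '1'
print(unicode_digit, ascii_digit)    # True False
```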

Negated Shorthand Character Classes

The primary shorthand classes also have negated versions:

\D: Matches any character that is not a digit. Equivalent to [^\d].
\W: Matches any character that is not a word character. Equivalent to [^\w].
\S: Matches any character that is not whitespace. Equivalent to [^\s].

Be careful when using negated shorthand inside square brackets. For example:

[\D\S] is not the same as [^\d\s].
- [\D\S] will match any character, including digits and whitespace, because a digit is not whitespace and whitespace is not a digit.
- [^\d\s] will match any character that is neither a digit nor whitespace.
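A quick Python check (for illustration only, on a made-up three-character string) makes the difference visible:

```python
import re

text = "a 1"  # a letter, a space, and a digit

# "Not a digit OR not whitespace" is true for every character.
union_of_negations = re.findall(r"[\D\S]", text)

# "Neither a digit nor whitespace" leaves only the letter.
negated_union = re.findall(r"[^\d\s]", text)

print(union_of_negations)  # ['a', ' ', '1']
print(negated_union)       # ['a']
```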

Repeating Character Classes

You can repeat a character class using quantifiers like ?, *, or +:

[0-9]+: Matches one or more digits and can match "837" as well as "222".

If you want to repeat the matched character instead of the entire class, you need to use backreferences:

([0-9])\1+: Matches repeated digits, like "222," but not "837."
- Applied to the string "833337," this regex matches "3333."
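The contrast between repeating the class and repeating the matched character can be sketched in Python (used here only for illustration):

```python
import re

# Repeating the class: each digit may differ from the previous one.
any_digits_837 = re.fullmatch(r"[0-9]+", "837") is not None
any_digits_222 = re.fullmatch(r"[0-9]+", "222") is not None

# Repeating the matched character: \1 must equal what group 1 captured.
repeated_in_837 = re.search(r"([0-9])\1+", "837")
repeated_in_833337 = re.search(r"([0-9])\1+", "833337").group()

print(any_digits_837, any_digits_222)  # True True
print(repeated_in_837)                 # None
print(repeated_in_833337)              # 3333
```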

If you want more control over repeated matches, consider using lookahead and lookbehind assertions, which we will explore later in the tutorial.

Looking Inside the Regex Engine

As previously discussed, the order of characters inside a character class does not matter. For instance, gr[ae]y can match both "gray" and "grey."

Let’s see how the regex engine processes gr[ae]y step by step:

Given the string:

"Is his hair grey or gray?"

The engine starts at the first character and fails to match g until it reaches the 13th character.
At the 13th character, g matches.
The next token r matches the following character.
The character class [ae] gives the engine two options:
- First, it tries a, which fails.
- Then, it tries e, which matches.
The final token y matches the next character, completing the match.

The engine returns "grey" as the match result and stops searching, even though "gray" also exists in the string. This is because the regex engine is eager to report the first valid match it finds.
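The walk-through above can be reproduced with a short Python sketch (Python is an illustrative choice). Keep in mind that Python counts positions from zero, so the 13th character has index 12.

```python
import re

text = "Is his hair grey or gray?"

# search() reports only the first (leftmost) match.
first = re.search(r"gr[ae]y", text)

# findall() keeps restarting after each match and collects every one.
every = re.findall(r"gr[ae]y", text)

print(first.group(), first.start())  # grey 12
print(every)                         # ['grey', 'gray']
```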

Understanding how the regex engine processes character classes helps you write more efficient patterns and predict match results more accurately.

First Look at How a Regex Engine Works Internally (Page 6)

By Jessica Brown
January 9
Tutorials

Understanding how a regex engine processes patterns can significantly improve your ability to write efficient and accurate regular expressions. By learning the internal mechanics, you’ll be better equipped to troubleshoot and refine your regex patterns, reducing frustration and guesswork when tackling complex tasks.

Types of Regex Engines

There are two primary types of regex engines:

Text-Directed Engines (also known as DFA - Deterministic Finite Automaton)
Regex-Directed Engines (also known as NFA - Non-Deterministic Finite Automaton)

All the regex flavors discussed in this tutorial utilize regex-directed engines. This type is more popular because it supports features like lazy quantifiers and backreferences, which are not possible in text-directed engines.

Examples of Text-Directed Engines:

awk
egrep
flex
lex
MySQL
Procmail

Note: Some versions of awk and egrep use regex-directed engines.

How to Identify the Engine Type

To determine whether a regex engine is text-directed or regex-directed, you can apply a simple test using the pattern:

regex|regex not

Apply this pattern to the string "regex not":

If the result is "regex", the engine is regex-directed.
If the result is "regex not", the engine is text-directed.

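The difference lies in how the engines handle the alternation. A regex-directed engine is eager: it tries the alternatives from left to right and stops as soon as one of them produces a match, so it reports "regex" without ever trying "regex not". A text-directed engine returns the longest match it can find at that position.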
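Running this test against Python's re module, which is a backtracking (regex-directed) engine, gives the result sketched below. Python is used only for illustration, and the reordered pattern is an extra variation added for this sketch.

```python
import re

subject = "regex not"

# The first alternative succeeds, so the second is never tried.
as_written = re.search(r"regex|regex not", subject).group()

# Putting the longer alternative first changes the outcome.
reordered = re.search(r"regex not|regex", subject).group()

print(as_written)  # regex
print(reordered)   # regex not
```

With a regex-directed engine, the order of the alternatives matters, so put the longer alternative first when both could match at the same position.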

The Regex-Directed Engine Always Returns the Leftmost Match

A crucial concept to grasp is that a regex-directed engine will always return the leftmost match. This behavior is essential to understand because it affects how the engine processes patterns and determines matches.

How It Works

When applying a regex to a string, the engine starts at the first character of the string and tries every possible permutation of the regex at that position. If all possibilities fail, the engine moves to the next character and repeats the process.

For example, consider applying the pattern «cat» to the string:

"He captured a catfish for his cat."

Here’s a step-by-step breakdown:

The engine starts at the first character "H" and tries to match "c" from the pattern. This fails.
The engine moves on to "e" and then the space, failing each time, until it reaches the fourth character "c", which matches the first token of the pattern.
It then compares the next token "a" with the fifth character of the string, which is "a". This succeeds.
The engine then tries to match "t" with the sixth character, "p", but this fails.
The engine backtracks, abandons the attempt that began at the fourth character, and starts over by trying "c" at the fifth character "a", continuing the process.
Finally, at the 15th character in the string, it matches "c", then "a", and finally "t", successfully finding a match for "cat".

Key Point

The engine reports the first valid match it finds, even if a better match could be found later in the string. In this case, it matches the first three letters of "catfish" rather than the standalone "cat" at the end of the string.
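The steps above can be imitated with a deliberately simplified Python sketch. The function find_leftmost is a hypothetical helper written for this illustration: it handles literal characters only and is not how any real regex engine is implemented.

```python
import re


def find_leftmost(literal: str, text: str) -> int:
    """Try the literal at each start position, left to right.

    Returns the zero-based index of the first position where every
    character of the literal matches, or -1 if there is none.
    """
    for start in range(len(text)):
        matched = 0
        # Compare one token at a time, as in the walk-through.
        while (
            matched < len(literal)
            and start + matched < len(text)
            and text[start + matched] == literal[matched]
        ):
            matched += 1
        if matched == len(literal):
            return start  # report the first success and stop
        # Otherwise give up on this start position and move one character on.
    return -1


text = "He captured a catfish for his cat."
position = find_leftmost("cat", text)
print(position)                                   # 14, the 15th character
print(position == re.search("cat", text).start())  # True
```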

Why?

At first glance, the behavior of the regex-directed engine may seem similar to a basic text search routine. However, as we introduce more complex regex tokens, you’ll see how the internal workings of the engine have a profound impact on the matches it returns.

Understanding this behavior will help you avoid surprises and leverage the full power of regex for more effective and efficient text processing.
