RegEx in 10min – A beginner course

Introduction to Regular Expressions

Regular expressions (or “regex” for short) are a powerful tool for matching patterns in text. Regex is widely used in programming languages, text editors, search engines and command-line tools, making it a valuable asset in any programmer’s skill set. In this chapter, we will cover the basics of regex and its common terminology. We’ll also cover some of the most common use cases and applications of regex.

A regular expression is a string of characters and symbols that describes a pattern to search for in text. The pattern is defined by rules that specify which characters may appear and in what order. Regular expressions also let you capture specific parts of a larger body of text. They are supported by many programming languages, such as Python, Java, JavaScript, and Perl. The following explanations should work in most of them, since the core syntax is largely shared, although each implementation (or “flavor”) differs in some details.

Regular expressions have a wide range of uses, from data validation and text processing to pattern matching in search engines. Regex can be used to verify data before it is stored or processed, ensuring that only valid data is accepted and preventing errors from occurring due to incorrect input. Regex can also be used to search for specific patterns in text, allowing users to find words, phrases, numbers, or any other type of desired element quickly and easily. Additionally, regex can be used to capture specific elements of text and store them for later use, such as when extracting information from user-provided forms or web scraping.
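The three uses described above (validating, searching, and capturing) can be sketched with Python's standard `re` module. The sample strings and variable names are made up for illustration.

```python
import re

# 1. Validation: accept the input only if the whole string is five digits.
is_valid_zip = re.fullmatch(r"\d{5}", "90210") is not None

# 2. Searching: find every number that appears in a sentence.
numbers = re.findall(r"\d+", "Order 66 shipped 3 items on day 12")

# 3. Capturing: pull the key and the value out of a "key=value" line.
match = re.search(r"(\w+)=(\w+)", "color=blue")
key, value = match.groups()

print(is_valid_zip)  # True
print(numbers)       # ['66', '3', '12']
print(key, value)    # color blue
```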

When working with regular expressions, it is important to understand some of the basic terminology used in the context of regex. Below are some of the most commonly used terms and their definitions:

Pattern: A pattern is a set of characters that form a logical entity when organized together. Patterns can be used to match or search for text elements in a larger body of text.

Character: A character is any visible symbol, letter, number, punctuation mark, or space. Characters are the building blocks of regular expressions and are used to create patterns.

Group: A group is a part of a pattern enclosed in parentheses and treated as a single unit. A group can contain several subpatterns, and the text it matches can be captured for later use.

Subpattern: A subpattern is a smaller part of a larger pattern. It is composed of one or more characters and contributes to the overall meaning of the pattern.

Anchor: An anchor is a special character that matches a position rather than a character, such as the start (^) or end ($) of a string or line. Anchors are used to require that a pattern matches at the beginning or end of the text.

Wildcard: A wildcard is a special character that stands for any character. In regex, the dot (.) matches any single character, and combined with a quantifier such as * it can match any number of characters.

Understanding the Basics of Regular Expressions

A regular expression is built from two kinds of components. The most basic is the literal character, which is any single character that is read as-is (such as an A, B, or C). The others are known as metacharacters: symbols with a special meaning that control how the pattern is interpreted. The most commonly used metacharacters include:

Character  Name                                Description
\          Backslash                           Escapes a special character
^          Caret                               Beginning of a string
$          Dollar sign                         End of a string
.          Period or dot                       Matches any single character
|          Pipe symbol                         Matches the expression before OR after it
?          Question mark                       Matches zero or one of the previous
*          Asterisk or star                    Matches zero or more of the previous
+          Plus sign                           Matches one or more of the previous
( )        Opening and closing parenthesis     Groups characters
[ ]        Opening and closing square bracket  Matches any one character from a set or range
{ }        Opening and closing curly brace     Matches a specified number of occurrences of the previous

In addition to the above symbols, there are also shorthand sequences such as \b (word boundary), \B (non-word boundary), \d (digit), \D (non-digit), \s (whitespace), \S (non-whitespace), and so on. It is important to understand how these metacharacters interact with each other to produce a specific result.
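Here is a small sketch of a few of these sequences in Python's `re` module, where `\b` outside a character class acts as a word boundary. The sample text is invented for the demonstration.

```python
import re

text = "cat concat cats 42"

# Without boundaries, "cat" is also found inside "concat" and "cats".
anywhere = re.findall(r"cat", text)         # ['cat', 'cat', 'cat']

# \bcat\b only matches "cat" when it stands alone as a whole word.
whole_words = re.findall(r"\bcat\b", text)  # ['cat']

# \d matches a single digit.
digits = re.findall(r"\d", text)            # ['4', '2']

# \s+ matches a run of whitespace, which makes it handy for splitting.
words = re.split(r"\s+", text)              # ['cat', 'concat', 'cats', '42']
```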

The ^ symbol marks the start of the string, while the $ symbol marks the end. This is useful for checking whether a string starts or ends with a specific letter or phrase. For example, the expression “^Hello” will match only strings that start with “Hello”. The . symbol represents any single character, and combined with the * symbol (.*) it matches any number of characters. The | symbol indicates a choice between two alternatives, while [] matches any one character from a set. Lastly, () groups a set of characters into a subpattern.
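To see each of these symbols in action, here is a minimal Python sketch. The test strings are only examples, and the results in the comments are what Python's `re` module returns for them.

```python
import re

# ^ anchors the match at the start of the string.
starts_with_hello = bool(re.search(r"^Hello", "Hello, world"))  # True
hello_later = bool(re.search(r"^Hello", "Say Hello"))           # False

# $ anchors the match at the end of the string.
ends_with_txt = bool(re.search(r"\.txt$", "notes.txt"))         # True

# . matches any single character (by default, anything except a newline).
dot_matches = re.findall(r"c.t", "cat cot c-t ct")   # ['cat', 'cot', 'c-t']

# .* matches any number of characters.
between = re.search(r"a.*z", "--a to z--").group(0)  # 'a to z'

# | picks one of two alternatives; [ ] matches one character from a set.
pets = re.findall(r"cat|dog", "cats and a dog")      # ['cat', 'dog']
grays = re.findall(r"gr[ae]y", "gray, grey, groy")   # ['gray', 'grey']

# ( ) groups characters so a quantifier applies to the whole group.
repeated = re.search(r"(ab)+", "xxababab").group(0)  # 'ababab'
```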

Here is a more complete list for quick reference.

Component  Description
\w         Matches any word character (letter, digit, or underscore)
\W         Matches any non-word character
\d         Matches any digit character
\D         Matches any non-digit character
\s         Matches any whitespace character
\S         Matches any non-whitespace character
\b         Specifies a word boundary
\B         Specifies a non-word boundary
{n}        Matches exactly n occurrences of the preceding element
{n,}       Matches at least n occurrences of the preceding element
{n,m}      Matches between n and m (inclusive) occurrences of the preceding element
[^…]       Matches any single character not in the specified set

These components can be used in combination to create more complex patterns. For example, the expression \d\w{3} matches any four-character sequence made up of a digit followed by three word characters. Similarly, the expression [^A-Z]{4} matches any four-character sequence that does not contain an uppercase letter.
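The two expressions above, plus a bounded quantifier, can be tried out with a few lines of Python. The input strings are arbitrary examples.

```python
import re

# \d\w{3}: one digit followed by exactly three word characters.
codes = re.findall(r"\d\w{3}", "id 7abc and 42xy_")  # ['7abc', '42xy']

# [^A-Z]{4}: exactly four characters, none of them an uppercase letter.
no_upper = re.fullmatch(r"[^A-Z]{4}", "ab1!") is not None   # True
has_upper = re.fullmatch(r"[^A-Z]{4}", "abCd") is not None  # False

# {n,m} bounds the repetition count: here, two or three "a" characters.
counts = {
    s: re.fullmatch(r"a{2,3}", s) is not None
    for s in ["a", "aa", "aaa", "aaaa"]
}
# {'a': False, 'aa': True, 'aaa': True, 'aaaa': False}
```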

By combining these metacharacters together, it is possible to create complex and sophisticated expressions that can be used to match virtually any pattern. Understanding how these components interact together is a key part of mastering the use of regular expressions.

Examples

\d{5,8}
This pattern matches any sequence of 5 to 8 digits
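One thing to keep in mind is that the pattern is not anchored, so it can also match part of a longer run of digits. A quick Python check, using a made-up input and adding `\b` as one possible way to tighten it:

```python
import re

text = "1234 12345 1234567890"

# Unanchored, the pattern takes up to 8 digits out of a longer run.
loose = re.findall(r"\d{5,8}", text)       # ['12345', '12345678']

# With \b on both sides, only standalone runs of 5 to 8 digits match.
strict = re.findall(r"\b\d{5,8}\b", text)  # ['12345']
```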

^[A-Z].+
Matches any string that starts with an uppercase letter followed by at least one more character

^[a-zA-Z]*$
This pattern checks whether a string contains only the letters A to Z in upper or lower case. Because of the *, it also accepts an empty string
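A short Python sketch of this validation, with a handful of invented inputs. Note what happens with the empty string, and how swapping `*` for `+` changes that.

```python
import re

letters_only = re.compile(r"^[a-zA-Z]*$")

results = {
    s: letters_only.search(s) is not None
    for s in ["Hello", "Hello1", "Hello World", ""]
}
# {'Hello': True, 'Hello1': False, 'Hello World': False, '': True}

# Swapping * for + requires at least one letter, so "" is rejected.
at_least_one = re.search(r"^[a-zA-Z]+$", "") is not None  # False
```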

<.*>
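This pattern matches everything from the first < to the last > on a line, because .* is greedy. To match individual HTML tags, use <[^>]+> instead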
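It is worth checking how this pattern behaves when a line contains more than one tag, because `.*` is greedy and grabs as much as it can. The sketch below compares it with two common alternatives; the HTML snippet is just an example.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: one match that runs from the first "<" to the last ">".
greedy = re.findall(r"<.*>", html)
# ['<b>bold</b> and <i>italic</i>']

# Lazy quantifier: .*? stops at the first ">" it can.
lazy = re.findall(r"<.*?>", html)
# ['<b>', '</b>', '<i>', '</i>']

# Negated set: everything except ">" between the angle brackets.
tags = re.findall(r"<[^>]+>", html)
# ['<b>', '</b>', '<i>', '</i>']
```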

([A-Za-z0-9]+)-([A-Za-z0-9]+)
Matches two runs of alphanumeric characters joined by a hyphen and captures each run in its own group
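The parentheses in this pattern create two groups, which you can read back after a match. A minimal Python example with an invented part number:

```python
import re

match = re.search(r"([A-Za-z0-9]+)-([A-Za-z0-9]+)", "part number AB12-34cd here")

whole = match.group(0)  # 'AB12-34cd' (the entire match)
left = match.group(1)   # 'AB12'      (first group)
right = match.group(2)  # '34cd'      (second group)
```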

[A-Z]+\s?[A-Z]*
This pattern matches one or more uppercase letters, followed by an optional whitespace character and then zero or more uppercase letters

^[A-Za-z0-9_.-]+@[A-Za-z0-9_.-]+\.[A-Za-z0-9_.-]+$
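This pattern matches most common email addresses. It is a simple approximation rather than a complete validator, so it rejects some valid addresses and accepts some invalid ones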
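To get a feeling for where this pattern draws the line, here is a quick Python check against a few sample addresses. The addresses are invented, and the comments only describe what this particular pattern does with them.

```python
import re

email_re = re.compile(r"^[A-Za-z0-9_.-]+@[A-Za-z0-9_.-]+\.[A-Za-z0-9_.-]+$")

samples = [
    "user.name@example.com",   # typical address
    "no-at-sign.example.com",  # missing "@"
    "user@localhost",          # no dot after the "@"
    "first+tag@example.com",   # "+" is not in the allowed set
    "a@b..c",                  # double dot slips through
]

results = {s: email_re.search(s) is not None for s in samples}
# {'user.name@example.com': True, 'no-at-sign.example.com': False,
#  'user@localhost': False, 'first+tag@example.com': False, 'a@b..c': True}
```

The last two samples show the limits: an address with a plus sign is rejected even though mail providers commonly accept it, while a domain with two dots in a row is accepted.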

(\w+)\s(?=\1)
This pattern matches a word and the whitespace character after it when the same text follows immediately, which is useful for finding repeated words
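Here `\1` refers back to whatever the first group captured, and `(?=...)` is a lookahead that checks what follows without consuming it. The Python sketch below tries the pattern on two invented sentences and also shows a stricter variant with `\b`, which is my own addition and not part of the pattern above.

```python
import re

# A doubled word is found as expected.
doubled = re.findall(r"(\w+)\s(?=\1)", "hello hello world")  # ['hello']

# Without word boundaries, the "is" inside "this" pairs up with the next "is".
partial = re.findall(r"(\w+)\s(?=\1)", "this is a test")     # ['is']

# Adding \b on both sides only accepts whole repeated words.
strict_re = re.compile(r"\b(\w+)\s(?=\1\b)")
strict_miss = strict_re.findall("this is a test")            # []
strict_hit = strict_re.findall("hello hello world")          # ['hello']
```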

\(\d{3}\) \d{3}-\d{4}
This pattern matches a U.S. phone number in the specific format of “(xxx) xxx-xxxx”. The parentheses and hyphen are treated as literal characters and must be included for the pattern to match
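A short Python example shows how strict this format is. The phone numbers are placeholders.

```python
import re

phone_re = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

# Only the number written exactly as "(xxx) xxx-xxxx" is found.
found = phone_re.findall("Call (555) 123-4567 or 555-123-4567.")
# ['(555) 123-4567']

# A missing space after the closing parenthesis is enough to fail.
no_space = phone_re.fullmatch("(555)123-4567") is not None  # False
```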

#[0-9a-fA-F]{6}
This pattern matches a hexadecimal color code in the format “#xxxxxx”, where each x is a hexadecimal digit
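As a quick check in Python, using a made-up line of CSS: the three-digit short form and non-hexadecimal letters are not matched.

```python
import re

hex_re = re.compile(r"#[0-9a-fA-F]{6}")

css = "color: #1a2B3c; border: #fff; background: #GGGGGG"
colors = hex_re.findall(css)                       # ['#1a2B3c']

# fullmatch checks that the whole string is a color code.
is_color = hex_re.fullmatch("#FF8800") is not None  # True
```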

Tips to create a good RegEx pattern

Creating good regular expression patterns can seem daunting at first, but they are actually quite simple to understand. The following tips will help you create the best RegEx pattern for your needs:

  1. Clearly define the scope of the pattern. What do you want it to match? Be as specific as possible.
  2. Take advantage of the built-in shorthand character classes such as \w, \s, and \d. These are powerful tools that can help you construct complex patterns with ease.
  3. Avoid overcomplicated patterns. A complex pattern may fit your use case well now, but it could be difficult to maintain and debug down the road.
  4. Test your pattern before using it. My favorite place to test patterns is regex101.com.
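Besides an online tester, tip 4 can also be covered by a tiny script that runs a pattern against inputs you expect to match and inputs you expect to fail. The following is only a sketch: the date pattern and the test cases are hypothetical and stand in for whatever pattern you are working on.

```python
import re

# Hypothetical pattern under test: a date written as YYYY-MM-DD
# (digits only, with no check that the month or day is in range).
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# Each case pairs an input with whether the whole string should match.
cases = [
    ("2024-01-31", True),
    ("2024-1-31", False),
    ("31-01-2024", False),
    ("2024-01-31 extra", False),
]

# Collect every case where the actual result differs from the expectation.
failures = [
    (text, expected)
    for text, expected in cases
    if (pattern.fullmatch(text) is not None) != expected
]

print("all cases pass" if not failures else f"failing cases: {failures}")
```

Listing the inputs that should fail is just as important as listing the ones that should match, since overly broad patterns are the most common mistake.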

Congratulations! You made it to the end. I hope you understand RegEx now and are able to start creating your own patterns. If you have questions, please leave a comment.
