Regular Expressions

By Darío Rivera

Posted On 2023-07-10 in RegExp

Regular expressions (regex or regexp) are sequences of characters that define search patterns. They are commonly used for data validation and text search and replace. Today, we will briefly cover the essential elements of regular expressions and how to construct them.

Literals

Literals are characters without special meaning. Among these characters, we find letters, numbers, and some symbols. Basically, a character is literal unless it is a metacharacter.

You can search for literal characters simply by writing the characters in question. For example:

hello
It will match hello.

20% off
It will match 20% off.

Your token is %&4ñ1@
It will match Your token is %&4ñ1@.
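
As a quick illustration, here is a minimal sketch of a literal search. It assumes Python's built-in re module, which this article does not prescribe, and the sample sentence is made up for the example.

```python
import re

# Hypothetical sample text; the pattern is the literal "20% off" shown above.
text = "Today only: 20% off"

# re.search looks for the pattern anywhere in the string.
match = re.search(r"20% off", text)

# A literal pattern matches itself, character by character.
found = match.group(0) if match else None
print(found)
```

None of the characters in 20% off is a metacharacter, so the pattern needs no escaping.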

Metacharacters

Metacharacters are non-alphanumeric characters with special meaning. Their purpose is to support pattern matching. These characters are:

^ . [ ] $ ( ) | * + ? { } \

What sets these characters apart is their special meaning. To match one of them literally, you must escape it with a backslash (\). Let's look at some examples:

The characters \[\] define character classes
It will match The characters [] define character classes.

The expression \(2n \+ 2\)\/2 always results in an even number
It will match The expression (2n + 2)/2 always results in an even number.
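
To see escaping in action, here is a small sketch that assumes Python's re module (any engine with the same metacharacters should behave similarly). It compares a hand-escaped pattern, a pattern produced by re.escape, and the unescaped text used as a pattern. Python does not require escaping the slash, so the hand-written pattern leaves it as is.

```python
import re

# The literal text we want to find, taken from the example above.
text = "The expression (2n + 2)/2 always results in an even number"

# Hand-escaped pattern: ( ) and + are metacharacters, so they get a backslash.
manual = r"The expression \(2n \+ 2\)/2 always results in an even number"

# re.escape builds an escaped pattern from a literal string.
automatic = re.escape(text)

manual_ok = re.fullmatch(manual, text) is not None
automatic_ok = re.fullmatch(automatic, text) is not None

# Unescaped, ( ) become a group and + a quantifier, so the literal text is not matched.
unescaped_ok = re.fullmatch(text, text) is not None

print(manual_ok, automatic_ok, unescaped_ok)
```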

Furthermore, these characters can be grouped according to their functionality, as we will see next.

Quantifiers

Quantifiers are special characters used after a token to indicate the number of times it can occur.

Character	Meaning
?	Indicates zero or one occurrence
*	Indicates zero or more occurrences
+	Indicates one or more occurrences
{n}	Indicates exactly n occurrences
{n,}	Indicates n or more occurrences
{n,m}	Indicates at least n and at most m occurrences

Let's see some examples:

registers?
It will match register, registers.

ox*o
It will match oo, oxo, oxxo, ...

ho+la
It will match hola, hoola, hooola, ...

FU4{2}1
It will match FU441.

F3{2,}X
It will match F33X, F333X, F3333X, ...

78{2,3}TX
It will match 788TX, 7888TX.
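
The quantifier examples above can be checked with a short script. This is a sketch that assumes Python's re module. The helper name full and the strings in the "should not match" column are made up for illustration.

```python
import re

# (pattern, strings that should match, strings that should not match)
cases = [
    (r"registers?", ["register", "registers"], ["registerss"]),
    (r"ox*o", ["oo", "oxo", "oxxo"], ["ox"]),
    (r"ho+la", ["hola", "hoola"], ["hla"]),
    (r"FU4{2}1", ["FU441"], ["FU41", "FU4441"]),
    (r"F3{2,}X", ["F33X", "F3333X"], ["F3X"]),
    (r"78{2,3}TX", ["788TX", "7888TX"], ["78TX", "78888TX"]),
]

def full(pattern, s):
    """True when the pattern matches the entire string."""
    return re.fullmatch(pattern, s) is not None

# For each pattern: (all good strings match, any bad string matches)
results = {
    pattern: (all(full(pattern, s) for s in good), any(full(pattern, s) for s in bad))
    for pattern, good, bad in cases
}

for pattern, outcome in results.items():
    print(pattern, outcome)
```

re.fullmatch is used so that the quantifier limits are actually enforced. A plain search would happily find 7888TX inside 78888TX.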

Wildcard

The dot character (.) will match any single character; in most engines the line break is excluded by default. Let's see some examples.

Z4.Q
It will match Z4aQ, Z4bQ, Z4LQ, Z4%Q, Z4*Q, ... even with the space character Z4 Q.
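
Here is a minimal sketch of the wildcard's behavior, assuming Python's re module. The candidate strings are invented for the example. In Python the dot skips the line break unless the re.DOTALL flag is set; other engines may differ.

```python
import re

pattern = r"Z4.Q"
candidates = ["Z4aQ", "Z4%Q", "Z4 Q", "Z4Q", "Z4abQ", "Z4\nQ"]

# The dot consumes exactly one character: not zero, not two.
matches = [s for s in candidates if re.fullmatch(pattern, s)]

# With re.DOTALL the dot also matches a line break.
dotall_ok = re.fullmatch(pattern, "Z4\nQ", flags=re.DOTALL) is not None

# To match a real dot, escape it.
literal_dot_ok = re.fullmatch(r"3\.14", "3.14") is not None
literal_dot_other = re.fullmatch(r"3\.14", "3x14") is not None

print(matches, dotall_ok, literal_dot_ok, literal_dot_other)
```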

Character Classes or Sets

Square brackets ([ ]) define a set of characters as a pattern. The set matches exactly one character, which can be any of the characters listed inside it. Let's see some examples.

[abc]
It will match a, b, c.

[ab12]
It will match a, b, 1, 2.

[a-z]
It will match any individual lowercase letter of the alphabet: a, b, c, ... z.

[a-zA-Z]
It will match any individual lowercase or uppercase letter of the alphabet: a, b, c, ... z, A, B, C, ... Z.

[0-9]
It will match any digit: 0, 1, 2, ... 9.

We can combine quantifiers with these character classes to indicate the number of times any of the elements in the group can occur.

register[s]?
It will match register, registers.

[abc]{3}
It will match any three characters taken from the set, repetitions included: aaa, aab, abc, cba, ccc, ... (27 combinations in total).

X[123456789]+
It will match X1, X2, X3, ... X11, X12, X13, ... X99, ... X999, ...

a[a-z]?
It will match a, aa, ab, ac, ... az.

You can also negate a character class by placing the caret (^) symbol as the first character inside the brackets.

[^0-9]
It will match any non-digit character: f, %, r, ... &.
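
The following sketch, which assumes Python's re module, checks two details about sets: a quantified set accepts repeated characters, and the caret only negates when it comes first. The sample strings are made up for the example.

```python
import itertools
import re

# Every three-letter string that can be built from a, b and c (27 in total).
combos = ["".join(t) for t in itertools.product("abc", repeat=3)]
all_match = all(re.fullmatch(r"[abc]{3}", s) for s in combos)

# A leading caret negates the set: everything except digits.
non_digits = re.findall(r"[^0-9]", "a1b2%3")

# A caret anywhere else in the set is an ordinary character.
digits_or_caret = re.findall(r"[0-9^]", "2^10")

print(len(combos), all_match, non_digits, digits_or_caret)
```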

Shorthand Character Classes

Because some character classes are used so often, there are abbreviations for them, which we will see below.

Abbreviation	Character Class	Match
\d	[0-9]	Digit
\D	[^0-9]	Non-digit
\s	[ \t\r\n\f\v]	Whitespace (space, tab, line breaks)
\S	[^ \t\r\n\f\v]	Non-whitespace
\w	[A-Za-z0-9_]	Alphanumeric characters plus "_"
\W	[^A-Za-z0-9_]	Non-alphanumeric and non-underscore

Here are two simple examples. You can easily infer the rest.

\d
It will match any digit: 0, 1, 2, ... 9.

\D
It will match any non-digit character: a, R, %, ... #.
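
Here is a short sketch of the shorthand classes, assuming Python's re module and an invented sample string. One caveat applies to Python specifically: on text strings \d and \w are Unicode-aware by default, so they only equal the bracket forms in the table when the re.ASCII flag is used.

```python
import re

# Hypothetical sample text containing digits, words, spaces and a tab.
sample = "Order 66, item_7: 3.5 kg\t(ok)"

digits = re.findall(r"\d", sample)
words = re.findall(r"\w+", sample)
spaces = re.findall(r"\s", sample)

# Python's default is Unicode-aware, so \w also accepts letters such as ñ.
accented = "a\u00f1o_1"
unicode_w = re.findall(r"\w", accented)
ascii_w = re.findall(r"\w", accented, flags=re.ASCII)
bracket_w = re.findall(r"[A-Za-z0-9_]", accented)

print(digits, words, spaces)
print(unicode_w, ascii_w, bracket_w)
```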

Some text editing programs may add specific shorthand character classes. This is the case in Vim with the following abbreviations.

Abbreviation	Character Class	Match
\l	[a-z]	Lowercase letters
\u	[A-Z]	Uppercase letters
\a	[a-zA-Z]	Alphabetic characters

Groups

Parentheses ( ) allow you to group segments of a regular expression in order to apply quantifiers or alternation to the entire group. Let's see the following examples.

(abc)+
It will match abc, abcabc, abcabcabc, ...

([a-z][0-9])+
It will match letter-digit pairs one or more times: a9, x2l6, j4t7p5, u8u7d8e8, ...

([0-9]{2}){2}
It will match four arbitrary digits: 0000, 0109, 2019, 8439, ...

[0-9]+([.][0-9]+)?
It will match any integer or decimal number: 10, 7.8, 14.5813, 0.366, ...
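
As a sketch of how a quantified group behaves, the snippet below runs the number pattern against a few candidates. It assumes Python's re module, and the rejected strings (7., .5, 1.2.3) are made up to show the limits of the pattern. In Python, parentheses also capture the text they matched, which is used here to inspect the optional decimal part.

```python
import re

number = re.compile(r"[0-9]+([.][0-9]+)?")

candidates = ["10", "7.8", "14.5813", "0.366", "7.", ".5", "1.2.3"]
accepted = [s for s in candidates if number.fullmatch(s)]

# The group holds the decimal part when present, or None when it was skipped.
decimal_part = number.fullmatch("7.8").group(1)
missing_part = number.fullmatch("10").group(1)

# A quantifier after a group repeats the whole group.
repeated_ok = re.fullmatch(r"(abc)+", "abcabcabc") is not None
partial_ok = re.fullmatch(r"(abc)+", "abcab") is not None

print(accepted, decimal_part, missing_part, repeated_ok, partial_ok)
```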

Alternation

Alternation, written with the pipe character (|), matches any one of several alternative regular expressions. Let's see some examples.

a|b
It will match a or b.

hola|hello|salut
It will match hola, hello, salut.

[a-z]|[0-9]
It will match a lowercase letter or digit: a, z, r, 8, 6, ...

#([a-z]|[0-9])#
It will match a lowercase letter or digit enclosed by the # character: #a#, #z#, #r#, #8#, #6#, ...
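
The next sketch, assuming Python's re module and an invented sample sentence, shows why the parentheses in the last pattern matter. Alternation has the lowest precedence, so without a group the pipe splits the whole expression in two.

```python
import re

greeting = re.compile(r"hola|hello|salut")
found = greeting.findall("hello, hola and salut")

# With a group, the alternation is limited to the part between the # characters.
grouped = re.compile(r"#([a-z]|[0-9])#")
grouped_letter = grouped.fullmatch("#a#") is not None
grouped_digit = grouped.fullmatch("#8#") is not None
grouped_upper = grouped.fullmatch("#Z#") is not None

# Without a group, the pattern reads as "#[a-z]" OR "[0-9]#".
ungrouped = re.compile(r"#[a-z]|[0-9]#")
ungrouped_left = ungrouped.fullmatch("#a") is not None
ungrouped_right = ungrouped.fullmatch("8#") is not None
ungrouped_both = ungrouped.fullmatch("#a#") is not None

print(found)
print(grouped_letter, grouped_digit, grouped_upper)
print(ungrouped_left, ungrouped_right, ungrouped_both)
```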

Anchors

Anchors do not match any specific characters; their meaning is purely positional. These metacharacters indicate the position where a match should occur.

Character	Meaning
^	Must match at the beginning of the string
$	Must match at the end of the string

So far, we have seen exact matches. However, it is worth noting that matches can occur within a broader context. Let's consider the following expression.

[a-z]+
This expression will match any lowercase word, for example: hello, wait, text, computer, etc. However, this match can occur within the context of a sentence.

Regular expressions are simple.

In the previous sentence, we have four matches with our regular expression: egular, expressions, are, and simple. The uppercase R, the spaces, and the final period are not lowercase letters, so they are left out of every match. Now, if we use anchors to delimit the position of our match as ^[a-z]+$, there would be no match at all. This is because we are indicating that the string to be evaluated must consist entirely of lowercase letters, from its first character to its last. Let's see more examples.

^[a-zA-Z]+
This expression will match the first word in a string, provided the string starts with a letter. Consider the following strings:

Three sad tigers
Hello world
Regular expressions are simple
588 messages read

Note that the last line does not generate a match because it starts with a number.
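
The anchor examples can be reproduced with the sketch below, which assumes Python's re module. The helper first_word is a name made up for this example. Note that an unanchored search also finds egular, the lowercase tail of Regular.

```python
import re

sentence = "Regular expressions are simple."

# Unanchored: every run of lowercase letters is a separate match.
unanchored = re.findall(r"[a-z]+", sentence)

# Anchored at both ends: the whole string must be lowercase letters.
anchored = re.search(r"^[a-z]+$", sentence)

lines = [
    "Three sad tigers",
    "Hello world",
    "Regular expressions are simple",
    "588 messages read",
]

def first_word(line):
    """Return the leading run of letters, or None if the line starts otherwise."""
    m = re.search(r"^[a-zA-Z]+", line)
    return m.group(0) if m else None

first_words = [first_word(line) for line in lines]

# By default ^ is the start of the whole string; with re.MULTILINE it also
# matches right after each line break.
per_line = re.findall(r"^[a-zA-Z]+", "\n".join(lines), flags=re.MULTILINE)

print(unanchored, anchored)
print(first_words, per_line)
```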