Regular Expressions
Regular expressions (regex or regexp) are sequences of characters that define search patterns. They are commonly used for data validation and text search and replace. Today, we will briefly cover the essential elements of regular expressions and how to construct them.
Literals
Literals are characters without special meaning. Among these characters, we find letters, numbers, and some symbols. Basically, a character is literal unless it is a metacharacter.
You can search for literal characters simply by writing the characters in question. For example:
hello
It will match hello.
20% off
It will match 20% off.
Your token is %&4ñ1@
It will match Your token is %&4ñ1@.
Metacharacters
Metacharacters are non-alphanumeric characters with special meaning. Their purpose is to support pattern matching. These characters are:
^ . [ ] $ ( ) | * + ? { } \
What sets these characters apart is, as their definition states, their special meaning. To match one of them literally, you must escape it with a backslash (\); unescaped, it is generally interpreted as an operator rather than as text. Let's see these examples:
The characters \[\] define character classes
It will match The characters [] define character classes.
The expression \(2n \+ 2\)\/2 always results in an even number
It will match The expression (2n + 2)/2 always results in an even number.
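As a rough illustration, the effect of escaping can be checked in Python's re module. The choice of Python is an assumption here, since the article does not name an engine. The sketch uses re.escape to build the escaped pattern, then compares it with the unescaped text used as a pattern.

```python
import re

# The sentence we want to find, written as plain text.
literal = "The expression (2n + 2)/2 always results in an even number"

# re.escape returns a pattern in which the special characters are escaped.
pattern = re.escape(literal)

# Unescaped, the parentheses form a group and "+" acts as a quantifier,
# so the raw text used as a pattern does not match itself.
unescaped_matches = re.fullmatch(literal, literal) is not None
escaped_matches = re.fullmatch(pattern, literal) is not None

print(unescaped_matches)  # False
print(escaped_matches)    # True
```

The exact output of re.escape differs between Python versions, so the sketch only checks whether the match succeeds.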
Furthermore, these characters can be grouped according to their functionality, as we will see next.
Quantifiers
Quantifiers are special characters used after a token to indicate the number of times it can occur.
Character | Meaning |
---|---|
? | Indicates zero or one occurrence |
* | Indicates zero or more occurrences |
+ | Indicates one or more occurrences |
{n} | Indicates exactly n occurrences |
{n,} | Indicates at least n occurrences |
{n,m} | Indicates between n and m occurrences, inclusive |
Let's see some examples:
registers?
It will match register, registers.
ox*o
It will match oo, oxo, oxxo, ...
ho+la
It will match hola, hoola, hooola, ...
FU4{2}1
It will match FU441.
F3{2,}X
It will match F33X, F333X, F3333X, ...
78{2,3}TX
It will match 788TX, 7888TX.
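The quantifier examples above can be verified with a small script. This is a minimal sketch assuming Python's re module; the helper name check and the rejected strings are made up for illustration. re.fullmatch is used so that the whole string must match the pattern.

```python
import re

# Each pattern from the examples, with strings it should accept and reject.
cases = {
    r"registers?": (["register", "registers"], ["registerss"]),
    r"ox*o": (["oo", "oxo", "oxxo"], ["ox"]),
    r"ho+la": (["hola", "hoola"], ["hla"]),
    r"FU4{2}1": (["FU441"], ["FU41", "FU4441"]),
    r"F3{2,}X": (["F33X", "F3333X"], ["F3X"]),
    r"78{2,3}TX": (["788TX", "7888TX"], ["78TX", "78888TX"]),
}

def check(pattern, accepted, rejected):
    """Return True if every accepted string matches and no rejected one does."""
    ok_accept = all(re.fullmatch(pattern, s) for s in accepted)
    ok_reject = not any(re.fullmatch(pattern, s) for s in rejected)
    return ok_accept and ok_reject

results = {p: check(p, acc, rej) for p, (acc, rej) in cases.items()}
for pattern, ok in results.items():
    print(pattern, ok)
```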
Wildcard
The dot character (.) will match any single character. In most engines it does not match a newline by default. Let's see some examples.
Z4.Q
It will match Z4aQ, Z4bQ, Z4LQ, Z4%Q, Z4*Q, ... and even a space, as in Z4 Q.
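The following sketch, assuming Python's re module, shows the wildcard in action. In this engine the dot skips the newline character unless the re.DOTALL flag is passed. The helper name matches is made up for the example.

```python
import re

pattern = r"Z4.Q"

def matches(text, flags=0):
    """Return True if the whole text matches the wildcard pattern."""
    return re.fullmatch(pattern, text, flags) is not None

print(matches("Z4aQ"))              # True
print(matches("Z4 Q"))              # True: a space is a character too
print(matches("Z4Q"))               # False: the dot needs exactly one character
print(matches("Z4\nQ"))             # False: newline is excluded by default
print(matches("Z4\nQ", re.DOTALL))  # True: DOTALL lets the dot match newline
```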
Character Classes or Sets
Square brackets ([ ]) define a set of characters. The set matches exactly one character, which may be any of those listed inside the brackets. Let's see some examples.
[abc]
It will match a, b, c.
[ab12]
It will match a, b, 1, 2.
[a-z]
It will match any individual lowercase letter of the alphabet: a, b, c, ... z.
[a-zA-Z]
It will match any individual lowercase or uppercase letter of the alphabet: a, b, c, ... z, A, B, C, ... Z.
[0-9]
It will match any digit: 0, 1, 2, ... 9.
We can combine quantifiers with these character classes to indicate the number of times any of the elements in the group can occur.
register[s]?
It will match register, registers.
[abc]{3}
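It will match any three-character string made up of a, b, and c, with repetition allowed: aaa, aab, abc, cba, ccc, ... (27 combinations in total).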
X[123456789]+
It will match X1, X2, X3, ... X11, X12, X13, ... X99, ... X999, ...
a[a-z]?
It will match a, aa, ab, ac, ... az.
You can also negate a character class by placing the caret (^) symbol immediately after the opening bracket. The class then matches any single character that is not in the set.
[^0-9]
It will match any non-digit character: f, %, r, ... &.
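A short sketch, assuming Python's re module, makes two points about sets concrete. A quantified set such as [abc]{3} picks each character independently, so strings with repeated letters like aaa are accepted as well. A negated set matches every character outside the listed range.

```python
import re
from itertools import product

# Build every three-letter string over the alphabet a, b, c (3 * 3 * 3 = 27).
candidates = ["".join(p) for p in product("abc", repeat=3)]
matched = [s for s in candidates if re.fullmatch(r"[abc]{3}", s)]
print(len(matched))  # 27

# A character outside the set breaks the match.
print(re.fullmatch(r"[abc]{3}", "abd") is None)  # True

# The negated set [^0-9] picks out every character that is not a digit.
non_digits = re.findall(r"[^0-9]", "f%r7&")
print(non_digits)  # ['f', '%', 'r', '&']
```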
Shorthand Character Classes
Some character classes are used so often that they have abbreviations, which we will see below.
Abbreviation | Character Class | Match |
---|---|---|
\d | [0-9] | Digit |
\D | [^0-9] | Non-digit |
\s | [ \t\n\r\f\v] | Whitespace: space, tab, newline, and similar (in Vim, only space and tab) |
\S | [^ \t\n\r\f\v] | Any non-whitespace character |
\w | [A-Za-z0-9_] | Alphanumeric characters plus "_" |
\W | [^A-Za-z0-9_] | Non-alphanumeric and non-underscore |
Here are two simple examples. You can easily infer the rest.
\d
It will match any digit: 0, 1, 2, ... 9.
\D
It will match any non-digit character: a, R, %, ... #.
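The remaining abbreviations can be tried out in the same way. The sketch below assumes Python's re module and a made-up sample string. Note that in Python, \d and \w on text strings also accept non-ASCII digits and letters unless the re.ASCII flag is used.

```python
import re

text = "Order 66, item_7!"

digits = re.findall(r"\d", text)      # every single digit
word_runs = re.findall(r"\w+", text)  # runs of letters, digits, and "_"
non_word = re.findall(r"\W", text)    # everything \w rejects

print(digits)     # ['6', '6', '7']
print(word_runs)  # ['Order', '66', 'item_7']
print(non_word)   # [' ', ',', ' ', '!']

# In this engine, \s covers newlines as well as spaces and tabs.
spaces = re.findall(r"\s", "a b\tc\nd")
print(spaces)     # [' ', '\t', '\n']
```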
Some text editing programs may add specific shorthand character classes. This is the case in Vim with the following abbreviations.
Abbreviation | Character Class | Match |
---|---|---|
\l | [a-z] | Lowercase letters |
\u | [A-Z] | Uppercase letters |
\a | [a-zA-Z] | Alphabetic characters |
Groups
Parentheses ( ) allow you to group segments of a regular expression in order to apply quantifiers or alternation to the entire group. Let's see the following examples.
(abc)+
It will match abc, abcabc, abcabcabc, ...
([a-z][0-9])+
It will match letter-digit pairs one or more times: a9, x2l6, j4t7p5, u8u7d8e8, ...
([0-9]{2}){2}
It will match four arbitrary digits: 0000, 0109, 2019, 8439, ...
[0-9]+([.][0-9]+)?
It will match any integer or decimal number: 10, 7.8, 14.5813, 0.366, ...
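To see how the optional group behaves, here is a minimal sketch assuming Python's re module. The sample strings are made up for illustration. Because the quantifier ? applies to the whole group, the decimal point and the digits after it must appear together or not at all.

```python
import re

number = re.compile(r"[0-9]+([.][0-9]+)?")

samples = ["10", "7.8", "14.5813", "7.", ".5"]
accepted = {s: number.fullmatch(s) is not None for s in samples}
print(accepted)
# {'10': True, '7.8': True, '14.5813': True, '7.': False, '.5': False}

# In this engine, a group also captures the text it matched.
fraction_of_7_8 = number.fullmatch("7.8").group(1)
fraction_of_10 = number.fullmatch("10").group(1)
print(fraction_of_7_8)  # .8
print(fraction_of_10)   # None: the optional group did not take part
```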
Alternation
Alternation, written with the pipe character (|), matches any one of several alternative expressions. Let's see some examples.
a|b
It will match a or b.
hola|hello|salut
It will match hola, hello, salut.
[a-z]|[0-9]
It will match a lowercase letter or a digit: a, z, r, 8, 6, ...
#([a-z]|[0-9])#
It will match a lowercase letter or a digit enclosed by the # character: #a#, #z#, #r#, #8#, #6#, ...
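Alternation can be exercised with the sketch below, which assumes Python's re module and made-up sample strings. Keep in mind that the set [a-z] covers lowercase letters only, so a string such as #Z# is rejected by the last pattern.

```python
import re

# Any one of the three greetings is accepted.
greeting = re.compile(r"hola|hello|salut")
found = greeting.findall("hello and salut, but not bonjour")
print(found)  # ['hello', 'salut']

# The group limits the alternation to the part between the # characters.
tagged = re.compile(r"#([a-z]|[0-9])#")
samples = ["#a#", "#8#", "#Z#", "#ab#"]
tagged_results = [tagged.fullmatch(s) is not None for s in samples]
print(tagged_results)  # [True, True, False, False]
```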
Anchors
Anchors do not match any specific characters; their meaning is purely positional. These metacharacters indicate the position where a match should occur.
Character | Meaning |
---|---|
^ | Must match at the beginning of the string |
$ | Must match at the end of the string |
So far, we have seen exact matches. However, it is worth noting that matches can occur within a broader context. Let's consider the following expression.
[a-z]+
This expression will match any lowercase word, for example: hello, wait, text, computer, etc. However, this match can occur within the context of a sentence.
Regular expressions are simple.
In the previous sentence, the expression finds four matches: egular, expressions, are, and simple. The uppercase R is not in the set, so it is skipped and the first match starts at the next letter; the final period is excluded for the same reason. Now, if we anchor the expression as ^[a-z]+$, there is no match at all. The anchors require the entire string, from beginning to end, to consist only of lowercase letters, and this sentence also contains an uppercase letter, spaces, and a period. Let's see more examples.
^[a-zA-Z]+
This expression will match the first word in a string, provided the string starts with a letter.
Three sad tigers
Hello world
Regular expressions are simple
588 messages read
Note that the last line does not generate a match because it starts with a digit.
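The behavior of anchors can be reproduced with the following sketch, assuming Python's re module. It first searches the sample sentence without anchors, then with both anchors, and finally applies ^[a-zA-Z]+ to each of the four example lines.

```python
import re

sentence = "Regular expressions are simple."

# Without anchors, every run of lowercase letters is a match.
all_matches = re.findall(r"[a-z]+", sentence)
print(all_matches)  # ['egular', 'expressions', 'are', 'simple']

# With both anchors, the whole string must be lowercase letters.
anchored = re.search(r"^[a-z]+$", sentence)
print(anchored)  # None

lines = [
    "Three sad tigers",
    "Hello world",
    "Regular expressions are simple",
    "588 messages read",
]

# Collect the leading word of each line, or None when there is no match.
first_words = []
for line in lines:
    match = re.search(r"^[a-zA-Z]+", line)
    first_words.append(match.group() if match else None)

print(first_words)  # ['Three', 'Hello', 'Regular', None]
```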