Word Boundary Abbreviations in Regular Expressions
In our previous post we studied regular expressions. We saw quantifiers, classes, abbreviated classes, alternation, groups, and other interesting concepts. Among the class abbreviations there are two that we did not cover because they are a little more complex than the others: the word boundary abbreviations.
Abbreviation | Match |
---|---|
\b | A word boundary |
\B | A non-word boundary |
Let's start with the word boundary (\b). Like the ^ and $ anchors, this class abbreviation does not consume any characters of the evaluated string. Rather, it matches a position: the point where a word character (letter, digit, or underscore) meets a non-word character, or the start or end of the string. Let's look at the following examples.
[0-9]+\b
The plate F58K expired in 2019.
Note that the only match is the year. After the number 58 comes the letter K, which is a word character (letter, digit, or underscore), so there is no word boundary at that position. After the number 2019, however, there is a period, which is not a word character, so the match is produced.
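To try the example above yourself, here is a minimal sketch using Python's built-in `re` module. The variable names are only illustrative, and other regex engines may differ in small details.

```python
import re

# Sample sentence from the example above.
text = "The plate F58K expired in 2019."

# One or more digits that end at a word boundary.
matches = re.findall(r"[0-9]+\b", text)

# "58" is followed by the word character "K", so it is skipped.
print(matches)  # ['2019']
```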
\b[a-z]+
Two and two _are %four.
In the previous expression we have placed the word boundary at the beginning of the expression. The words "and" and "two" produce a match because each is preceded by a space, which is not a word character, so there is a word boundary before them. The "are" in "_are", however, does not produce a match: the character before it is the underscore, which counts as a word character, so there is no boundary between "_" and "a". The word "four" is a match because the percent sign % is not a word character. Finally, the initial "Two" does not match because the class [a-z] only accepts lowercase letters, assuming a case-sensitive search.
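The same check can be sketched in Python with the built-in `re` module. The second call uses the `re.IGNORECASE` flag, which is our own addition and not part of the original expression, to show how the capitalized first word behaves.

```python
import re

# Sample sentence from the example above.
text = "Two and two _are %four."

# Lowercase letters that start at a word boundary.
lower_matches = re.findall(r"\b[a-z]+", text)
print(lower_matches)  # ['and', 'two', 'four']

# Assumption: ignoring case is our own variation, not the original pattern.
any_case_matches = re.findall(r"\b[a-z]+", text, flags=re.IGNORECASE)
print(any_case_matches)  # ['Two', 'and', 'two', 'four']
```

In both cases "are" is left out, because the underscore before it is a word character.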
As you can see, this class abbreviation is focused on word detection and allows you to quickly find the words in a text.
\b\w+\b
Lorem Ipsum simply dummy text of the printing and typesetting industry.
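As a quick illustration of word detection, the following sketch extracts every word from the sample text with Python's built-in `re` module. The variable names are illustrative.

```python
import re

# Sample sentence from the example above.
text = "Lorem Ipsum simply dummy text of the printing and typesetting industry."

# One or more word characters enclosed by word boundaries on both sides.
words = re.findall(r"\b\w+\b", text)

print(len(words))  # 11
print(words[:3])   # ['Lorem', 'Ipsum', 'simply']
```

The final period is not a word character, so it is not part of the last word.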
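The opposite abbreviation \B matches every position that is not a word boundary, that is, a position between two word characters (letters, digits, or underscore) or between two non-word characters. It is useful for finding text that sits inside a word rather than at its start or end.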
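A small sketch, again assuming Python's built-in `re` module, contrasts the two abbreviations on the sentence used earlier. The second sample string ("cat concatenate") is a hypothetical example of our own.

```python
import re

text = "Two and two _are %four."

# \b fails before "are" because the underscore is a word character...
with_boundary = re.findall(r"\bare", text)
print(with_boundary)  # []

# ...while \B succeeds at that same position.
without_boundary = re.findall(r"\Bare", text)
print(without_boundary)  # ['are']

# Hypothetical sample: find "cat" only when it is buried inside a word.
inner = [m.start() for m in re.finditer(r"\Bcat\B", "cat concatenate")]
print(inner)  # [7]
```

The standalone "cat" at index 0 is ignored because it has word boundaries on both sides.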