Chanaka's Blog | Regular Expressions Demystified

Overview

Regular expressions are one of the most fundamental tools used in Natural Language processing.
Used to search for strings using a pattern.
Platform independent: same regex used in your favorite programming language can be used in a text editor like VSCode.
With regular expressions you can return a single match or an entire list of them that match the pattern.
There are many variants of regular expressions. This article will dwell into extended regular expressions.
Regular expressions are case sensitive. (Therefore, One != one)
Regular expressions are greedy by default.

Follow along

An interactive python notebook with code referencing this examples can be found at: Google Colab

Sequences

An ordered combination of one or more characters.

Ex:

For the sentence: once upon a time in oncelot, there was once a roncean

Expression	Matches	Meaning
once	4: once upon a time in oncelot, there was once a roncean	characters o,n,c,e in order

Disjunction

Disjunction allows us to search for strings with variations.
The number of possible variations that are allowed for a string are enclosed in square brackets. ([])
For example, if we have a string: "Natural habitats with natural biology are affected by natural disasters".
There are a few characters that when used within a square bracket give a special meaning: "^", "-".
-: is used to define a range.
For example, if we want to search for all numbers we can use: [0123456789].
However, this is quite bothersome and tiring, this is where we can use: [0-9] to signify a range of 0 to 9.
^: Used to denote negation.
^ has to be the first character within square brackets for it to count as negation.
If ^ appears at a different position within square brackets such as the middle it denotes a normal "^" sign.

Ex:

For the text: Natural habitats with natural biology are affected by natural disasters

Expression	Matches	Meaning
natural	2: Natural habitats with natural biology are affected by natural disasters	sequence of "natural"
Natural	1: Natural habitats with natural biology are affected by natural disasters	sequence of "Natural"
[nN]atural	3: Natural habitats with natural biology are affected by natural disasters	sequence of both "Natural" and "natural"

For the text: abcAZ907

Expression	Matches	Meaning
[abcdefghijklmnopqrstuvxyz]	3: abcAZ907	any lower case character
[a-z]	3: abcAZ907	any lowercase character
a[a-z]	1: abcAZ907	a followed by any lowercase character
[7-9]	2: abcAZ907	digits 7, 8 or 9

For the text: abcAZ907

Expression	Matches	Meaning
[^abcdefghijklmnopqrstuvxyz]	6: abcAZ907^	any character that is not a lowercase letter
[^a-z]	6: abcAZ907^	any character that is not a lowercase letter
[7^9]	3: abcAZ907^	digits 7 or 9 or character ^
[^7^9]	6: abcAZ907^	any character other than digits 7 or 9 or character ^

Counters

Counters allow us to define patterns with a variable number of a characters.
?: Zero or one instance.
*: Zero or more instances
+: One or more instances
{number}: Specific number of instances

Ex:

For the text: The word color is sometimes referred to as colour in some interpretations.

Expression	Matches	Meaning
color	1: The word color is sometimes referred to as colour in some interpretations.	sequence of color
colour	1: The word color is sometimes referred to as colour in some interpretations.	sequence of colour
colou?r	2: The word color is sometimes referred to as colour in some interpretations.	sequence of color or colour

For the expression: ba*!

Text	Matches	Meaning
b!	1: b!	character 'b' followed by zero or more 'a' followed by '!'
ba!	1: ba!	character 'b' followed by zero or more 'a' followed by '!'
baaaaaa!	1: baaa!	character 'b' followed by zero or more 'a' followed by '!'

For the expression: ba+!

Text	Matches	Meaning
b!	0: b!	character 'b' followed by one or more 'a' followed by '!'
ba!	1: ba!	character 'b' followed by one or more 'a' followed by '!'
baaaaaa!	1: baaa!	character 'b' followed by one or more 'a' followed by '!'

For the text: baaaaa!

Expression	Matches	Meaning
ba{5}!	1: baaaaa!	character b followed by 5 a's followed by !
ba{4}!	0: baaaaa!	character b followed by 4 a's followed by !
ba{3, 6}!	1: baaaaa!	character b followed by a minimum of 3 a's upto a maximum of 6 a's followed by !
ba{4, }!	1: baaaaa!	character b followed by a minium of 4 a's followed by !
ba{, 3}!	0: baaaaa!	character b followed by a maximum of 3 a's followed by !
ba{, 6}!	1: baaaaa!	character b followed by a maximum of 6 a's followed by !

Anchors

Regular expressions used to denote a particular place for a string.
^: Used to denote the start of a line.
$: Used to denote the end of a line.
\b: Used to denote a word boundary.
\B: Used to denote a non word boundary.
A word in regex: sequence of letters, digits and underscores(_).
Thus, special characters like "$" or "!" will not count as a word boundary

Ex:

For python specifically, you need to enable the multiline flag to identify new lines:

Regex Setup for processing new line characters

flags = re.MULTILINE if multiline else 0
compiled_pattern = re.compile(pattern, flags)
matches = compiled_pattern.finditer(text)

For the text: There was the person that was running t^he shop and they came and went.\nThe shop opened next day t\nhe

Expression	Matches	Meaning
^[Tt]he	2: There was the person that was running t^he shop and they came and went.The shop opened next day the	Line start with The or the
t^he	0: There was the person that was running t^he shop and they came and went.The shop opened next day the	This should be the sequence "t^he" but in python it needs to be escaped
t\^he	1: There was the person that was running t^he shop and they came and went.The shop opened next day the	The sequence 't^he'
ent.$	1: There was the person that was running t^he shop an$d they came and $99 went.The shop opened next day the	Sentence that ends with the sequence ent.
he$	1: There was the person that was running t^he shop an$d they came and $99 went. The shop opened next day the	Sentence that ends with the sequence "he"
\$99	1: There was the person that was running t^he shop an$d they came and $99 went. The shop opened next day the	The sequence '$99'. $ needs to be escaped
$99	0: There was the person that was running t^he shop an$d they came and $99 went. The shop opened next day the	Will not match anything $ needs to be escaped
an$d	1: There was the person that was running t^he shop an$d they came and $99 went.The shop opened next day the	The sequence 'an$d'

For the text: The man wrote a thesis on rafther and there were others who read it and the people who read it were satisfied. The_ $99! price was on$the tag with an id code of _99_ 99a.

Expression	Matches	Meaning
\b[Tt]he\b	3: The man wrote a thesis on rafther and there were others who read it and the people who read it were satisfied. The_ $99! price was on$the tag with an id code of _99_ 99a.	sequence of 'The' or 'the' with no letters, digits or underscores on either side of the sequence separated by a whitespace
\B[Tt]he\B	2: The man wrote a thesis on rafther and there were others who read it and the people who read it were satisfied. The_ $99! price was on$the tag with an id code of _99_ 99a.	sequence of 'The' or 'the' such that there are letter, digits or underscores on either side
\b99\b	1: The man wrote a thesis on rafther and there were others who read it and the people who read it were satisfied. The_ $99! price was on$the tag with an id code of _99_ 99a.	sequence of '99' without letters, digits or underscores next to the left or right of the sequence separated by a whitespace
\B99\B	1: The man wrote a thesis on rafther and there were others who read it and the people who read it were satisfied. The_ $99! price was on$the tag with an id code of _99_ 99a.	sequence of 99 such that there are letters, digits or underscores next to the left or right of the sequence
\bthe\B	2: The man wrote a thesis on rafther and there were others who read it and the people who read it were satisfied. The_ $99! price was on$the tag with an id code of _99_ 99a.	sequence of the such that there are no letters, digits or underscores to the left before a whitespace character and a letter, digit or underscore to the right before a whitespace

Wildcard character

. : Used to denote the possibility of any character

Ex:

For the expression te.m

Text	Matches	Meaning
team	1: team	sequence of 'te' followed by any character followed by 'm'
te!m	1: te!m	sequence of 'te' followed by any character followed by 'm'
tem	0: tem	sequence of 'te' followed by any character followed by 'm'
te m	1: te m	sequence of 'te' followed by any character followed by 'm'
te am	0: te am	sequence of 'te' followed by any character followed by 'm'

For the expression te.*m

Text	Matches	Meaning
tem	1: tem	sequence of 'te' followed by zero of more of any character followed by 'm'
te am	1: te am	sequence of 'te' followed by zero of more of any character followed by 'm'
te aasdsad1m	1: te as1asdasd as m	sequence of 'te' followed by zero of more of any character followed by 'm'

Order precedence

The most important part of understanding a regex is understanding precedence.
Lets say we have the regex one|two which means search for a string with a sequence of one or two but for onewo or ontwo.
Consider boe* which means search for a string with the sequence bo with zero or more occurrences of e but not for boeboe or boeboeboe.
This is due to order precedence.
Certain operators have a higher order of precedence and evaluated first similar to algebra.
Similar to evaluating an algebraic expression, regex also needs to be evaluated in order.
Parenthesis allows us to increase the order of precedence

Order of precedence is as follow:

Operator	Symbols
Parenthesis	()
Counters	*, +, ?, {}
Sequences and Anchors	sequence, ^, $, \b, \B
Disjunction	\|

Since Disjunction has lower precedence than Sequences: one|two the two sequences one and two are first evaluated then the disjunction.
Since counters have higher priority than sequences: boe* the sequence is evaluated after and counter is applied to the character to its left.
However with parenthesis we can manipulate the order to our liking
If we want to search for onewo or ontwo instead of one or two we can use the regex: on(e|t)wo
If we want to search for poepoepoe where the string poe is repeated 0 or more times: (poe)*

Ex:

For the expression poe*

Text	Matches	Meaning
po	1: po	sequence of 'po' followed by 0 or more 'e's
poeee	1: poeeee	sequence of 'po' followed by 0 or more e's
poepoe	1: poepoe	sequence of 'po' followed by 0 or more e's

For the expression: (poe)*

Text	Matches	Meaning
poepoe	2: poepoe	sequence of 'poe' repeated 0 or more times
	1:	sequence of 'poe' repeated 0 or more times

For the expression: one|two

Text	Matches	Meaning
one	1: one	sequence of 'one' or 'two'
two	1: two	sequence of 'one' or 'two'
one two	2: one two	sequence of 'one' or 'two'
onewo	1: ontwo	sequence of 'one' or 'two'
ontwo	1: onewo	sequence of 'one' or 'two'

For the expression: on(e|t)wo

Text	Matches	Meaning
one	0: one	sequence of 'on' followed by either 'e' or 't' followed by a sequence of 'wo'
two	0: two	sequence of 'on' followed by either 'e' or 't' followed by a sequence of 'wo'
one two	0: one two	sequence of 'on' followed by either 'e' or 't' followed by a sequence of 'wo'
onewo	1: ontwo	sequence of 'on' followed by either 'e' or 't' followed by a sequence of 'wo'
ontwo	1: onewo	sequence of 'on' followed by either 'e' or 't' followed by a sequence of 'wo'

Other special characters:

These characters are mainly used as shorthand for common patterns:

Characters	Expansion	Meaning
\d	[0-9]	any digit
\D	[^0-9]	any non digit
\w	[a-zA-Z0-9_]	any uppercase or lowercase letter or digit or underscore
\W	[^\w]	any character that is not a letter, digit or underscore
\s	[\r\t\n\f]	any whitespace, tab or newline
\S	[^\s]	non whitespace

Greedy expressions

Regular expressions are greedy by default.
Lets say we have a input string of "His name was poeee" to which we apply the regular expression *"poe"**.
The regular expression searches for a string with a sequence of 'po' followed by 0 or more e's.
Now given our input string it can return po, poe, poee or poeee.
However, from these options the string that will be returned as an output would be poeee.
This is because regex are greedy and will try to match the expression to the longest possible string.
However, we can also enforce a non-greedy approach by using the ? symbol along with the * or + symbols.

Ex:

For the text: His name was poeee

Expression	Matches	Meaning
poe*	1: His name was poeee	The longest string match of the sequence 'po' followed by 0 or more e's
poe+	1: His name was poeee	The longest string match of the sequence 'po' followed by 1 or more e's
poe*?	1: His name was poeee	The shortest string match of the sequence 'po' followed by 0 or more e's
poe+?	1: His name was poeee	The shortest string match of the sequence 'po' followed by 1 or more e's

Substitutions and Capture Groups

In addition to changing the default order of precedence, parenthesis(()) can also be used to capture and store patterns.
If a pattern within a parenthesis matches an expression that pattern gets stored in a memory register.
For example lets say we have an input string of: He was big but his opponent was bigger.
We then apply a regular expression to it of: He was (.)* but his opponent was \1ger.
This expression will create a valid match as the parenthesis matches the string big and stores it within a memory register references by \1.

Ex:

For the expression: He was (.*) but his opponent was \1er

Text	Matches	Meaning
He was big but his opponent was biger	1: He was big but his opponent was biger	The parenthesis captures the sequence between was and but and stores it in register referenced by \1 which is big
He was fast but his opponent was faster	1: He was fast but his opponent was faster	The parenthesis captures the sequence between was and but and stores it in register referenced by \1 which is fast
He was fast but his opponent was biger	0: He was fast but his opponent was biger	The parenthesis captures the sequence between was and but and stores it in register referenced by \1 which is fast

Regular expressions are evaluated from left to right and therefore the strings matched by the expressions get stored in memory registers in order.
It is possible to store multiple expressions and reference them using the sequential order they are stored in.
\1 will store the fist expression captured within parenthesis while \2 will match the second and so on.
A common use case for these capture groups is for substitutions.

Ex:

For the input string: 42 apples, 17 bananas, 3 oranges

Lets say we want to get the total number of fruits

./substiution.py

import re
# Define a regular expression pattern with capture groups
pattern = r'(\d+) (\w+)'
# Input string
text = '42 apples, 17 bananas, 3 oranges'
# Create an empty list to store fruits
fruits_list = []
count = 0
# Define a substitution function that uses captured groups
def repl(match):
    global count
    number = int(match.group(1))
    fruit = match.group(2)
    count += number  # Increment the 'count' variable
    fruits_list.append(fruit)  # Add 'fruit' to the 'fruits_list'    
    # Manipulate the captured data or construct a new string
    return f'There are {number} {fruit} in the basket.'
# Perform substitutions using re.sub()
result = re.sub(pattern, repl, text)
print('There are {0} fruits in the basket of {1}'.format(count, ','.join(fruits_list)))
print(result)

This will print:

console

There are 62 fruits in the basket of apples,bananas,oranges
There are 42 apples in the basket., There are 17 bananas in the basket., There are 3 oranges in the basket.

Parenthesis by default stores the matched expression in a memory register.
However, there may be times we want to avoid such a behavior and use it just as a means of altering the order of precedence of an expression.
This is where we can use the ? symbol to denote that we want to avoid such a behavior.
For example if we have the regex: (? some|a few) (people|cats) like some \1.
The expression that will be stored in memory would be people|cats and not some|few.

Ex:

For the expression: (?:some|few) (people|cats) like some \1

Text	Matches	Meaning
some people like some some	0: some people like some some	either the sequence 'some' or 'a few' followed by 'people' or 'cats' followed by 'like some' followed by either 'people' or 'cats' based on what was matched earlier
some people like some few	0: some people like some few	either the sequence 'some' or 'a few' followed by 'people' or 'cats' followed by 'like some' followed by either 'people' or 'cats' based on what was matched earlier
some people like some cats	0: some people like some cats	either the sequence 'some' or 'a few' followed by 'people' or 'cats' followed by 'like some' followed by either 'people' or 'cats' based on what was matched earlier
some people like some people	1: some people like some people	either the sequence 'some' or 'a few' followed by 'people' or 'cats' followed by 'like some' followed by either 'people' or 'cats' based on what was matched earlier

Lookahead assertions

By default expressions are evaluated from left to right.
Lookahead assertions allow us to scan the text ahead without advancing the match pointer.
Lookahead assertions are denoted as either ?= or ?! within ().
?= is used to check for whether the expression matches and ?! is used to check if the expression does not match.

Ex:

For the text: I love apple pie, but not apple juice or apple tart. apple pie is the best.

Expression	Matches	Meaning
apple(?= pie)	2: I love apple pie, but not apple juice or apple tart. apple pie is the best.	the sequence 'apple' followed by ' pie'
apple(!= pie)	2: I love apple pie, but not apple juice or apple tart. apple pie is the best.	the sequence 'apple' followed by anything but ' pie'

Practical examples

Based on the concepts above lets' use Regex for one of the most common uses in a web application.
For testing password strength.
Lets first define the criteria needed for a modern password:
At,least 8 characters.
Must contain an uppercase character, a lowercase character, a number and a symbol
Lets start by defining our first requirement:
.{8,}
The wildcard proceeded by a counter of a minimum of 8 ensures that our password

./pwdStrength.py

def test_password(password, pattern):
    compiled_pattern = re.compile(pattern)
    if compiled_pattern.match(password):
        print('Password "{0}" passes for expression: {1}\n'.format(password, pattern))
    else:
        print('Password "{0}" fails for expression: {1}\n'.format(password, pattern))
 
# Allow any characters (wildcard)
starting_regex = r'.*'
password1 = ''
password2 = 'abc'
password3 = 'abcdefgh'
test_password(password1, starting_regex)
test_password(password2, starting_regex)
test_password(password3, starting_regex)
 
# Minimum 8 characters requirement
min_char_regex = r'^.\{8,\}$'
password4 = 'abcdefghasdaaaaaaaaaaaa'
test_password(password1, min_char_regex)
test_password(password2, min_char_regex)
test_password(password3, min_char_regex)
test_password(password4, min_char_regex)
 
# We still dont have upper case requirement
# Our password needs to contain atleast one upper case character
# Lets use a lookahead expression to check whether there is a upper case letter
upper_case_regex = r'^(?=.*?[A-Z]).\{8,\}$'
password5 = 'aAa'
password6 = 'AVCDEFGHI'
password7 = 'asdaAdasd123'
test_password(password1, upper_case_regex)
test_password(password2, upper_case_regex)
test_password(password3, upper_case_regex)
test_password(password4, upper_case_regex)
test_password(password5, upper_case_regex)
test_password(password6, upper_case_regex)
test_password(password7, upper_case_regex)
 
# We still have some more requirements;
# Need a lower case, number and a special character
final_regex = r'^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-]).\{8,\}$'
password7 = 'asdaAdasd123'
password8 = 'aA2!'
password9 = '#a3aA'
password10 = 'asdaAdasd12!3'
password11 = 'Password123#'
test_password(password1, final_regex)
test_password(password2, final_regex)
test_password(password3, final_regex)
test_password(password4, final_regex)
test_password(password5, final_regex)
test_password(password6, final_regex)
test_password(password7, final_regex)
test_password(password8, final_regex)
test_password(password9, final_regex)
test_password(password10, final_regex)
test_password(password11, final_regex)

This will print:

console

Password "" passes for expression: .*
 
Password "abc" passes for expression: .*
 
Password "abcdefgh" passes for expression: .*
 
Password "" fails for expression: ^.{8,}$
 
Password "abc" fails for expression: ^.{8,}$
 
Password "abcdefgh" passes for expression: ^.{8,}$
 
Password "abcdefghasdaaaaaaaaaaaa" passes for expression: ^.{8,}$
 
Password "" fails for expression: ^(?=.*?[A-Z]).{8,}$
 
Password "abc" fails for expression: ^(?=.*?[A-Z]).{8,}$
 
Password "abcdefgh" fails for expression: ^(?=.*?[A-Z]).{8,}$
 
Password "abcdefghasdaaaaaaaaaaaa" fails for expression: ^(?=.*?[A-Z]).{8,}$
 
Password "aAa" fails for expression: ^(?=.*?[A-Z]).{8,}$
 
Password "AVCDEFGHI" passes for expression: ^(?=.*?[A-Z]).{8,}$
 
Password "asdaAdasd123" passes for expression: ^(?=.*?[A-Z]).{8,}$
 
Password "" fails for expression: ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-]).{8,}$
 
Password "abc" fails for expression: ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-]).{8,}$
 
Password "abcdefgh" fails for expression: ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-]).{8,}$
 
Password "abcdefghasdaaaaaaaaaaaa" fails for expression: ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-]).{8,}$
 
Password "aAa" fails for expression: ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-]).{8,}$
 
Password "AVCDEFGHI" fails for expression: ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-]).{8,}$
 
Password "asdaAdasd123" fails for expression: ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-]).{8,}$
 
Password "aA2!" fails for expression: ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-]).{8,}$
 
Password "#a3aA" fails for expression: ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-]).{8,}$
 
Password "asdaAdasd12!3" passes for expression: ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-]).{8,}$
 
Password "Password123#" passes for expression: ^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[#?!@$%^&*-]).{8,}$

Generating a final regular expression consists of fixing two kinds of errors.
Removing false positives. (Strings we incorrectly matched like abc for the expressions .*) also called precision.
Including false negatives. (Strings we missed) also called recall.
We did not have false negatives in the example above but consider we had the expression instead: ^(?=.?[A-Z])(?=.?[a-z])(?=.?[0-9])(?=.?[#?!@$%^&*-]).{9,}$
We would then miss passwords like "Passwo12#" with 8 characters because we defined a minimum of 9 characters in our expression.
Such a case would be considered a false negative.