Introduction
If you have ever worked with text data in Python, you know that most of the time you have to deal with a lot of it. This means that the specific text you’re looking for (names, dates, email addresses, etc.) needs to be extracted before it’s ready to use.
This is where regular expressions (regex) come in handy. Thanks to their syntax, you can describe the pattern you wish to extract and save a great deal of time.
The start and end of a string
- ^: matches at the start of a string, so an occurrence of the pattern further inside the string is rejected
- $: matches at the end of a string, so an occurrence of the pattern further inside the string is rejected
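As a minimal sketch of how the two anchors narrow down a match, the snippet below searches the same made-up sample string with and without them. The sample text and variable names are illustrative assumptions, not part of any particular dataset.

```python
import re

text = "cat concat"

# Without anchors, "cat" is found wherever it occurs.
anywhere = re.findall(r"cat", text)

# ^ only allows a match at the very start of the string.
at_start = re.findall(r"^cat", text)

# $ only allows a match at the very end of the string.
at_end = re.findall(r"cat$", text)

# With both anchors, the whole string has to be exactly "cat".
whole = re.search(r"^cat$", text)

print(anywhere)  # ['cat', 'cat']
print(at_start)  # ['cat']
print(at_end)    # ['cat']
print(whole)     # None
```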
Meta characters (., \w, \d, \b, and \s)
In regexes, meta characters are characters with a special meaning, such as the following:
- .: matches any single character except a newline
- \d: matches a digit (0 - 9)
- \D: the reverse of \d, i.e. matches any character except a digit
- \w: matches a word character (a - z, A - Z, 0 - 9, and _)
- \W: the reverse of \w
- \b: matches at the position between a word character (\w) and a non-word character (\W)
- \s: matches a whitespace character (space, tab, and newline)
- \S: the reverse of \s
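The meta characters above can be tried out with a few calls from Python's re module. This is only a sketch: the sample strings and variable names are made up for illustration.

```python
import re

text = "Order 66 shipped to R2_D2 on day 7"

digits = re.findall(r"\d", text)     # every single digit
words = re.findall(r"\w+", text)     # runs of word characters
any_char = re.findall(r"d.y", text)  # "d", any one character, "y"

# \b keeps "to" from matching inside longer words such as "store".
sentence = "to the store, not tomorrow"
loose = re.findall(r"to", sentence)
strict = re.findall(r"\bto\b", sentence)

# \s+ splits on any run of whitespace; \D removes everything but digits.
fields = re.split(r"\s+", "a b\tc\nd")
phone_digits = re.sub(r"\D", "", "+1 (555) 010-9999")

print(digits)        # ['6', '6', '2', '2', '7']
print(words)         # ['Order', '66', 'shipped', 'to', 'R2_D2', 'on', 'day', '7']
print(any_char)      # ['day']
print(loose)         # ['to', 'to', 'to']
print(strict)        # ['to']
print(fields)        # ['a', 'b', 'c', 'd']
print(phone_digits)  # 15550109999
```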
Quantifiers (*, +, ?, and {})
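Quantifiers specify how many times the preceding character (or group) may repeat.
- *: the preceding item occurs 0 or more times
- +: the preceding item occurs 1 or more times
- ?: the preceding item occurs 0 or 1 time
- {m}: the preceding item occurs exactly m times
- {m,}: the preceding item occurs m or more times
- {,n}: the preceding item occurs from 0 up to n times
- {m,n}: the preceding item occurs from m up to n times
Note that Python does not allow spaces inside the braces: a pattern such as {m, n} is matched as literal text rather than treated as a quantifier.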
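To make the differences between the quantifiers concrete, here is a small sketch that applies each one to the letter b in a made-up sample string. The \b anchors are an assumption added so that each pattern has to cover a whole token. The brace quantifiers are written without spaces, because Python's re module treats a brace expression containing spaces as literal text.

```python
import re

text = "a ab abb abbb abbbb"

# Each pattern is "a" followed by some number of "b", as a whole word.
star = re.findall(r"\bab*\b", text)         # 0 or more
plus = re.findall(r"\bab+\b", text)         # 1 or more
optional = re.findall(r"\bab?\b", text)     # 0 or 1
exactly = re.findall(r"\bab{2}\b", text)    # exactly 2
at_least = re.findall(r"\bab{2,}\b", text)  # 2 or more
at_most = re.findall(r"\bab{,2}\b", text)   # 0 up to 2
between = re.findall(r"\bab{2,3}\b", text)  # 2 up to 3

# With a space inside the braces, the braces are matched literally.
literal = re.findall(r"ab{2, 3}", "abb ab{2, 3}")

print(star)      # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(plus)      # ['ab', 'abb', 'abbb', 'abbbb']
print(optional)  # ['a', 'ab']
print(exactly)   # ['abb']
print(at_least)  # ['abb', 'abbb', 'abbbb']
print(at_most)   # ['a', 'ab', 'abb']
print(between)   # ['abb', 'abbb']
print(literal)   # ['ab{2, 3}']
```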
Group and capture
To capture a specific group of characters within a pattern, just add parentheses.
- We can create a Python regex group by enclosing part of a regex in parentheses. For example, (\d{4})-(\d{2})-(\d{2}) is a regex for a YYYY-MM-DD date with three groups, numbered from left to right: (\d{4}) has index 1 and captures the year, the first (\d{2}) has index 2 and captures the month, and the second (\d{2}) has index 3 and captures the day. After matching a text against a regex with one of the search functions, we can retrieve the substrings captured by all groups with the groups() method of the match object.
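The date pattern described above could be used roughly as follows. The sample sentence and the variable names are assumptions made for this sketch.

```python
import re

date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# search() returns a match object, or None if nothing matches.
match = date_pattern.search("Released on 2023-04-15.")

if match:
    full_date = match.group(0)          # the whole matched text
    year = match.group(1)               # first group
    month = match.group(2)              # second group
    day = match.group(3)                # third group
    all_groups = match.groups()         # every group at once, as a tuple

    print(full_date)   # 2023-04-15
    print(year)        # 2023
    print(month)       # 04
    print(day)         # 15
    print(all_groups)  # ('2023', '04', '15')
```

Note that the captured values are strings; convert them with int() if numbers are needed.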
Match with an expression
Unlike parentheses, square brackets don’t capture anything; they match any single character from the set listed inside them.
- [789]: matches a single character that is 7, 8, or 9
- [7-9]: same as the previous
- [^789]: the reverse of [789], i.e. matches any single character except 7, 8, or 9. Here, a ^ at the start of the brackets is used as negation.
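A short sketch of the three bracket forms, applied to a made-up string of digits, also shows that brackets alone do not create a group while parentheses do. The sample string and variable names are illustrative assumptions.

```python
import re

text = "4567890"

listed = re.findall(r"[789]", text)    # any one of 7, 8, 9
ranged = re.findall(r"[7-9]", text)    # same set, written as a range
negated = re.findall(r"[^789]", text)  # any character except 7, 8, 9

# Brackets match but do not capture; parentheses add a captured group.
no_groups = re.search(r"[789]", text).groups()
one_group = re.search(r"([789])", text).groups()

print(listed)     # ['7', '8', '9']
print(ranged)     # ['7', '8', '9']
print(negated)    # ['4', '5', '6', '0']
print(no_groups)  # ()
print(one_group)  # ('7',)
```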