Regular Expressions are implemented in almost all programming languages, which you can take as a sign that they're immensely useful.
They can be used to find email addresses: (\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b)
Or to validate password strength: (^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{8,}$)
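To see these two in action before we dissect them, here is a minimal sketch (the sample strings are made up for illustration):
import re
# the email pattern uses upper case classes, so we match case-insensitively
email_pattern = r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b'
print(re.findall(email_pattern, 'Contact risi@stanford.edu or admin@example.org.', re.IGNORECASE))
# the password pattern requires at least one letter, one digit, and 8+ characters
password_pattern = r'^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{8,}$'
print(bool(re.match(password_pattern, 'hunter2hunter2')))  # True
print(bool(re.match(password_pattern, 'short1')))          # False: too short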
At this point, their main downside should become obvious: They're utterly unreadable.
How did anyone come up with the monster regexes above? The answer is that when writing regexes, you usually start out with a specific task and then add more and more complexity to deal with special cases. We'll work through some examples below.
Regexes themselves are made up of building blocks.
For a good, quick reference, check out http://rubular.com/
For the following, we'll work with the string "This is a TEST string 123. risi@stanford.edu".
# import the regular expressions package
import re
test_string = "This is a TEST string 123. risi@stanford.edu"
[a-z]
matches any one lower case character.
# re.findall returns a list of all regex matches.
print(re.findall(r'[a-z]', test_string))
[A-Z]
matches any upper case letter.
print(re.findall(r'[A-Z]', test_string))
These can be combined.
[a-zA-Z0-9#]
matches all lower and upper case letters as well as all numbers and, for good measure, also '#' even though it does not appear in the test string.
print(re.findall(r'[a-zA-Z0-9]', test_string))
We can also specify characters not to match:
[^tT]
matches any character except 't' or 'T' (including spaces, periods, etc.)
print(re.findall(r'[^tT]', test_string))
However, we usually don't want to match just single characters. Rather, we want to extract longer matches, for example all words consisting of four letters.
For this, we modify the length requirement with one of the following quantifiers (a short demo follows the list):
[a-z]+
The plus means "one or more," i.e. it will match one or more of whatever is in the square brackets.
[a-z]?
means zero or one
[a-z]*
means zero or more
[a-z]{3}
means exactly 3
[a-z]{3,}
means 3 or more
[a-z]{3,5}
means between 3 and 5
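To see the quantifiers in action on our test string, here is a minimal sketch (note that re.findall returns non-overlapping matches, so 'string' only contributes its first four letters to the {4} pattern):
# {4} grabs exactly four consecutive lower case letters
print(re.findall(r'[a-z]{4}', test_string))   # ['stri', 'risi', 'stan', 'ford']
# {2,} grabs runs of two or more
print(re.findall(r'[a-z]{2,}', test_string))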
To match all "words" made up of numbers, we use [0-9]+
print(re.findall(r'[0-9]+', test_string))
Regexes become powerful as soon as we combine multiple building blocks. As a simple example, we may want to match all words that start with a "T".
T[a-zA-Z]+
Note: single characters do not need to be put into square brackets.
print(re.findall(r'T[a-zA-Z]+', test_string))
We can now also match the email address:
[a-zA-Z]+@[a-zA-Z.]+
print(re.findall(r'[a-zA-Z]+@[a-zA-Z.]+', test_string))
What pattern would we need to extract all of the dates in the following list:
2/18/2016 Date of Presentation
20.2.1975 Who writes the month first, anyway?
2012-10-20 Can we just agree on a formatting?
2003.11.15 No.
Let's use rubular for the task: http://rubular.com/
The Rubular website allows you to develop regexes incrementally. Generally, I don't know the precise regex I need up front and instead just tweak it until it works for a number of test cases.
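Here is one pattern that happens to cover all four formats above (a sketch based on these examples only; the separators and field lengths are inferred from them, and a real date validator would have to be much stricter):
date_lines = '''2/18/2016 Date of Presentation
20.2.1975 Who writes the month first, anyway?
2012-10-20 Can we just agree on a formatting?
2003.11.15 No.'''
# one to four digits, a separator ('/', '.', or '-'), and so on
print(re.findall(r'\b\d{1,4}[./-]\d{1,2}[./-]\d{1,4}\b', date_lines))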
The last important concept is boundaries. So far, we could not extract all words consisting purely of lower case letters:
print(re.findall(r'[a-z]+', test_string))
The regex matches the 'his' part of 'This' because we haven't told it to pay attention to word boundaries. We can do this with a special character:
\b
matches a word boundary: the zero-width position between a word character and a non-word character such as a space, comma, period, or @.
print(re.findall(r'\b[a-z]+\b', test_string))
There are a number of similar special characters (a short demo follows the list):
\s
matches any whitespace character, e.g. \s[a-z]+\s would return the words including the surrounding whitespace.
\b
matches word boundaries but, being zero-width, does not include the space or period in the match.
\w
matches any word character (letter, number, underscore).
.
matches any single character.
^
matches the start of a line, e.g. ^[a-zA-Z]+ matches the first word in the line ('This').
$
matches the end of the line.
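A minimal demo of a few of these on our test string (note that matches cannot overlap, which is why ' a ' is missed below: its leading space was already consumed by ' is '):
# \w+ matches runs of word characters
print(re.findall(r'\w+', test_string))
# ^ anchors at the start of the string (or of each line, with re.MULTILINE)
print(re.findall(r'^[a-zA-Z]+', test_string))   # ['This']
# \s[a-z]+\s keeps the surrounding whitespace in the match
print(re.findall(r'\s[a-z]+\s', test_string))   # [' is ', ' string ']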
One of the quirks of regex is that special characters usually need to be "escaped," meaning they are prefixed with a backslash. To match a literal period, you use \. and to match a backslash, \\. (In flavors where the regex itself is delimited by slashes, as on Rubular, a forward slash also has to be escaped as \/; in Python it does not.)
The rules for when to escape are complicated and I don't fully get them. For example:
.
matches any single character
[.]
matches one single period
\.
matches one single period
As a rule of thumb, it generally makes sense to escape special characters but testing that your regex works is more important.
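A quick sketch of the three variants on our test string (which contains exactly two periods):
print(re.findall(r'.', test_string)[:5])   # any character: ['T', 'h', 'i', 's', ' ']
print(re.findall(r'[.]', test_string))     # literal periods: ['.', '.']
print(re.findall(r'\.', test_string))      # literal periods: ['.', '.']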
The best way to get better with regex is simply practice. You can also pay a visit to Regex Crossword.
One of the most frequent tasks when working with text is splitting the text into individual words. This is also called "tokenization." Once the text is tokenized, we can process it further, for example by counting words.
Here, we'll use a 150 page deposition of a former Philip Morris scientist: https://industrydocuments.library.ucsf.edu/tobacco/docs/#id=kxmp0018
# import the regular expressions package
import re
# import the Counter data type, which we'll use to count word occurrences.
from collections import Counter
from urllib.request import urlopen
# load the document from the web and decode the raw bytes into a string (assuming UTF-8)
document = urlopen('https://s3-us-west-1.amazonaws.com/tobaccodata/deposition.txt').read().decode('utf-8')
# to get an idea of what the document looks like, let's print the first 1,000 characters.
print("Here's what the beginning of the document looks like:")
print(document[0:1000])
# tokenize the text after making it lower case
tokens = re.findall(r'\b[^\s]+\b', document.lower())
print("\nThe first 10 tokens are:")
print(tokens[0:10])
word_counter = Counter(tokens)
print "\nThe 20 most common words in the document are:"
print word_counter.most_common(20)
Text can also be cleaned using regex.
Here, the original text contains line numbers (1 to 25) on every page. We can eliminate those using the re.sub command.
re.sub replaces every match with another string, here just a space.
# re.sub takes 3 parameters: the regex, the replacement string, and the input text.
# note that this removes every one- or two-digit number, not just the line numbers.
document_cleaned = re.sub(r'[0-9]{1,2}', ' ', document)
print("Number of '20' in the original document: {}".format(word_counter['20']))
# re-tokenize the cleaned document so we compare token counts with token counts
cleaned_counter = Counter(re.findall(r'\b[^\s]+\b', document_cleaned.lower()))
print("Number of '20' in the cleaned document: {}".format(cleaned_counter['20']))
We can also remove the page number at the beginning of every page or the URLs at the bottom.
document_cleaned = re.sub(r'# [0-9]{1,3}', ' ', document_cleaned)
document_cleaned = re.sub(r'http://legacy\.library\.ucsf\.edu/tid/gyz75a00/pdf', ' ', document_cleaned)
document_cleaned = re.sub(r'http://industrydocuments\.library\.ucsf\.edu/tobacco/docs/kxmp0018', ' ', document_cleaned)
What you want to clean will always depend on your particular use case. Maybe, as in this case, there are line numbers at the beginning. Maybe there are dots and dashes all over the place. But whatever the case may be, you'll probably be able to construct a regex to eliminate the offending passage.
For more on text cleaning also have a look at http://programminghistorian.org/lessons/cleaning-ocrd-text-with-regular-expressions
Regex can also be used to extract specific text passages. For example, we may be interested in finding all text passages that contain the term 'nicotine.'
\b.{0,75}nicotine.{0,75}\b
looks for the word nicotine and selects up to 75 characters before and after the term itself. It stops at the word boundary that's closest to 75 characters.
# by default '.' does not match newlines; the re.DOTALL flag lets a passage span line breaks
nicotine_passages = re.findall(r'\b.{0,75}nicotine.{0,75}\b', document_cleaned, re.DOTALL)
print("{} passages containing the term nicotine found.".format(len(nicotine_passages)))
print("Here are the first 10 passages:")
for i in range(10):
    print(i)
    print(nicotine_passages[i])