Regular Expressions for Historians

What are Regular Expressions?

Regular Expressions (usually called "regex") are a tool for matching text patterns. They can be used for:

  • Text Extraction: Extracting specific bits from a longer text, for example all email addresses.
  • Tokenization: Splitting text into words.
  • Text Cleaning: Eliminating OCR errors or line breaks.

Regular Expressions are implemented in almost all programming languages, which you can take as a sign that they're immensely useful.
They can be used to find email addresses: (\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b)
Or to validate password strength: (^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{8,}$)
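To see the email pattern in action, here is a quick sketch (the string and addresses below are made up); the re.IGNORECASE flag lets the upper case character classes also match lower case letters:

import re

# a made-up string with two hypothetical addresses
text = "Contact info@example.com or sales@example.org for details."
# the email pattern from above, applied case-insensitively
print(re.findall(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b', text, re.IGNORECASE))
['info@example.com', 'sales@example.org']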

At this point, their main downside should become obvious: They're utterly unreadable.

Building Up Complexity

How did anyone ever create the monster regexes above? The answer lies in the fact that when writing regexes, you usually start out with a specific task and then add more and more complexity to deal with special cases. We'll work through some examples below.

First Steps

Regexes themselves are made up of building blocks.
For a good, quick reference, check out http://rubular.com/
For the following, we'll work with the string "This is a TEST string 123. risi@stanford.edu"

In [1]:
# import the regular expressions package
import re
test_string = "This is a TEST string 123. risi@stanford.edu"

1. What to Match? Character Groups

[a-z] matches any one lower case letter.

In [2]:
# re.findall returns a list of all regex matches. 
print(re.findall(r'[a-z]', test_string))
['h', 'i', 's', 'i', 's', 'a', 's', 't', 'r', 'i', 'n', 'g', 'r', 'i', 's', 'i', 's', 't', 'a', 'n', 'f', 'o', 'r', 'd', 'e', 'd', 'u']

[A-Z] matches any upper case letter.

In [3]:
print(re.findall(r'[A-Z]', test_string))
['T', 'T', 'E', 'S', 'T']

These can be combined.
[a-zA-Z0-9#] matches all lower and upper case letters as well as all digits and, for good measure, also '#', even though it does not appear in the test string.

In [4]:
print(re.findall(r'[a-zA-Z0-9#]', test_string))
['T', 'h', 'i', 's', 'i', 's', 'a', 'T', 'E', 'S', 'T', 's', 't', 'r', 'i', 'n', 'g', '1', '2', '3', 'r', 'i', 's', 'i', 's', 't', 'a', 'n', 'f', 'o', 'r', 'd', 'e', 'd', 'u']

We can also specify characters not to match:
[^tT] matches any character except 't' and 'T' (including spaces, periods, etc.)

In [5]:
print(re.findall(r'[^tT]', test_string))
['h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 'E', 'S', ' ', 's', 'r', 'i', 'n', 'g', ' ', '1', '2', '3', '.', ' ', 'r', 'i', 's', 'i', '@', 's', 'a', 'n', 'f', 'o', 'r', 'd', '.', 'e', 'd', 'u']

2. How Many? Specifying Length

However, we may not want to match just single characters. Rather, we usually want to extract longer stretches of text, for example all words consisting of four letters.
For this, we need to specify how many characters to match, using one of the following quantifiers:

[a-z]+ The plus means "one or more," i.e. it will match one or more of whatever is in the square brackets.
[a-z]? means zero or one
[a-z]* means zero or more
[a-z]{3} means exactly 3
[a-z]{3,} means 3 or more
[a-z]{3,5} means between 3 and 5

To match all "words" made up of numbers, we use [0-9]+

In [6]:
print(re.findall(r'[0-9]+', test_string))
['123']
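The bounded quantifiers work the same way. As a quick sketch, [a-z]{3,} pulls out every run of three or more lower case letters (note how it happily grabs the 'his' inside 'This'; more on that under Boundaries below):

print(re.findall(r'[a-z]{3,}', test_string))
['his', 'string', 'risi', 'stanford', 'edu']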

3. Multiple Groups

Regex becomes powerful as soon as we combine multiple groups. As a simple example, we may want to match all words that start with a "T".
T[a-zA-Z]+ Note: single characters do not need to be put into square brackets.

In [7]:
print(re.findall(r'T[a-zA-Z]+', test_string))
['This', 'TEST']

We can now also match the email address:
[a-zA-Z]+@[a-zA-Z.]+

In [8]:
print(re.findall(r'[a-zA-Z]+@[a-zA-Z.]+', test_string))
['risi@stanford.edu']

Example: Matching Dates

What pattern would we need to extract all of the dates in the following list:
2/18/2016 Date of Presentation
20.2.1975 Who puts the month first, anyway?
2012-10-20 Can we just agree on a formatting?
2003.11.15 No.

Let's use rubular for the task: http://rubular.com/
The rubular website allows you to develop regexes incrementally. Generally, I don't know the precise regex I need ahead of time and instead just tweak it until it works for a number of test cases.
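Here is one pattern that ends up covering all four formats above (a sketch, not a general date parser; it would also happily accept nonsense like 99/99/9999):

dates_text = """2/18/2016 Date of Presentation
20.2.1975 Who puts the month first, anyway?
2012-10-20 Can we just agree on a formatting?
2003.11.15 No."""

# one to four digits, a separator (., / or -), one or two digits, another separator, one to four digits
print(re.findall(r'\b[0-9]{1,4}[./-][0-9]{1,2}[./-][0-9]{1,4}\b', dates_text))
['2/18/2016', '20.2.1975', '2012-10-20', '2003.11.15']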

4. Boundaries

The last important concept is boundaries. So far, we could not extract all words consisting purely of lower case letters:

In [9]:
print(re.findall(r'[a-z]+', test_string))
['his', 'is', 'a', 'string', 'risi', 'stanford', 'edu']

Regex matches the 'his' part of 'This' because we haven't told it to pay attention to word boundaries. We can do this with a special character:
\b matches a word boundary: the (zero-width) position between a word character and a non-word character like a space, comma, period, or @.

In [10]:
print(re.findall(r'\b[a-z]+\b', test_string))
['is', 'a', 'string', 'risi', 'stanford', 'edu']

There are a number of similar special characters:
\s matches any whitespace character. e.g. \s[a-z]+\s would return the words including the surrounding whitespace.
\b matches word boundaries; since it is zero-width, it does not return the space or period itself.
\w matches any word character (letter, number, underscore).
. matches any single character.
^ matches the start of the string (or of each line, with the re.MULTILINE flag). e.g. ^[a-zA-Z]+ matches the first word in the line (This).
$ matches the end of the string (or line).
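To make these concrete, here are two of them applied to our test string:

print(re.findall(r'\w+', test_string))
['This', 'is', 'a', 'TEST', 'string', '123', 'risi', 'stanford', 'edu']
print(re.findall(r'^[a-zA-Z]+', test_string))
['This']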

5. Escaping

One of the quirks of regex is that special characters usually need to be "escaped." This means that they need to be prefixed with a backslash.
So, to match a forward slash, you can use \/ (in Python the forward slash needs no escaping, but escaping it does no harm). To match a backslash, use \\ .
The rules for when to escape are complicated and I don't fully get them. For example:
. matches any single character
[.] matches one single period
\. matches one single period

As a rule of thumb, it generally makes sense to escape special characters, but testing that your regex works is more important.
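A quick check of the three variants against our test string:

print(re.findall(r'\.', test_string))
['.', '.']
print(re.findall(r'[.]', test_string))
['.', '.']
# the unescaped dot matches every single character in the string
print(len(re.findall(r'.', test_string)))
44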

Congratulations! You now know the basics of Regular Expressions.

The best way to get better with regex is simply practice. You can also pay a visit to Regex Crossword.

Examples

Example: Text Tokenization

One of the most frequent tasks when working with text is splitting the text into individual words. This is also called "tokenization." Once the text is tokenized, we can process it further, for example by counting words.

Here, we'll use a 150-page deposition of a former Philip Morris scientist: https://industrydocuments.library.ucsf.edu/tobacco/docs/#id=kxmp0018

In [11]:
# import the regular expressions package
import re
# import the Counter data type, which we'll use to count word occurrences.
from collections import Counter
# urlopen lets us load the document from the web
from urllib.request import urlopen

# load the document and decode the raw bytes to a string (assuming utf-8)
document = urlopen('https://s3-us-west-1.amazonaws.com/tobaccodata/deposition.txt').read().decode('utf-8')

# to get an idea of what the document looks like, let's print the first 1000 characters.
print("Here's what the beginning of the document looks like:")
print(document[0:1000])

# tokenize the text after making it lower case
tokens = re.findall(r'\b[^\s]+\b', document.lower())
print("\nThe first 10 tokens are: ")
print(tokens[0:10])

word_counter = Counter(tokens)
print("\nThe 20 most common words in the document are:")
print(word_counter.most_common(20))
Here's what the beginning of the document looks like:
# pg 1
00265 1 IN THE UNITED STATES DISTRICT COURT FOR THE DISTRICT OF SOUTH CAROLINA 2 CHARLESTON DIVISION 3 - - - 4 SUZANNE Q. LITTLE, individually ) and as Personal Representative of ) 5 the Estate of SAMUEL MARTIN LITTLE, ) Deceased, ) 6 ) Plaintiff, ) 7 ) vs. ) No. 2-98-1879-23 8 ) BROWN & WILLIAMSON TOBACCO ) 9 CORPORATION, individually and as ) successor by merger to THE AMERICAN ) 10 TOBACCO COMPANY, and R.J. REYNOLDS ) TOBACCO COMPANY, ) 11 ) Defendants. ) 12 - - - - - - - - - - - - - - - - - - - 13 14 VOLUME 2 15 DEPOSITION OF 16 WILLIAM A. FARONE, Ph.D. 17 IRVINE, CALIFORNIA 18 APRIL 27, 2000 19 20 21 ATKINSON-BAKER, INC. COURT REPORTERS 22 330 North Brand Boulevard, Suite 250 Glendale, California 91203 23 (818) 551-7300 24 REPORTED BY: GLENNA J. McNEALY, CSR NO. 9138 25 FILE NO.: 9A026AE 00266 1 IN THE UNITED STATES DISTRICT COURT FOR THE DISTRICT OF SOUTH CAROLINA 2 CHARLESTON DIVISION 3 - - - 4 SUZANNE Q. LITTLE, individually ) and as Personal Representative of ) 5 the Es

The first 10 tokens are:
['pg', '1', '00265', '1', 'in', 'the', 'united', 'states', 'district', 'court']

The 20 most common words in the document are:
[('the', 3869), ('a', 2550), ('that', 2411), ('of', 1752), ('q', 1552), ('you', 1480), ('to', 1368), ('and', 1358), ('is', 1301), ('in', 1252), ('it', 1029), ('i', 915), ('okay', 757), ('correct', 535), ('nicotine', 525), ('was', 516), ('yes', 486), ('this', 467), ('so', 465), ('not', 439)]

Example: Text Cleaning

Text can also be cleaned using regex.
Here, the original text contains transcript line numbers (1 to 25) on every page. We can eliminate those using the re.sub command.
re.sub replaces each match with another string, here just a space. Note that the simple pattern below removes all one- and two-digit numbers, not just the line numbers.

In [12]:
# re.sub takes 3 parameters: the regex, the replacement string, and the input text.
document_cleaned = re.sub(r'[0-9]{1,2}', ' ', document)

# re-tokenize the cleaned text so we can compare word counts
cleaned_counter = Counter(re.findall(r'\b[^\s]+\b', document_cleaned.lower()))
print("Number of '20' in the original document: {}".format(word_counter['20']))
print("Number of '20' in the cleaned document: {}".format(cleaned_counter['20']))
Number of '20' in the original document: 413
Number of '20' in the cleaned document: 0

We can also remove the page numbers at the beginning of every page and the urls at the bottom.

In [13]:
document_cleaned = re.sub(r'# [0-9]{1,3}', ' ', document_cleaned)
# the urls contain regex special characters like '.', so we let re.escape escape them for us
document_cleaned = re.sub(re.escape('http://legacy.library.ucsf.edu/tid/gyz75a00/pdf'), ' ', document_cleaned)
document_cleaned = re.sub(re.escape('http://industrydocuments.library.ucsf.edu/tobacco/docs/kxmp0018'), ' ', document_cleaned)

What you want to clean will always depend on your particular use case. Maybe, as in this case, there are line numbers everywhere. Maybe there are dots and dashes all over the place. But whatever the case may be, you'll probably be able to construct a regex to eliminate the offending passages.
For more on text cleaning, also have a look at http://programminghistorian.org/lessons/cleaning-ocrd-text-with-regular-expressions

Example: Text Extraction

Regex can also be used to extract specific text passages. For example, we may be interested in finding all text passages that contain the term 'nicotine.'
\b.{0,75}nicotine.{0,75}\b looks for the word nicotine and grabs up to 75 characters of context before and after the term itself, stopping at the word boundary closest to the 75-character limit.

In [14]:
# note: '.' does not match newline characters, so each passage stays within one line;
# add the re.DOTALL flag if you want passages to span line breaks
nicotine_passages = re.findall(r'\b.{0,75}nicotine.{0,75}\b', document_cleaned)

print("{} passages containing the term nicotine found.".format(len(nicotine_passages)))
print("Here are the first 10 passages: ")
for i in range(10):
    print(i)
    print(nicotine_passages[i])
461 passages containing the term nicotine found.
Here are the first 10 passages:
0
May  ,        - Article entitled, "Regional    deposition of inhaled     C-nicotine vapor in the human airway as visualized by position   emission tomography
1
 that?   A. Yes.   Q. And it reads, "Our critics have lumped 'tar'   and nicotine together in their allegations about health   hazards, perhaps because 'tar
2
' and nicotine are   generated together in varying proportions when tobacco   is smoked.
3
 approach to reducing the amount of 'tar' in   cigarette smoke per unit of nicotine." Do you see that?   A. Yes.   Q. Okay. Now, Dr. Teague is talking about
4
when   he says reducing the amount of tar in cigarette smoke   per unit of nicotine, he's talking about reducing the   tar-to-nicotine ratio, is he not
5
.   Q. Okay. Now, reading on, "that" -- that being   reducing tar and nicotine ratio -- "is probably the most   realistic approach in today's market for
6
 can turn to Page  . "If our business is   fundamentally that of supplying nicotine in useful   dosage form, why is it really necessary that allegedly
7
harmful 'tar' accompany that nicotine?" Do you see   that?   A. I do.   Q. So he's saying, would you agree, that
8
 the   company should search for a way to supply nicotine but   without the harmful tar, correct?   A. Yes.   Q. Okay. And again,
9
that's the concept of   reducing tar-to-nicotine ratio, correct?   A. Well, he's actually going further, as I see       it