# Regular Expressions for Historians

## What are Regular Expressions?

Regular Expressions (usually called "regex") are a tool for matching text patterns. They can be used for:

- Text extraction: pulling specific bits out of a longer text, for example all email addresses.
- Tokenization: splitting text into words.
- Text cleaning: eliminating OCR errors or stray line breaks.

Regular Expressions are implemented in almost all programming languages, which you can take as a sign that they're immensely useful.
They can be used to find email addresses: `\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b`
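To see that pattern in action, here is a minimal sketch (the sample text and addresses are made up; `re.IGNORECASE` lets the upper-case character classes match lower-case letters too):

```python
import re

# the email pattern from above
pattern = r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b'
text = "Contact alice@example.com or bob.smith@archive.org for copies."
print(re.findall(pattern, text, re.IGNORECASE))
# → ['alice@example.com', 'bob.smith@archive.org']
```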

#### 5. Escaping

One of the quirks of regex is that special characters usually need to be "escaped," meaning they have to be prefixed with a backslash.
So, to match a backslash, use `\\`. (A forward slash, `/`, only needs escaping, as `\/`, in languages like JavaScript where the pattern itself is delimited by slashes.)
The rules for when to escape are complicated and I don't fully get them. For example:
- `.` matches any single character
- `[.]` matches a single period
- `\.` matches a single period

As a rule of thumb, it generally makes sense to escape special characters when in doubt, but testing that your regex actually works is more important.
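A quick way to convince yourself of the difference is to try all three patterns on the same (made-up) string:

```python
import re

text = "ca. 1850"
print(re.findall(r'.', text))    # any character: ['c', 'a', '.', ' ', '1', '8', '5', '0']
print(re.findall(r'\.', text))   # the escaped dot matches only the literal period: ['.']
print(re.findall(r'[.]', text))  # inside a character class, the dot needs no escaping: ['.']
```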

### Congratulations! You now know the basics of Regular Expressions.

The best way to get better with regex is simply practice. You can also pay a visit to Regex Crossword.

## Examples

### Example: Text Tokenization

One of the most frequent tasks when working with text is splitting the text into individual words. This is also called "tokenization." Once the text is tokenized, we can process it further, for example by counting words.

Here, we'll use a 150-page deposition of a former Philip Morris scientist: https://industrydocuments.library.ucsf.edu/tobacco/docs/#id=kxmp0018

In [11]:
```python
# import the regular expressions package
import re
# import the Counter data type, which we'll use to count word occurrences
from collections import Counter
from urllib.request import urlopen

# load the document from the web
# (DOCUMENT_URL is a placeholder: point it at the plain-text version of the deposition)
document = urlopen(DOCUMENT_URL).read().decode('utf-8')

# to get an idea of what the document looks like, let's print the first 1,000 characters
print("Here's what the beginning of the document looks like:")
print(document[0:1000])

# tokenize the text after making it lower case
tokens = re.findall(r'\b[^\s]+\b', document.lower())
print("\nThe first 10 tokens are: ")
print(tokens[0:10])

word_counter = Counter(tokens)
print("\nThe 20 most common words in the document are:")
print(word_counter.most_common(20))
```

```
Here's what the beginning of the document looks like:
# pg 1
00265 1 IN THE UNITED STATES DISTRICT COURT FOR THE DISTRICT OF SOUTH CAROLINA 2 CHARLESTON DIVISION 3 - - - 4 SUZANNE Q. LITTLE, individually ) and as Personal Representative of ) 5 the Estate of SAMUEL MARTIN LITTLE, ) Deceased, ) 6 ) Plaintiff, ) 7 ) vs. ) No. 2-98-1879-23 8 ) BROWN & WILLIAMSON TOBACCO ) 9 CORPORATION, individually and as ) successor by merger to THE AMERICAN ) 10 TOBACCO COMPANY, and R.J. REYNOLDS ) TOBACCO COMPANY, ) 11 ) Defendants. ) 12 - - - - - - - - - - - - - - - - - - - 13 14 VOLUME 2 15 DEPOSITION OF 16 WILLIAM A. FARONE, Ph.D. 17 IRVINE, CALIFORNIA 18 APRIL 27, 2000 19 20 21 ATKINSON-BAKER, INC. COURT REPORTERS 22 330 North Brand Boulevard, Suite 250 Glendale, California 91203 23 (818) 551-7300 24 REPORTED BY: GLENNA J. McNEALY, CSR NO. 9138 25 FILE NO.: 9A026AE 00266 1 IN THE UNITED STATES DISTRICT COURT FOR THE DISTRICT OF SOUTH CAROLINA 2 CHARLESTON DIVISION 3 - - - 4 SUZANNE Q. LITTLE, individually ) and as Personal Representative of ) 5 the Es

The first 10 tokens are:
['pg', '1', '00265', '1', 'in', 'the', 'united', 'states', 'district', 'court']

The 20 most common words in the document are:
[('the', 3869), ('a', 2550), ('that', 2411), ('of', 1752), ('q', 1552), ('you', 1480), ('to', 1368), ('and', 1358), ('is', 1301), ('in', 1252), ('it', 1029), ('i', 915), ('okay', 757), ('correct', 535), ('nicotine', 525), ('was', 516), ('yes', 486), ('this', 467), ('so', 465), ('not', 439)]
```


### Example: Text Cleaning

Text can also be cleaned using regex.
Here, the original text contains line numbers (1 to 25) running down every page. We can eliminate those using the `re.sub` command, which replaces every match with another string, here just a space. Note that the pattern we'll use, `[0-9]{1,2}`, is a blunt instrument: it removes every one- or two-digit number in the document, not only the line numbers.
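As a warm-up, here is a minimal sketch of `re.sub` with a capture group on a single line like the ones in this transcript; `\1` in the replacement string refers back to whatever the parenthesized group matched:

```python
import re

line = "00265 1 IN THE UNITED STATES DISTRICT COURT"
# strip a leading five-digit page id and a line number, keeping the captured text
print(re.sub(r'^\d{5} \d{1,2} (.*)$', r'\1', line))
# → IN THE UNITED STATES DISTRICT COURT
```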

In [12]:
```python
# re.sub takes 3 parameters: the regex, the replacement string, and the input text
document_cleaned = re.sub(r'[0-9]{1,2}', ' ', document)

# re-tokenize the cleaned text so the word counts are comparable
cleaned_counter = Counter(re.findall(r'\b[^\s]+\b', document_cleaned.lower()))
print("Number of '20' in the original document: {}".format(word_counter['20']))
print("Number of '20' in the cleaned document: {}".format(cleaned_counter['20']))
```

```
Number of '20' in the original document: 413
Number of '20' in the cleaned document: 0
```


We can also remove the page number at the beginning of every page, or the URLs at the bottom.

In [13]:
```python
document_cleaned = re.sub(r'# [0-9]{1,3}', ' ', document_cleaned)
# re.escape takes care of escaping the dots and slashes in the URLs for us
document_cleaned = re.sub(re.escape('http://legacy.library.ucsf.edu/tid/gyz75a00/pdf'), ' ', document_cleaned)
document_cleaned = re.sub(re.escape('http://industrydocuments.library.ucsf.edu/tobacco/docs/kxmp0018'), ' ', document_cleaned)
```


What you want to clean will always depend on your particular use case. Maybe, as in this case, there are line numbers at the beginning of each line. Maybe there are dots and dashes all over the place. Whatever the case may be, you'll usually be able to construct a regex that eliminates the offending passage.
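For instance, runs of stray periods and dashes like the ones in this transcript's header could be collapsed with a sketch along these lines (the sample string is adapted from the document; the exact patterns are just one way to do it):

```python
import re

messy = "SUZANNE Q. LITTLE - - - CHARLESTON DIVISION . . . ."
# collapse any run of two or more periods/dashes (with optional spaces) into one space;
# single periods, as in "Q.", are left alone
cleaned = re.sub(r'(?:[.\-]\s*){2,}', ' ', messy)
# then squeeze repeated whitespace and trim the ends
cleaned = re.sub(r'\s{2,}', ' ', cleaned).strip()
print(cleaned)
# → SUZANNE Q. LITTLE CHARLESTON DIVISION
```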
For more on text cleaning, have a look at http://programminghistorian.org/lessons/cleaning-ocrd-text-with-regular-expressions

### Example: Text Extraction

Regex can also be used to extract specific text passages. For example, we may be interested in finding all passages that contain the term "nicotine."
`\b.{0,75}nicotine.{0,75}\b` looks for the word nicotine and selects up to 75 characters of context before and after it, stopping at the word boundary closest to the 75-character limit.

In [14]:
```python
# note: '.' does not match newline characters by default, so each passage
# stays within a single line; add the re.DOTALL flag if you want passages
# to span line breaks
nicotine_passages = re.findall(r'\b.{0,75}nicotine.{0,75}\b', document_cleaned)

print("{} passages containing the term nicotine found.".format(len(nicotine_passages)))
print("Here are the first 10 passages: ")
for i in range(10):
    print(i)
    print(nicotine_passages[i])
```

```
461 passages containing the term nicotine found.
Here are the first 10 passages:
0
May  ,        - Article entitled, "Regional    deposition of inhaled     C-nicotine vapor in the human airway as visualized by position   emission tomography
1
that?   A. Yes.   Q. And it reads, "Our critics have lumped 'tar'   and nicotine together in their allegations about health   hazards, perhaps because 'tar
2
' and nicotine are   generated together in varying proportions when tobacco   is smoked.
3
approach to reducing the amount of 'tar' in   cigarette smoke per unit of nicotine." Do you see that?   A. Yes.   Q. Okay. Now, Dr. Teague is talking about
4
when   he says reducing the amount of tar in cigarette smoke   per unit of nicotine, he's talking about reducing the   tar-to-nicotine ratio, is he not
5
.   Q. Okay. Now, reading on, "that" -- that being   reducing tar and nicotine ratio -- "is probably the most   realistic approach in today's market for
6
can turn to Page  . "If our business is   fundamentally that of supplying nicotine in useful   dosage form, why is it really necessary that allegedly
7
harmful 'tar' accompany that nicotine?" Do you see   that?   A. I do.   Q. So he's saying, would you agree, that
8
the   company should search for a way to supply nicotine but   without the harmful tar, correct?   A. Yes.   Q. Okay. And again,
9
that's the concept of   reducing tar-to-nicotine ratio, correct?   A. Well, he's actually going further, as I see       it
```