Introduction

As I discussed in my tutorial on how to extract tables from PDFs, PDFs are notoriously difficult to scrape. It's much easier to scrape a text file, so I recommend converting your PDFs to text files, which I explained how to do in this tutorial, and scraping those. In the following, I will describe how to use re, the Python regular expressions library, to scrape text files.

The files containing all of the code that I use in this tutorial can be found here.

What is a Regular Expression?

A regular expression, or regex, is a sequence of characters that defines a search pattern; you use it to find all strings that match that pattern. Regular expressions are useful for finding phone numbers, email addresses, dates, and any other data that has a consistent format. In the following section, I’ll briefly introduce Python’s regular expression syntax.

Python Regular Expression Syntax

In this section, we’ll look at several of the most commonly used regular expression metacharacters in Python.

If you just want to find a specific word or character, your regular expression will be that word or character. For example, if you want to find the word “regex,” your regular expression would be “regex”.

The “[” and “]” characters are used to specify a character class, which is a set of characters that your regular expression will match. For example, “[abcd]” will match any one of the characters “a,” “b,” “c,” or “d.” Use a dash to indicate a range of characters; for example, “[a-z]” will match any lowercase letter from “a” through “z,” inclusive.
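
To make character classes concrete, here is a minimal sketch. The sample string is made up for illustration, and re.findall() is covered in more detail later in this tutorial.

```python
import re

# Hypothetical sample string, used only for illustration.
sample = "a1b2Cz"

# [abcd] matches any single one of the characters a, b, c, or d.
small_class = re.findall(r"[abcd]", sample)

# [a-z] matches any single lowercase letter from a through z.
lowercase = re.findall(r"[a-z]", sample)

print(small_class)  # ['a', 'b']
print(lowercase)    # ['a', 'b', 'z']
```

Note that the uppercase “C” and the digits are skipped by both classes.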

The “*” and “+” characters are used to indicate repetition of the character or character class immediately to their left. “*” indicates zero or more instances, and “+” indicates one or more. For example, “a*” would match zero or more instances of “a,” and “[A-Z]+” would match one or more instances of any character from “A” through “Z.”
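
A short sketch of the difference between the two operators, assuming a few made-up test strings. I use re.fullmatch() here only because it checks a whole string against the pattern, which makes the zero-versus-one distinction easy to see.

```python
import re

# "a*" accepts zero instances, so it matches the empty string.
star_on_empty = re.fullmatch(r"a*", "") is not None
# "a+" needs at least one "a", so it does not.
plus_on_empty = re.fullmatch(r"a+", "") is not None
# Both accept a run of several "a" characters.
star_on_run = re.fullmatch(r"a*", "aaa") is not None

print(star_on_empty, plus_on_empty, star_on_run)  # True False True

# "+" after a character class repeats the whole class.
upper_runs = re.findall(r"[A-Z]+", "ABC de FGH")
print(upper_runs)  # ['ABC', 'FGH']
```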

The “(” and “)” characters enclose a group: the part of the match that you want to be returned. For example, suppose you want to find a line containing a two-digit number in between variable-length sequences of lowercase letters, and print only the number. You would find the number using the regex “[a-z]*([0-9][0-9])[a-z]*” (translated as “a string with zero or more characters between ‘a’ and ‘z,’ one character between ‘0’ and ‘9,’ one character between ‘0’ and ‘9,’ and zero or more characters between ‘a’ and ‘z’”) and then print it. To learn about more special characters, see this reference.
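
Here is that two-digit example as a small sketch. The input line is a made-up string; the regex is the one from the paragraph above.

```python
import re

# Hypothetical input line, used only for illustration.
line = "abc42xyz"

match = re.search(r"[a-z]*([0-9][0-9])[a-z]*", line)

whole = match.group(0)   # everything the regex matched
number = match.group(1)  # only the part inside the parentheses

print(whole)   # abc42xyz
print(number)  # 42

# With one group in the regex, findall() returns just the grouped text.
captured = re.findall(r"[a-z]*([0-9][0-9])[a-z]*", line)
print(captured)  # ['42']
```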

Regular Expression Methods in Python

Once you’ve written a regular expression that matches the string that you need, you use the methods in re to search for it in your file. The method that I find most useful for scraping is the .findall() method, which accepts a regex and a string as parameters, and returns a list of all substrings in the string that match the regex (or, if the regex contains a group, only the text captured by that group). Another method that I find useful in a lot of cases is .search(), which accepts the same parameters and returns a match object if it finds a substring that matches the regex, or None if it doesn’t. Match objects have a boolean value of True, so you can use them to test whether a match has been found. If you’d like to learn about more re methods, a complete reference can be found here.
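
A minimal sketch of both methods side by side. The text and the phone-number pattern are invented for this example.

```python
import re

# Hypothetical text and pattern, used only for illustration.
text = "Call 555-1234 or 555-9876."
pattern = r"[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]"

# findall() returns every non-overlapping match as a list of strings.
numbers = re.findall(pattern, text)
print(numbers)  # ['555-1234', '555-9876']

# search() returns a match object for the first match only.
first = re.search(pattern, text)
if first:  # a match object is truthy
    print(first.group(0))  # 555-1234

# When nothing matches, search() returns None, which is falsy.
missing = re.search(pattern, "no numbers here")
print(missing)  # None
```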

Example: Finding All Dates In a File

In this example, we’re going to be searching the following text file for dates:

Friday, March 11, 2016
8:35 to 9:35 a.m.
Barona Resort
1932 Wildcat Canyon Road
Lakeside, CA 92040
643765rdddtFriday, March 11, 2016iyfutkdkyrz
ITEM NO.
+1.
The Executive Committee is asked to review and approve the minutes from its
February 12, 2016, meeting.
+3A. Draft Board Business Agenda - March 25, 2016
+3B. Draft Board Policy Agenda --iytdutrs April 8, 2016khnckhnc
+4.

We see that almost all of the dates we need have the format of the following example: “April 12, 2014.” That is, they begin with a sequence of one or more letters, uppercase and lowercase, followed by a space, two digits, a comma, a space, and four digits. (The one exception is “April 8, 2016,” whose single-digit day does not fit this format.) Zero or more characters of any type could be before and after the date. With these observations, we can create a regular expression that will find every line that has a date in this format, and extract only the date:

“.*([a-zA-Z]+\s[0-9][0-9],\s[0-9][0-9][0-9][0-9]).*”

I enclosed the expression that matches a date in parentheses, so that only the date is returned, not the entire line. The “\s” matches any whitespace character, and the “.” is a wildcard that matches any character except a newline. Now, let’s plug this into our Python code by following these steps:

  1. Open the text file
  2. Iterate through every line
  3. Return a list of dates in that line
  4. If the list is not empty, which indicates that a date has been found, print the list

Do these steps like so:

Code:

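import re

# open text file; the with block closes it when we're done
with open('test.txt') as text:
    for line in text:  # iterate through every line
        # return list of dates in that line
        # (the r prefix makes this a raw string, so \s reaches re unchanged)
        x = re.findall(r'.*([a-zA-Z]+\s[0-9][0-9],\s[0-9][0-9][0-9][0-9]).*', line)
        # if a date is found, the list is not empty
        if x:
            print(x)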

Result:

['h 11, 2016']
['h 11, 2016']
['y 12, 2016']
['h 25, 2016']

Whoops, that’s not right! Our regular expression cut off all the letters in the month except for the last one! That’s because the “*” and the other repetition operators are “greedy,” meaning that they try to match as many of the characters that they are attached to as possible. The “*” after our first wildcard character, “.”, matches all of the characters leading up to the last letter in every month, at which point it has to stop matching because the “[a-zA-Z]+” has to match at least one uppercase or lowercase letter. Thus, in the line “+3A. Draft Board Business Agenda - March 25, 2016,” for example, the “.*” part of our regular expression matches “+3A. Draft Board Business Agenda - Marc,” and the “([a-zA-Z]+\s[0-9][0-9],\s[0-9][0-9][0-9][0-9])” part, which is supposed to return our date, matches only “h 25, 2016.”
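
If you’d like to see the greedy behavior for yourself, here is a small sketch. It uses the same regex, except that I’ve wrapped the leading “.*” in its own pair of parentheses (my addition, purely for inspection) so we can look at what it consumed.

```python
import re

line = "+3A. Draft Board Business Agenda - March 25, 2016"

# Group 1 is the leading ".*"; group 2 is the date expression.
greedy = re.search(
    r"(.*)([a-zA-Z]+\s[0-9][0-9],\s[0-9][0-9][0-9][0-9]).*", line
)

consumed = greedy.group(1)   # what the greedy ".*" swallowed
date_part = greedy.group(2)  # what was left for the date expression

print(repr(consumed))   # '+3A. Draft Board Business Agenda - Marc'
print(repr(date_part))  # 'h 25, 2016'
```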

To solve this problem, we must put the “?” character after our first “*”; the “?” character switches the operator it attaches to into “non-greedy” mode, telling it to match as few characters as possible. After inserting the “?,” our regular expression looks like this:

“.*?([a-zA-Z]+\s[0-9][0-9],\s[0-9][0-9][0-9][0-9]).*”

Plugging this new regular expression into our program and rerunning it, we get the following result:

['March 11, 2016']
['March 11, 2016']
['February 12, 2016']
['March 25, 2016']

This is what we want, so we’re done!
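
One caveat worth checking: the sample file also contains “April 8, 2016,” and because its day has only one digit, the two-digit pattern skips it. Below is a sketch of one possible variant, assuming you also want single-digit days; it swaps “[0-9][0-9]” for “[0-9]{1,2}” (one or two digits). The variable names and the inlined lines are mine, copied from the sample file so the example runs on its own.

```python
import re

# Lines copied from the sample file so the sketch is self-contained.
sample_lines = [
    "Friday, March 11, 2016",
    "643765rdddtFriday, March 11, 2016iyfutkdkyrz",
    "February 12, 2016, meeting.",
    "+3A. Draft Board Business Agenda - March 25, 2016",
    "+3B. Draft Board Policy Agenda --iytdutrs April 8, 2016khnckhnc",
]

# The non-greedy regex from this tutorial: the day must be two digits.
two_digit_day = r".*?([a-zA-Z]+\s[0-9][0-9],\s[0-9][0-9][0-9][0-9]).*"
# Assumed variant: {1,2} accepts a one- or two-digit day.
flexible_day = r".*?([a-zA-Z]+\s[0-9]{1,2},\s[0-9][0-9][0-9][0-9]).*"

strict_dates = [d for line in sample_lines
                for d in re.findall(two_digit_day, line)]
flexible_dates = [d for line in sample_lines
                  for d in re.findall(flexible_day, line)]

print(len(strict_dates))    # 4
print(flexible_dates[-1])   # April 8, 2016
```
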
Now that you’re familiar with regular expressions, have fun applying them to your own scraping!
