Introduction to Web Scraping

Web scraping is useful for grappling with many of the questions studied in political science. It can help us gather large quantities of written material very quickly. My tutorial gives a brief explanation of how to scrape data, using research by Richard Nielsen on the writings of Islamic clerics as the motivating example. A presentation version of this material is available here.

Web scraping might be useful if you’re trying to download many files from a website quickly, store content from a large number of authors to classify their ideology or sentiment, or archive content that might disappear from the web. My goal for this tutorial is to teach someone with no experience in web scraping the skills to carry out a simple project. I’m going to show some techniques for quickly downloading text or files from a long list of identically structured webpages.

Most advanced web scraping uses Python, but I’m going to use a language we already know: R. If you want to learn to use Python, Rebecca Weiss gave a great VAM presentation in 2014. But the example that she gives (scraping political party platforms) could easily be accomplished in R instead. You have better things to do with your time than learn Python if web scraping is not your primary interest, and R is often good enough for what we want to do.

Here’s the material that I will cover in this tutorial.

Non-Latin alphabet setup

Basic encoding issues

Working with non-Latin text brings lots of encoding problems. I illustrate these problems using Arabic text, but the steps are applicable to any other non-Latin script, including Chinese, Japanese, and Cyrillic characters. Most of the time, you just need to specify that the encoding is “UTF-8” when you load or save files, i.e.

readLines(link, encoding="UTF-8")

UTF-8 encoding captures both plain English text and all the other characters in use (non-Latin letters, emoji, etc.). It stands for Unicode Transformation Format; the ‘8’ means it encodes characters in 8-bit blocks (one to four per character). You can check (or change) the declared encoding of a string using the Encoding() function. For PCs, if you are working with foreign text, set your locale to that region; it will make your life much easier.
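
For instance, here is a minimal sketch (the string is just an example) of checking and declaring an encoding with Encoding():

x <- "مرحبا"  #an example Arabic string
Encoding(x)  #check how R has marked the encoding of the string
Encoding(x) <- "UTF-8"  #declare the encoding (this only changes the label, not the bytes)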

Macs

Encoding works very easily on Macs, and they successfully detect the character encoding most of the time. However, I have been unable to fix issues with text direction during plotting, which separates the characters and plots them left to right. This is because there doesn’t seem to be any way to install an Arabic locale on a Mac, unlike on a PC. But I am also relatively new to Macs, so it is possible that there is a workaround somewhere.

Changing your locale (for PCs)

PCs have more difficulty getting the encoding right, but once the encoding is set, plotting is normal.

This command for changing your locale only changes the setting within R and will reset whenever you close R.

default <- Sys.getlocale(category="LC_ALL"); default  #get your current region
Sys.setlocale(category="LC_ALL", locale = "Arabic_Saudi Arabia.1256")  #change for arabic
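
When you’re finished, you can switch back to the locale you saved above:

Sys.setlocale(category="LC_ALL", locale = default)  #reset to the locale saved in 'default'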

There is a useful page for finding the right locale here.

Encoding in other programs and knitr (for PCs)

That’s enough to get R to work, but it’s not enough to get knitr’s output of R commands to work.

To fix this, change your computer’s locale for non-Unicode programs (for me this sets Zotero to Arabic, but I haven’t seen any other consequences). Note: you still have to run the command above even if you make this change.

  1. Control Panel, Region, Administrative, Change System Locale

  2. Choose your language/region

  3. Restart your computer and delete any old knitr caches

More resources

The rest of my tutorial will assume that your non-Latin text is UTF-8. If not, Richard Nielsen covers a ton of different encoding issues here. He also provides an Arabic text stemmer that will be useful for text analysis here. He has an example Arabic Latex file on his resources page too and I’m working on Arabic Beamer.

Review of coding languages

HTML

HTML is pretty straightforward and you don’t have to be an expert to scrape websites. Pages are basically structured like the example below.

<!DOCTYPE html>
    <html>
        <body>

            <h1>This is a heading</h1>

            <p class="notThisOne">This is a paragraph</p>

            <p class="thisOne">But I only want this paragraph</p>

        </body>
    </html>

This creates a simple page that looks like this:

This is a heading

This is a paragraph

But I only want this paragraph

Here’s another (unrelated) example for the more graphically inclined (image source: openbookproject.net).

We can use the class or id property of the section to differentiate the section we want from other sections. In this example, we want the paragraph with class='thisOne' and not class='notThisOne'. But how do we get this content in R?

Regular Expressions

Regular expressions are a language for precisely defining patterns in text. They will help us deal with link paths and would be crucial if you’re trying to find matches in a long piece of text (e.g. the number of references to a particular term).

The wikipedia page gives a good overview: https://en.wikipedia.org/wiki/Regular_expression

Here’s a more thorough tutorial on regular expressions in R: https://stat545-ubc.github.io/block027_regular-expressions.html

Basically any pattern you want has probably already been asked on StackOverflow.

For instance,

regexpr("([^/]+$)", text) #gets the position of the last forward-slash
  • [^/] matches any single characters but not /

  • + repeats the previous match one or more times

  • $ matches the end of the string

Put together, these pieces match everything after the last forward slash.
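
For example, here is a minimal sketch (the link is made up) of using that pattern to pull the file name out of a URL:

link <- "http://www.example.com/files/page1.html"  #a hypothetical link
m <- regexpr("([^/]+$)", link)  #starting position of the text after the last forward slash
substr(link, m, nchar(link))  #returns "page1.html"
regmatches(link, m)  #an equivalent base R shortcut for extracting the match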

Other examples:

regexpr("[^\\]+\\[^\\]+$", text) #gets the position of the second to last back-slash

paste(paste0("\\u",strsplit(gsub("<|>","",x), "U+",fixed=T)[[1]][-1]), collapse="") #converts <U+XXXX> unicode notation into a \u-escaped string

gsub("<[^>]*>", "", x)  #removes HTML tags

Regular expression functions ship with base R (a short sketch follows this list).

  • grep() returns the indices of the elements of x that have matches (grepl() returns a TRUE/FALSE vector instead)

  • gsub() replaces every match in a string with replacement text (sub() replaces only the first match)

  • regexpr() gives the starting position of the first match in a text (gregexpr() gives the positions of all matches)
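
Here’s a quick sketch of those functions on a made-up vector of file names:

files <- c("notes.txt", "report.pdf", "data.csv")  #a made-up example vector
grep("\\.pdf$", files)  #returns 2, the index of the element ending in .pdf
grepl("\\.pdf$", files)  #returns FALSE TRUE FALSE
gsub("\\.csv$", ".txt", files)  #swaps the .csv extension for .txt
regexpr("\\.", files)  #starting position of the first period in each name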

Other hints

  • If you have a function that wants a regexp pattern but you have an exact string you want it to match (i.e. you literally want to match “[^/]+$”), use the fixed=TRUE option.

  • These are useful alongside base functions like substr() and stringr functions like str_extract().

  • paste() concatenates text together, and the collapse="" option is useful for collapsing a vector into a single string. paste0() has sep="" as the default.
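
And a minimal illustration of these hints (the strings are made up):

grepl("[^/]+$", "the pattern [^/]+$ appears here", fixed=TRUE)  #TRUE: matches the literal text rather than the pattern
paste(c("a", "b", "c"), collapse="")  #collapses the vector into the single string "abc"
paste0("file", 1:3, ".txt")  #"file1.txt" "file2.txt" "file3.txt" with no separator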

Review of R functions

Managing files

When you’re downloading thousands of files, you’ll want a good system to keep track of them. Here are a couple of R functions to get you started managing your files.

head(list.files())  #shows you all of the files in your current directory
## [1] "_footer.html" "_header.html" "_old"         "_person.Rmd" 
## [5] "_site.yml"    "about.html"
head(list.files(recursive=TRUE))  #recursive=TRUE shows subfiles, notice the _old folder
## [1] "_footer.html"     "_header.html"     "_old/_site.yml"  
## [4] "_old/about.Rmd"   "_old/Archive.zip" "_old/exec.R"
list.files(pattern = "\\.html$") #you can also add criteria using regexp
##  [1] "_footer.html"     "_header.html"     "about.html"      
##  [4] "collab.html"      "data.html"        "forum.html"      
##  [7] "index.html"       "interactive.html" "mapping.html"    
## [10] "nyt.html"         "pdf.html"         "research.html"   
## [13] "revealjs.html"    "rmarkdown.html"   "scraping.html"   
## [16] "sweave.html"      "teaching.html"

Here’s a short pipeline that returns a table of file types in a directory

library(magrittr)  #load the %>% pipe
list.files(pattern="\\.") %>% #get files with '.'
    lapply(FUN=function(x) #keep the text after the last '.'
        substr(x,gregexpr("([^.]+$)", x)[[1]], nchar(x))) %>% 
    unlist() %>% table() %>% as.matrix()  #unlist it, table it, matrix it
##      [,1]
## css     1
## html   17
## R       1
## Rmd    18
## yml     1

writeLines() saves text but needs the file directory to exist already. If I don’t have a folder called test, this won’t work. Notice that I’m wrapping it in try() so that the code won’t break if this directory doesn’t exist.

setwd(perm)
try(writeLines("test", paste0(getwd(),"/test/test.txt")))

I can create a folder using dir.create(). Now writeLines() will work.

setwd(perm)
dir.create(paste0(getwd(),"/test"), recursive=TRUE)
writeLines("test", paste0(getwd(),"/test/test.txt"))

Now delete this test folder using unlink(). Use paste0() and getwd() to give it an absolute path instead of a relative path.

setwd(perm)
unlink(paste0(getwd(),"/test"), recursive=TRUE)

Shell programs

You can also control a lot of command line programs (shell programs) from within R to carry out more intensive tasks. For example, use the xpdf program and the tm library to load OCRed PDFs into R. There are also a lot of alternative programs for reading PDFs that you might already have installed. Find xpdf here for PC.

Install xpdf using homebrew for Mac. The steps are laid out in my Dealing with PDFs tutorial.

library(tm)  #using the text mining package but it requires an external program
#setwd("C:/Users/Vincent/Documents/R/programs/xpdf") #locate the shell program for PCs
pdffile <- "/Volumes/External HD/Website/rweb2/files/teaching/pdf.pdf"  #save the file location

#OCRed english pdf files
pdf <- readPDF(  #the tm function
    engine="xpdf",  #specify your engine
    control = list(text = "-layout"))(  #try to maintain layout
    elem = list(uri = pdffile),  #specify your pdf
    language = "en")  #suppposedly xpdf can deal with OCRed Arabic text

Here’s the content; it’s just random English text.

head(pdf$content)  #this is a pdf with random English text
## [1] "Advantage old had otherwise sincerity dependent additions. It in adapted natural hastily is"    
## [2] "justice. Six draw you him full not mean evil. Prepare garrets it expense windows shewing do an."
## [3] "She projection advantages resolution son indulgence. Part sure on no long life am at ever. In"  
## [4] "songs above he as drawn to. Gay was outlived peculiar rendered led six."                        
## [5] ""                                                                                               
## [6] "Now eldest new tastes plenty mother called misery get. Longer excuse for county nor except"

You could also split all the pages in this PDF using the pdftk program. Find it here for PC: https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

Or install the server version for Mac: https://www.pdflabs.com/tools/pdftk-server/

Here’s the command you’re going to send to the program.

pdfoutput <- "/Users/vincentbauer/Desktop/test" #specify the output directory

dir.create(pdfoutput)  #create the output directory, otherwise this will not work

#create the call, the "_%02d" appends numbers to the file names
pdfcall <- paste0("pdftk ", pdffile, " burst output ",pdfoutput,"/pdf_%02d.pdf")  
pdfcall  #show the call
## [1] "pdftk /Volumes/External HD/Website/rweb2/files/teaching/pdf.pdf burst output /Users/vincentbauer/Desktop/test/pdf_%02d.pdf"

You need to set the working directory to the program folder (at least for PCs), then run the shell command.

#for PCs it's easiest to change the working directory to the program location
#setwd("C:/Users/Vincent/Documents/R/programs/pdftk")

#shell(pdfcall)  #run the call for PC
system(pdfcall)
## Warning in system(pdfcall): error in running command
list.files(pdfoutput)  #show the files, there are two pages
## character(0)
unlink("/Users/vincentbauer/Desktop/test", recursive=TRUE)  #cleaning up

Rvest

rvest library

The rvest library provides great functions for parsing HTML, and the function we’ll use the most is html_nodes(), which takes a parsed HTML document and a set of criteria (either css or xpath) for the nodes you want.

A generic version of the function looks like this: html_nodes(document, css)

  • document is just an HTML document that has been parsed by read_html()

  • css is the node that we want in the document
    • Put a # before ids
    • Put a . before classes

Then follow this with either html_text() to get the text or xml_attr(x, "href") to get the link.
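
For example, here is a hedged sketch (the page is hypothetical) of pulling every link and its anchor text off a page:

library(rvest)
library(xml2)
page <- read_html("http://www.example.com/clerics.html")  #a hypothetical page
links <- page %>% html_nodes("a") %>% xml_attr("href")  #every href on the page
link_text <- page %>% html_nodes("a") %>% html_text()  #the visible text of each link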

xpath selectors

Most of the time we can just use the css but in some cases we need to use the xpath selector, especially when the node we want doesn’t have a unique css identifier.

We can either:

  • Get all of the elements that match as a list and then select the one we want (use [[i]] to get the vector)

  • Use the xpath selector, which has a more complicated syntax

To use the xpath selector

  • Type two forward-slashes and then the section tag, i.e. "//p".

  • Add further criteria using brackets and an at-sign (@), i.e. "//p[@class='this one']".

  • Your criteria can also include references to preceding sections, such as paragraphs that follow immediately after a heading; separate the sections with a forward-slash, i.e. "//h1/p".

Note: for sections that follow another element, rvest sometimes requires that you specify "//h1/following-sibling::p".

rvest and encodings

rvest does not do a great job with encodings. As far as I can tell, the encoding="UTF-8" option doesn’t do anything at all when reading in html files. But there are a few workarounds.

  • For text, send the output to repair_encoding() before displaying or analyzing

  • For tables, send the table to type_convert() from the readr library before displaying or analyzing. This uses your locale to determine encoding.
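
Here is a hedged sketch of both workarounds on a hypothetical Arabic page (repair_encoding() is from rvest; type_convert() is from readr):

library(rvest)
library(readr)
page <- read_html("http://www.example.com/arabic.html")  #a hypothetical page
text <- page %>% html_nodes("p") %>% html_text() %>% repair_encoding()  #repair the encoding of scraped text
tab <- page %>% html_node("table") %>% html_table() %>% type_convert()  #let readr fix a scraped table using your locale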

The xpathApply() function in the XML library is much better at dealing with encoding, so just use that if you have trouble with rvest, although it’s more complicated to work with. There are extra slides at the end.

This Japanese page gave me a lot of direction for encoding in rvest: http://qiita.com/rmecab/items/9612f55097fa2f6701d8

Example of HTML and rvest

First, parse the link to the html code above with read_html()

library(rvest)
setwd(perm)

document <- read_html(paste0(getwd(), "/files/teaching/example.htm"))

Then, we can get the node we want using html_node() with the "h1" tag and then the html_text() function. You can either wrap the second function around the first, or use magrittr, which lets you pipe the output of one function into the first argument of the next.

html_text(html_node(document, "h1"))
## [1] "This is a heading"
library(magrittr)
html_node(document, "h1") %>% html_text()
## [1] "This is a heading"

To get the text from the second paragraph we could either use its class.

html_node(document, ".thisOne") %>% html_text() #getting the value of the 2nd paragraph
## [1] "But I only want this paragraph"

Or we could use the xpath selector, which is unnecessary here

html_node(document, xpath="//h1/following-sibling::p/following-sibling::p") %>% html_text() 
## [1] "But I only want this paragraph"

Sometimes it can get slightly more complicated but those are the building blocks you need.

Side note: forms

All of my examples cover getting data from a static page, but you can also submit forms using submit_form() from rvest and then parse the response; a small sketch follows. There’s a good tutorial here: http://stat4701.github.io/edav/2015/04/02/rvest_tutorial/
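
Here’s a minimal sketch using rvest’s form functions (the URL and the field name 'query' are made up):

library(rvest)
session <- html_session("http://www.example.com/search")  #start a session on a hypothetical page
form <- html_form(session)[[1]]  #grab the first form on the page
form <- set_values(form, query = "clerics")  #fill in a hypothetical field called 'query'
results <- submit_form(session, form)  #submit the form and get the response back
results %>% html_nodes("p") %>% html_text()  #then parse the response like any other page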

Alternative: wget

The alternative to parsing through the HTML is to use a program to just pull everything down from the website.

wget is a great shell program that you can control from inside R. It works well if you want to copy the entire site, like banners and styling, and it’ll create a mirrored backup on your computer that functions just like the original website.

But wget is really indiscriminate, and you will still have to do a lot of searching through the downloaded material before you can do any text analysis.

Here’s the website if you want to try it out: http://www.gnu.org/software/wget/

This is a great tutorial: http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/

This is how you would run it after installing

link <- "url"
shell(paste("wget -mkEpnp -e robots=off",link))  #on a Mac, use system() instead of shell()

-mkEpnp -e robots=off is a useful set of options,

  • ‘m’ means mirror (create a local version; infinite recursive depth)

  • ‘k’ means convert links (helps with mirroring)

  • ‘E’ means adjust extension (helps with mirroring)

  • ‘p’ means page requisites (helps with mirroring)

  • ‘np’ means no parent (don’t go higher than the initial link)

  • ‘e robots=off’ tells wget to ignore robots.txt, so it will also pull pages that the host has asked robots to stay away from (use this responsibly)

Additional material

xpathApply()

The xpathApply() functions in the XML library are a little more complicated to use than the rvest functions (unnecessarily so) but they do deal with encoding better (avoiding repair_encoding() or type_convert()).

xpathApply() takes a parsed HTML document (produced by htmlTreeParse()) and a set of criteria for which nodes you want. It can return a couple of different properties of that chunk, including the text and links: to get the text use the xmlValue property, and to get the links use xmlGetAttr with 'href'. I’ll give an example below, and it will come up a lot in the tutorial.

Notice that xpathApply() is an apply function, so it is vectorized and it returns lists. A quick reminder for dealing with list elements: [i] returns a list, while [[i]] returns the underlying vector, which is usually what we want.
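
A tiny example of the difference:

x <- list("first", "second")  #a small example list
x[1]  #returns a list of length one (still a list)
x[[1]]  #returns the character vector "first", which is usually what we want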

A generic version of the function looks like this: xpathApply(document, path, properties)

  • document is just an HTML document that has been parsed by htmlTreeParse()

  • path is the node that we want in the document. Just type two forward-slashes and then the letters from the beginning of the section tag, i.e. "//p". Add further criteria using brackets and an at-sign (@), i.e. "//p[@class='this one']". Finally, your criteria can include references to preceding sections, such as paragraphs that follow immediately after a heading; separate the sections with a forward-slash, i.e. "//h1/p".

  • properties specify what output you want: xmlValue gives you the text of the section, and xmlGetAttr, "href" gives you the links.

Example of HTML and xpathApply()

First, parse the html code above with htmlTreeParse

library(XML)  #htmlTreeParse and xpathApply come from the XML library
document <- htmlTreeParse(link, useInternalNodes=TRUE)  #internal nodes are required for xpathApply and are a little faster

Then, we can get the text from the heading using the following code which specifies the heading (“//h1”) and that we want the text (xmlValue)

xpathApply(document, "//h1", xmlValue)  #getting the value of the heading section
## [1] "This is a heading"

To get the text from the second paragraph we could either use its class.

xpathApply(document, "//p[@class='thisOne']", xmlValue)  #getting the value of the paragraph section we want
## [1] "But I only want this paragraph"

Or we could specify its position as the second paragraph following a heading. Obviously this is a worse strategy.

xpathApply(document, "//h1/p/p", xmlValue)  #getting the value of the paragraph section we want
## [1] "But I only want this paragraph"

Sometimes it can get slightly more complicated but those are really all the stepping stones you need.