Introduction to Web Scraping in R

Vincent Bauer

Very Applied Methods Workshop
Department of Political Science, Stanford University

April 1st, 2016

These slides are posted on my Stanford page if you want to follow along: http://stanford.edu/~vbauer/VAMScrapingSlides.html

Hint: press the letter 'o' to get an overview.

I also have a full page tutorial here that has extra code annotations and comments: http://stanford.edu/~vbauer/VAMScrapingTutorial.html

If you have any more suggestions for this presentation I’d be happy to incorporate them.

Presentation Overview

Web scraping

  • Web scraping is the use of software to extract information from websites

  • Includes both supervised and unsupervised methods

  • Turns the internet into a source of potential data for many different research projects

When web scraping might be useful

  • Downloading many files from a website quickly

  • Storing content from a large number of authors to classify their ideology or sentiment

  • Archiving content that might disappear from the web

  • etc

Objective

My goal for this presentation is to teach someone with no experience in web scraping the skills to carry out a simple project by the end of the VAM.

I’m going to show some techniques for quickly downloading text or files from a long list of identically structured webpages.

What I’m assuming

  • Intermediate knowledge of R

  • Basic knowledge of HTML/CSS

  • No knowledge of python

  • You know what websites/material you want to scrape already

What we’ll cover today

  • Dealing with non-Latin alphabets and character encoding

  • Lightning fast overview of HTML and Regular Expressions

  • Functions for parsing HTML and saving particular elements

  • Using R and shell programs to manage files

  • Supervised methods to scrape text and files

  • Intro to Twitter and Facebook APIs

What I’m not covering

  • ‘Big data’ web scraping/spiders (i.e. unsupervised methods)

  • What to do with your material after you save it (i.e. text analysis)

  • More advanced uses of Twitter and Facebook APIs

Why R and not Python?

Most advanced web scraping uses Python, but I'm going to use a language we already know: R.

If you want to learn to use python, Rebecca Weiss gave a great VAM presentation in 2014: https://github.com/rjweiss/VAM-Python.

But the example that she gives (scraping political party platforms) could easily be accomplished in R instead.

Pros of R for web scraping

  1. Learning Python is hard, and there are better things to learn if text is not your primary interest

  2. For transparency, your co-authors and replicators may not know python even if you do

  3. Some advantages to keeping all of your analysis in the same language

Cons of R for web scraping

  1. Python is faster

  2. Many specifically designed python tools for web scraping

Short story: R is often good enough for what we want to do.

Non-Latin alphabet setup

Basic encoding issues

Working with non-Latin text brings lots of encoding problems. I illustrate these problems using Arabic text, but the steps apply to any other non-Latin script, including Chinese, Japanese, and Cyrillic.

  • Most of the time, you just need to specify that the encoding is “UTF-8” when you load or save files, i.e. readLines(link, encoding="UTF-8").

  • UTF-8 encoding captures both plain English text and all other characters in use (non-Latin letters, emoji, etc.)

  • UTF stands for Unicode Transformation Format. The ‘8’ means it uses 8-bit blocks (one to four per character) to represent text.

  • You can check (or change) the encoding using the Encoding() function. There are a few other encoding types too.

  • If you're on a PC working with foreign text, set your locale to that region; it'll make your life much easier. (A short encoding example follows below.)
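
Here's a minimal sketch of that basic pattern (the file names are just placeholders):

lines <- readLines("arabic_page.html", encoding="UTF-8")  #declare the encoding when reading
Encoding(lines[1])  #check what R thinks the encoding of the first line is
writeLines(lines, "arabic_copy.html", useBytes=TRUE)  #useBytes avoids re-encoding on write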

Changing your locale (for PCs)

This command for changing your locale only changes the setting within R and will reset whenever you close R.

default <- Sys.getlocale(category="LC_ALL"); default  #get your current region
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Sys.setlocale(category="LC_ALL", locale = "Arabic_Saudi Arabia.1256")  #change for arabic
## [1] "LC_COLLATE=Arabic_Saudi Arabia.1256;LC_CTYPE=Arabic_Saudi Arabia.1256;LC_MONETARY=Arabic_Saudi Arabia.1256;LC_NUMERIC=C;LC_TIME=Arabic_Saudi Arabia.1256"

Here’s a useful page for finding the right locale: https://docs.moodle.org/dev/Table_of_locales

Encoding in other programs and knitr (for PCs)

That's enough to get R to work, but it's not enough to get knitr's output of R commands to work.

  • To fix this, change your computer's locale for non-Unicode programs (for me this sets Zotero to Arabic but I haven't seen any other consequences). Note: you still have to run the command above even if you make this change.
  1. Control Panel, Region, Administrative, Change System Locale

  2. Choose your language/region

  3. Restart, delete old knitr caches etc

More resources

The rest of my tutorial will assume that your non-Latin text is UTF-8.

If not, Richard Nielsen covers a ton of different encoding issues here: http://www.mit.edu/~rnielsen/Working%20with%20Unicode%20in%20R.txt.

He also provides an Arabic text stemmer that will be useful for text analysis next month here: http://www.mit.edu/~rnielsen/helpful.htm

He has an example Arabic LaTeX file on his resources page too, and I'm working on an Arabic Beamer template that I'd be happy to share if anyone is interested.

I would also be happy to go over encoding in more depth later.

Quick review of coding languages and functions

HTML

HTML is pretty straightforward and you don’t have to be an expert to scrape websites. Pages are basically structured like the example below.

<!DOCTYPE html>
<html>
<body>

<h1>This is a heading</h1>

<p class="notThisOne">This is a paragraph</p>

<p class="thisOne">But I only want this paragraph</p>

</body>
</html>

Which creates a beautiful page that looks like this

Notice that content is always part of little sections, such as the header, which is everything between <h1> and </h1>.


Here’s another version for the more graphically inclined

Credit: openbookproject.net

We can use the class or id property of the section to differentiate the section we want from other sections.

In this example, we want the paragraph with class='thisOne' and not class='notThisOne'.


But how do we get this content in R?

rvest library

The rvest library provides great functions for parsing HTML. The function we'll use the most is called html_nodes(), which takes a parsed HTML document and a set of criteria for which nodes you want (either css or xpath).

A generic version of the function looks like this: html_nodes(document, css)

  • document is just a html document that has been parsed by read_html()

  • css is the node that we want in the document
    • Put a # before ids
    • Put a . before classes

Then follow this with either html_text() to get the text or xml_attr(x, "href") to get the link.

xpath selectors

Most of the time we can just use the css but in some cases we need to use the xpath selector, especially when the node we want doesn’t have a unique css identifier.

We can either:

  • Get all of the elements that match as a list and then select the one we want (use [[i]] to get the vector)

  • Use the xpath selector, which has a more complicated syntax

To use the xpath selector

  • Type two forward-slashes and then the section tag, i.e. "//p".

  • Add further criteria using brackets and an at sign (@), i.e. "//p[@class='thisOne']".

  • Your criteria can also include references to preceding sections, such as paragraphs that follow immediately after a heading; separate sections with a forward slash, i.e. "//h1/p".

Note: rvest sometimes requires that you specify "//h1/following-sibling::p" for preceding sections.

rvest and encodings

rvest does not do a great job with encodings. As far as I can tell the encoding="UTF-8" option doesn't do anything at all when reading in HTML files. But there are a few workarounds.

  • For text, send the output to repair_encoding() before displaying or analyzing

  • For tables, send the table to type_convert() from the readr library before displaying or analyzing. This uses your locale to determine encoding.

The xpathApply() function in the XML library is much better at dealing with encoding, so just use that if you have trouble with rvest, although it's more complicated to work with. There are extra slides at the end.

This Japanese page gave me a lot of direction for encoding in rvest: http://qiita.com/rmecab/items/9612f55097fa2f6701d8

Example of HTML and rvest

First, parse the link to the html code above with read_html()

library(rvest)
document <- read_html("http://stanford.edu/~vbauer/example.html") 

Then, we can get the node we want using html_node() with the "h1" tag and then the html_text() function. You can either wrap the second function around the first or use magrittr's pipe (%>%), which pushes the output of one function into the first parameter of the next.

html_text(html_node(document, "h1"))
## [1] "This is a heading"
library(magrittr)
html_node(document, "h1") %>% html_text()
## [1] "This is a heading"

To get the text from the second paragraph we could either use its class.

html_node(document, ".thisOne") %>% html_text() #getting the value of the 2nd paragraph
## [1] "But I only want this paragraph"

Or we could use the xpath selector, which is unnecessary here but works the same way.

html_node(document, xpath="//h1/following-sibling::p/following-sibling::p") %>% html_text() 
## [1] "But I only want this paragraph"

Sometimes it can get slightly more complicated but those are the building blocks you need.

Side note: forms

All of my examples will cover getting data from a static page but you can also submit forms using submit_form() from rvest and then parse the response.

There’s a good tutorial here: http://stat4701.github.io/edav/2015/04/02/rvest_tutorial/
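
Here's a rough sketch of what that might look like, assuming a page with a single search form that has a text field named 'q' (the URL and field name are made up for illustration):

library(rvest)
session <- html_session("http://example.com/search")  #start a browsing session
form <- html_form(session)[[1]]  #grab the first form on the page
form <- set_values(form, q = "fatwa")  #fill in the (hypothetical) 'q' field
result <- submit_form(session, form)  #submit the form and get the response back
html_text(html_nodes(result, "p"))  #parse the response like any other page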

Regular Expressions

Regular expressions are a language for precisely defining patterns in text. They will help us deal with link paths and would be crucial if you're trying to find matches in a long piece of text (i.e. the number of references to a particular term).

The wikipedia page gives a good overview: https://en.wikipedia.org/wiki/Regular_expression

Basically any pattern you want has probably already been asked on StackOverflow.

For instance,

regexpr("([^/]+$)", text)  #gets the start position of the text after the last forward slash
  • [^/] matches any single character except /

  • + repeats the previous match one or more times

  • $ matches the end of the string

Put together, this matches anything after the last forward slash.

Other examples:

regexpr("[^\\]+\\[^\\]+$", text) #gets the position of the second to last back-slash

paste(paste0("\\u",strsplit(gsub("<|>","",x), "U+",fixed=T)[[1]][-1]), collapse="") #converts a <U+xxxx>-escaped string into \u escape codes

Regular expression functions ship with base R.

  • grep() returns the indices of the elements of x that contain a match (grepl() returns TRUE/FALSE instead)

  • gsub() replaces all matches in a string with predefined text

  • regexpr() gives the starting position of the first match in a text

Other hints

  • If you have a function that wants a regexp pattern but you have an exact string you want it to match (i.e. you literally want “[^/]+$”), use the fixed=TRUE option.

  • Useful alongside base R's substr() and the stringr library functions like str_extract().

  • paste() concatenates text together, and the collapse="" option is useful for collapsing a vector into a single string. paste0() has sep="" as the default. (A short demonstration follows below.)
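
Here's a quick demonstration of these functions on one of the links we'll use later:

link <- "http://islamtoday.net/istesharat/schcv-3240.htm"
regexpr("([^/]+$)", link)  #start position of the file name after the last slash
gsub("([^/]+$)", "REPLACE", link)  #swap out everything after the last slash
grep(".htm", link, fixed=TRUE)  #fixed=TRUE treats ".htm" as a literal string, not a pattern
paste0("response", 1:3, ".txt")  #paste0 builds file names with sep="" by default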

Managing files

When you’re downloading thousands of files, you’ll want a good system to keep track of them.

Here are a couple of R functions to get you started managing your files.

head(list.files())  #shows you all of the files in your current directory
## [1] "custom.css"            "example code.R"        "img"                  
## [4] "pdf.pdf"               "response1.txt"         "ScrapingTutorial.html"
head(list.files(recursive=TRUE))  #recursive=TRUE shows subfiles, notice the img folder
## [1] "custom.css"          "example code.R"      "img/example.png"    
## [4] "img/pageInspect.png" "img/pageLevel2.png"  "img/pageLevel3.png"
 list.files(pattern = "\\.txt$") #you can also add criteria using regexp
## [1] "response1.txt" "table.txt"

Here's a short pipeline that returns a table of file types in a directory

#with magrittr
list.files(pattern="\\.") %>% #get files with '.'
    lapply(FUN=function(x) #keep the extension (the text after the last '.')
        substr(x,gregexpr("([^.]+$)", x)[[1]], nchar(x))) %>% 
    unlist() %>% table() %>% as.matrix()  #unlist it, table it, matrix it
##      [,1]
## css     1
## html    2
## md      1
## pdf     1
## R       1
## Rmd     2
## txt     2
#normal way,
as.matrix(table(  #make a table
    unlist(lapply(  #lapply to the list.file function
        list.files(pattern="\\."), FUN=function(x)  #only files, not directories (needs to have ".")
            substr(x,gregexpr("([^.]+$)", x)[[1]], nchar(x))))))  #return everything after the last "."

writeLines() saves text but needs the file directory to exist already. If I don’t have a folder called test, this won’t work. Notice that I’m wrapping it in try() so that the code won’t break if this directory doesn’t exist.

try(writeLines("test", paste0(getwd(),"/test/test.txt")))
## Warning in file(con, "w"): cannot open file 'C:/Users/Vincent/SkyDrive/
## Desktop/Graduate School/1.Lectures/R Scraping/test/test.txt': No such file
## or directory

I can create a folder using dir.create(). Now writeLines() will work.

dir.create(paste0(getwd(),"/test"), recursive=TRUE)
writeLines("test", paste0(getwd(),"/test/test.txt"))

Now delete this test folder using unlink. Use paste0 and getwd() to give it an absolute path instead of a relative path.

unlink(paste0(getwd(),"/test"), recursive=TRUE)

Shell programs

You can also control a lot of command line programs (shell programs) from within R to carry out more intensive tasks.

For example, use the xpdf program and the tm library to load OCRed PDFs into R.

library(tm)  #using the text mining package but it requires an external program
setwd("C:/Users/Vincent/SkyDrive/Documents/R/programs/xpdf") #locate the shell program
pdffile <- "C:/Users/Vincent/SkyDrive/Desktop/pdf.pdf"  #save the file location

#OCRed english pdf files
pdf <- readPDF(  #the tm function
    engine="xpdf",  #specify your engine
    control = list(text = "-layout"))(  #the -layout option keeps the original text layout
    elem = list(uri = pdffile),  #specify your pdf
    language = "en")  #suppposedly xpdf can deal with OCRed Arabic text if you install some supplementary files

Find xpdf here: http://www.foolabs.com/xpdf/download.html. There are also a lot of alternative programs.

Here's the content; it's just random English text.

head(pdf$content)  #this is a pdf with random English text
## [1] "Game of as rest time eyes with of this it. Add was music merry any truth since going. Happiness she ham"  
## [2] "but instantly put departure propriety. She amiable all without say spirits shy clothes morning. Frankness"
## [3] "in extensive to belonging improving so certainty. Resolution devonshire pianoforte assistance an he"      
## [4] "particular middletons is of. Explain ten man uncivil engaged conduct. Am likewise betrayed as declared"   
## [5] "absolute do. Taste oh spoke about no solid of hills up shade. Occasion so bachelor humoured striking by"  
## [6] "attended doubtful be it."

You could also split all the pages in this PDF using the pdftk program. Find it here: https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

This program is really finicky. You can't use it with folders that have spaces in their names, and you can't wrap directory names in quotes.

pdfoutput <- "C:/Users/Vincent/SkyDrive/Desktop/test" #specify the output directory

dir.create(pdfoutput)  #create the output directory

#create the call, the "_%02d" appends numbers to the file names
pdfcall <- paste0("pdftk ", pdffile, " burst output ",pdfoutput,"/pdf_%02d.pdf")  
pdfcall  #show the call
## [1] "pdftk C:/Users/Vincent/SkyDrive/Desktop/pdf.pdf burst output C:/Users/Vincent/SkyDrive/Desktop/test/pdf_%02d.pdf"

You need to set the working directory to the program folder (at least for PCs), then run the shell command.

setwd("C:/Users/Vincent/SkyDrive/Documents/R/programs/pdftk")

shell(pdfcall)  #run the call
list.files(pdfoutput)  #show the files, there are two pages
## [1] "pdf_01.pdf" "pdf_02.pdf"
unlink("C:/Users/Vincent/SkyDrive/Desktop/test", recursive=TRUE)  #cleaning up

Alternative: wget

The alternative to parsing through the HTML is to use a program to just pull everything down from the website.

wget is a great shell program that you can control from inside R. It works well if you want to copy the entire site, like banners and styling, and it’ll create a mirrored backup on your computer that functions just like the original website.

But wget is really indiscriminate and requires a lot of searching through your local material to do any text analysis anyway.

Here’s the website if you want to try it out: http://www.gnu.org/software/wget/

This is a great tutorial: http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/

This is how you would run it after installing

shell(paste("wget -mkEpnp -e robots=off",link))

-mkEpnp -e robots=off is a useful set of options,

  • ‘m’ means mirror (create a local version; infinite recursive depth)

  • ‘k’ means convert links (helps with mirroring)

  • ‘E’ means adjust extension (helps with mirroring)

  • ‘p’ means page requisites (helps with mirroring)

  • ‘np’ means no parent (don’t go higher than the initial link)

  • ‘e robots=off’ tells wget to ignore the site's robots.txt, so it will also fetch pages the host asks robots to stay off (a concrete call is sketched below)
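
To make it concrete, here's what the call might look like (the link is just an example; as with the other shell programs, you may need to set the working directory to wget's folder first, and on Mac/Linux use system() instead of shell()):

link <- "http://islamtoday.net/istesharat/"
shell(paste("wget -mkEpnp -e robots=off", link))  #mirrors the site into a folder named after the host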

Web scraping in Political Science

For example, Richard Nielsen at MIT studies why some clerics in the Islamic world adopt jihadist ideologies while others hold more moderate beliefs.

  • To carry out this research, he must classify clerics as holding either jihadist or moderate beliefs.

  • To accomplish this goal, he scrapes the writings of a large number of clerics from their personal websites, forum pages, Facebook pages, and Twitter feeds.

  • Then he compares the content of these writings against a training sample of clerics with known ideology.

My tutorial will give a brief explanation of how to scrape data using this case as the motivating example.

More specifically, I will cover

  1. Downloading biographical information and fatwas from a cleric’s page

  2. Downloading tweets by Khamenei and the #MashaAllah hashtag

  3. Downloading the most recent posts on a cleric’s facebook page

Example 1. Scraping cleric pages

Getting started

Suppose we have a list of a few hundred Muslim clerics on a particular forum and we want to download their biographical information (CV) and their formal legal opinions on religious matters (think Ask Amy; also called fatwas, rulings, opinions, and consultations).

Each cleric only has one CV but each has somewhere between ten and a hundred fatwas, so we’re talking probably 10,000 individual pages, far too many to do by hand.

We’re interested in two things

  1. Their CV, listed on this first page as part of a table.

  2. Their fatwas, listed on subsequent pages by following the
    “عرض الإستشارات”/“View Consultations” link.

Here’s an example of one of the webpages we want to scrape: http://islamtoday.net/istesharat/schcv-3240.htm.

Here’s the translated page

We’re going to have to dig a few levels deep into the website to get to the fatwas. wget would not perform this task well because these linked pages are not technically subpages in the website’s directory (they look the same as links heading back to the home page etc)

Before we do anything, we want to learn how this webpage is put together.

  • In Chrome, right click on the CV area and hit ‘Inspect’.

  • Move around on the side bar until you find the piece that highlights the CV but nothing more. Notice that this is a table and is given the id ‘table1’.

  • Right click on the link to the fatwas and hit ‘Inspect’. Notice that this part is given the id ‘ctl00_ContentPlaceHolder1_…’, we will use this to pull it down.

  • Also notice that the link to the next page is not in a subdirectory of this page, so wget won’t work well.

  • We could also look through the html by right clicking anywhere on the page and selecting “View Page Source”.

Download the CV

First we want to download the CV table on the initial page.

We could save all of the text on this first page but we actually just want the CV table and not any of the other words on the page. Also, it would be great if this CV was formatted as a table that we could work with.

The html_table() function is perfect for this job.

  • Load in your link(s). In the real world you would probably be loading in a list of links and then iterating this process over all of them.
link.list <- list("http://islamtoday.net/istesharat/schcv-3240.htm")

#for(i in 1:length(link.list)){
i = 1; link <- link.list[[i]]  #store the link for this particular iteration

  • Use the read_html() function to parse the links, then select the right table node with html_nodes(), finally convert this table into a dataframe with html_table().
library(rvest)  #for html parsing

doc.html <- read_html(link)  #there's no point to specifying the encoding
html_nodes(doc.html,"table") #check the names of all the tables on this page
## {xml_nodeset (2)}
## [1] <table class="table borderless" cellspacing="0" cellpadding="0">&#13 ...
## [2] <table id="table1" dir="rtl" style="BORDER-TOP-WIDTH: 0px; BORDER-LE ...
table <- html_table(html_nodes(doc.html,"table")[[2]])

library(readr)  #for type_convert

head(type_convert(table))
##                       X1                                           X2
## 1 اللقب العلمي والوظيفة: مدرس متفرغ بأكاديمية المقطم للعلوم الحديثة .
## 2          جامعة التخرج:     جامعة الأقصى غزة وجامعية عين شمس القاهرة
## 3           كلية التخرج:             كلية التربية قسم العلوم الفلسفية
## 4          التخصص العام:                                    صحة نفسية
## 5          لتخصص الدقيق:                                 غير العاديين
## 6          مكان الميلاد:                                 الإسكندرية .
  • But the website isn’t quite set up as a normal table (there are a few long merged rows) so cut and paste those rows.
#library(knitr)  #kable function

#the data in row 9 column 1 should actually be row 8 column 2
table[8,2] <- table[9,1]  #switch
table <- table[-9,]  #drop 

#kable(table)

  • Finally, save the table as a text file in the working directory, in the real world we’d set up a folder structure to organize these files.

  • Notice that I’m using the fileEncoding="UTF-8" option and saving as a text file.

setwd("C:/Users/Vincent/SkyDrive/Desktop/Graduate School/1.Lectures/R Scraping")

#save as txt because we want to be able to analyze it easily later
write.table(table, "table.txt", sep=",", quote=TRUE, fileEncoding="UTF-8")

Downloading Fatwas

Next, we’re going to have to dig a few levels into the website to get the fatwas that this particular cleric has written.

These fatwas are advice that the cleric has given on a particular issue raised by someone with a question. Think Ask Amy.

In the example that I’m going to show, a young man is asking whether he can marry a foreign Muslim girl he met over the internet and the cleric is saying that that’s haram.

Once we figure out the steps for any given page it will work for any other cleric and any other ruling as well.

These are the steps we’re going to follow again

The First Page

First we need to programmatically find the link to the second page on the first page.

This is the page source for the first page, showing the id for the link from the first page to the second page.

<tr>
    <td class="active">
     
     <h5 class="nopadding">المشاركات</h5>
     </td>
</tr>
<tr>
    <td class="FatwaLeftSideCell" style="text-align:right">
    <a id="ctl00_ContentPlaceHolder1_SchList1_Repeater1_ctl00_HyperLink1" href="schques-70-3240-1.htm">عرض الإستشارات</a>
    </td>
</tr>
<tr>
    <td colspan="2" class="active">
    <h5 class="nopadding"> السيرة الذاتية</h5>
     </td>
</tr>

Use the html_nodes function to get the link to the second page using the id that we just identified

follow0 <- html_nodes(doc.html,css="#ctl00_ContentPlaceHolder1_SchList1_Repeater1_ctl00_HyperLink1")

Get links with the xml_attr function

follow <- xml_attr(follow0, "href")
follow
## [1] "schques-70-3240-1.htm"

We could also get the text or other properties if we wanted. Here we use the repair_encoding() function from rvest so that the Arabic displays correctly.

repair_encoding(html_text(follow0)); rm(follow0)
## Best guess: UTF-8 (100% confident)
## [1] "عرض الإستشارات"

  • But the link it returns is relative to the base page, and we need to create an absolute link. Replace the end of the base link with our new link.
follow1 <- gsub("([^/]+$)", follow, link)  #replace after the last slash with follow link

Here’s some more code to show how this works step by step

link #remember our original link
## [1] "http://islamtoday.net/istesharat/schcv-3240.htm"
gsub("([^/]+$)", "REPLACE", link)  #replace everything after the last / with "REPLACE"
## [1] "http://islamtoday.net/istesharat/REPLACE"
gsub("([^/]+$)", follow, link) #now instead of "REPLACE" use the relative link 
## [1] "http://islamtoday.net/istesharat/schques-70-3240-1.htm"

The Second Page

Now follow this new link and get the html of the second page. This brings us to a page with links to a number of fatwas, now get a list of these links and iterate through them.

  • Parse the html for this second page, and then get a list of all of the links on this second page that lead to the third page.
#level 2

doc.html <- read_html(follow1)
follow2 <- html_nodes(doc.html, ".QuesListTitleTxt")
follow2 <- xml_attr(follow2, "href")
follow2
## [1] "/istesharat/quesshow-70-174239.htm"
## [2] "/istesharat/quesshow-70-174199.htm"
## [3] "/istesharat/quesshow-70-174113.htm"
## [4] "/istesharat/quesshow-70-150445.htm"

  • Now change these relative paths to absolute paths.

  • Notice that this time we’re getting a list of links instead of a single link so I’m using lapply() to run the gsub() function on all of the links.

follow2 <- lapply(follow2, FUN= function(x) gsub("(/[^/]+/[^/]+$)", x, link)) #regex 2nd to last forward slash
unlist(follow2)
## [1] "http://islamtoday.net/istesharat/quesshow-70-174239.htm"
## [2] "http://islamtoday.net/istesharat/quesshow-70-174199.htm"
## [3] "http://islamtoday.net/istesharat/quesshow-70-174113.htm"
## [4] "http://islamtoday.net/istesharat/quesshow-70-150445.htm"

The Third Page

Now, follow the links to the third page with the actual fatwas.

The table with the fatwas is called ‘table table-condensed’ but we want only the cleric’s words and not the people asking for advice. So only pull down the row that has his answer (the second table row with the class ‘article’).

  • Parse the third page and get the second row of the table with the ‘article’ class.
#for(i in 1:length(follow2)){
i <- 1; follow3 <- unlist(follow2[i])  #get a specific link
doc.html <- read_html(follow3)  #follow the link

response <- html_nodes(doc.html, ".article")[[2]]  #get the second row
response <- html_text(response)  #get the text

print(substr(repair_encoding(response),1,500))  #show the text
## Best guess: UTF-8 (100% confident)
## [1] "\r\n                        \r\n                            الحمد لله والصلاة والسلام على رسول الله، وبعد:\r\nأخي الكريم أنت تعرف أن حديثك مع هذه الفتاه يعتبر حراما، وسوف أوضح لك لماذا يعتبر هذا الحديث حراما: إن الحديث بينكما صورة من صور الخلوة بين الرجل والمرأة الأجنبية، وهناك التحذير الشديد \"ما خلا رجل بامرأة أجنبية إلا وكان الشيطان ثالثهما\".\r\n- هل هذه التي تحادثها هل هي أختك أو زوجتك أو إحدى المحرمات عليك، أو هي فتاة أجنبية تعرفت عليها من خلال الشبكة العنكبوتية، بلا شك أنها أجنبية عنك يبدأ الحديث ف"

Finally, write this text to a text file in our working directory. In the real world we’d give it a more identifiable name like the cleric and the topic. The useBytes=TRUE option is necessary for the encoding to work right.

writeLines(response, paste0("response", i, ".txt"), useBytes=TRUE) 

Downloading files

You can also download files from a page using download.file().

Here’s a page from which I want to download a pdf.

Here’s how we would download the pdf

  • Parse the page, isolate the link, run download.file()
#an example from another website where we want to download a pdf
link <- "http://ar.islamway.net/book/20143/%D8%A7%D9%84%D8%AA%D8%B9%D9%84%D9%8A%D9%82-%D8%A7%D9%84%D8%B3%D9%86%D9%8A-%D8%B9%D9%84%D9%89-%D8%B5%D8%AD%D9%8A%D8%AD-%D9%85%D8%B3%D9%84%D9%85-%D8%A8%D8%B4%D8%B1%D8%AD-%D8%A7%D9%84%D9%86%D9%88%D9%88%D9%8A?ref=s-pop"

doc.html <- read_html(link)

#notice that the class is actually part of the table row before the <a> element
#for some reason the following-sibling is not necessary here
pdflink <- xml_attr(html_node(doc.html, xpath="//td[@class='download-cell']/a"), "href")

#mode="wb" was key to making this work
download.file(url=pdflink, destfile=paste0(getwd(), "/pdf.pdf"), method="internal", mode="wb")  

Summary

That seems like a lot of steps but now we’d be able to run this loop over a long list of cleric pages and get very specific content from all of them.

Go get a coffee and let your computer do all the work.
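
To make that concrete, here's a rough skeleton of the full loop, just stitching the chunks above together (file names are illustrative, and the CV row fix and error handling are omitted):

for(i in 1:length(link.list)){
    link <- link.list[[i]]
    doc.html <- read_html(link)
    table <- html_table(html_nodes(doc.html, "table")[[2]])  #the CV table
    write.table(table, paste0("cv", i, ".txt"), sep=",", quote=TRUE, fileEncoding="UTF-8")
    follow <- xml_attr(html_nodes(doc.html, "#ctl00_ContentPlaceHolder1_SchList1_Repeater1_ctl00_HyperLink1"), "href")
    follow2 <- xml_attr(html_nodes(read_html(gsub("([^/]+$)", follow, link)), ".QuesListTitleTxt"), "href")
    follow2 <- lapply(follow2, FUN=function(x) gsub("(/[^/]+/[^/]+$)", x, link))
    for(j in 1:length(follow2)){  #follow each fatwa link on the second page
        response <- html_text(html_nodes(read_html(follow2[[j]]), ".article")[[2]])
        writeLines(response, paste0("response", i, "_", j, ".txt"), useBytes=TRUE)
    }
}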

Example 2. Twitter

Twitter has an API that is very easy to use, but you only have access to the most recent tweets and a limited number of calls; see some of the limitations here: https://dev.twitter.com/rest/public/rate-limiting.

I'm going to cover how to:

  1. Set up the twitteR functions

  2. Scrape the most recent tweets from a profile

  3. Search for tweets matching a hashtag

Setting it up

Setting up the authentication is the hardest part, after that getting some text is very easy. If you allow it to cache the authentication you won’t have to follow these steps again.

Here’s my page

  1. Set up a Twitter account (or use an existing account)

  2. Create an app here: https://apps.twitter.com/. Put anything under ‘Website’ (I used google), and http://127.0.0.1:1410 as the Callback URL.

  3. Get your API key and API Secret from the “Keys and Access Tokens” page, they should be the first two options. Save these as key and secret in R.

  4. Load the twitteR library and run the setup_twitter_oauth() function.

  5. It will ask you whether you want to cache your access credentials, say Yes.

  6. It will pop up a page to authorize your computer; sign in.

key <- key  #I hid mine from you
secret <- secret  #I hid mine from you

library(twitteR)
setup_twitter_oauth(key,secret)
## [1] "Using browser based authentication"

userTimeline

Now let's find some tweets.

Suppose we want to get the information from a list of profiles. Let's see what Khamenei, the Supreme Leader of Iran, has been saying recently.

The command userTimeline scrapes a ton of information from their page, and we can tell it to exclude retweets and replies.

page <- list("Khamenei_ir")
#for(i in 1:length(page)){
i <-1 ; currentpage <- page[[i]]
tweets <- userTimeline(currentpage, n=100, includeRts= FALSE, excludeReplies = TRUE) 

twListToDF turns these tweets into a dataframe.

tweets <- twListToDF(tweets)

head(tweets[,c("text")])  #here are the most recent tweets
## [1] "Since its birth, Islamic Republic has grown in 37 years despite heavy military &amp; propaganda campaigns, sanctions, etc. #IslamicRepublicDay"  
## [2] "I congratulate the Eid of birthday anniversary of Prophet’s daughter Hazrat Fatima Zahra. https://t.co/e7lamONmxC"                               
## [3] "When we deal on paper but sanctions remain and there’s no trade means something is wrong. 2/2"                                                   
## [4] "In talks we must be strong and negotiate in a way not to be deceived.1/2"                                                                        
## [5] "Islamic Republic must use all means. I support political talks in global issues, but not with everyone. Today is era of both missile &amp; talk."
## [6] "Enemy is working against belief in #Islam, belief in efficiency of Islamic Establishment &amp; belief in possibility of durability of it."

searchTwitter

Alternatively, we might want to find anyone who is feeling particularly blessed using the #MashaAllah hashtag. Include multiple queries using the + separator.

We can extract a lot of information, including the text itself.

tweets <- searchTwitter('#MashaAllah', n=100)

txt = sapply(tweets, function(x) x$getText())
head(txt)
## [1] "My nephew Noah <ed><U+00A0><U+00BD><ed><U+00B2><U+0095> stayin neutral #batmanvssuperman #fam #cutebabies #MashaAllah #supernoah https://t.co/yHc31AOZZg"
## [2] "#MashaAllah Respect <ed><U+00AE><U+00BA><ed><U+00BC><U+0099> https://t.co/Vy2POsmUPQ"                                             
## [3] "MashaAllah bol kar agey share karein ..\n#MashaAllah https://t.co/frIHhGaWFt"                                                     
## [4] "#tbt 25 years ago today <ed><U+00A0><U+00BD><ed><U+00B8><U+008D><ed><U+00A0><U+00BD><ed><U+00B8><U+008D><U+2764><U+FE0F><U+2764><U+FE0F> #loveatfirstsight #MashaAllah \n\nTam… https://t.co/IAee8FK51v"
## [5] "#tb #2k11 #LoveMyNiece  #MaShaAllah https://t.co/plJCBQs8bJ"                                                                      
## [6] "u only know what is good but He knows whats best for u #mashaAllah"

Check what other information we can get from these tweets with ?status
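
For instance, here are a few accessor methods that should be available on each status object (check ?status for the full list; these names are from my reading of the twitteR docs):

sapply(tweets[1:3], function(x) x$getScreenName())  #who posted each tweet
sapply(tweets[1:3], function(x) x$getCreated())  #when it was posted (sapply drops the date class; use lapply to keep it)
sapply(tweets[1:3], function(x) x$getRetweetCount())  #how many times it was retweeted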

Map retweets

We can also look for patterns in the retweets and make an extraneous figure.

Search the text for string patterns indicating a retweet

rt = grep("(RT|via)((?:\\b\\W*@\\w+)+)", txt, ignore.case=TRUE)
rt  #these are retweets
##  [1]  8 13 15 24 28 29 30 31 34 36 39 40 41 43 44 47 48 49 50 51 52 53 54
## [24] 56 57 58 59 60 61 62 63 64 65 66 68 69 70 71 72 75 88 89 90 92 96 98

library(stringr)

#https://sites.google.com/site/miningtwitter/questions/user-tweets/who-retweet
# create list to store user names
who_retweet = as.list(1:length(rt)); who_post = as.list(1:length(rt))

for (i in 1:length(rt)) {
  twit = tweets[[rt[i]]] # get tweet with retweet entity
   poster = str_extract_all(twit$getText(),"(RT|via)((?:\\b\\W*@\\w+)+)") # get retweet source 
   poster = gsub(":", "", unlist(poster)) #remove ':'
   who_post[[i]] = gsub("(RT @|via @)", "", poster, ignore.case=TRUE)   # name of retweeted user
   who_retweet[[i]] = rep(twit$getScreenName(), length(poster))  # name of retweeting user 
}

who_post = unlist(who_post); who_retweet = unlist(who_retweet) #unlist

Options for setting up the graph; the styling code is hidden.

library(igraph)

# two column matrix of edges
retweeter_poster = cbind(who_retweet, who_post)

# generate graph
rt_graph = graph.edgelist(retweeter_poster)

# get vertex names
ver_labs = get.vertex.attribute(rt_graph, "name", index=V(rt_graph))

# choose some layout
glay = layout.fruchterman.reingold(rt_graph)
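
The plotting call itself is hidden in the slides; a bare-bones version might look something like this (the styling options are omitted):

plot(rt_graph, layout=glay, vertex.label=ver_labs, vertex.size=5, edge.arrow.size=0.3)  #basic retweet network plot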

Example 3. Facebook

You can also scrape posts from public figure profiles on Facebook using the Rfacebook library. You cannot scrape personal pages.

The set up is pretty similar to Twitter but there are fewer limitations on the data you get access to. Over the summer I was able to pull down at least 10,000 posts per profile, but that may not be true any more.

You can get either short term or long term tokens

To get a short term token

  1. Go to https://developers.facebook.com/tools/explorer/

  2. Click on “Get Access Token”

  3. Copy and paste as token

token1 <- token1  #I hid mine from you

To get a long term token

  1. Go to https://developers.facebook.com/tools/explorer/

  2. Click Register on the top right

  3. Accept their conditions but skip the app set up

  4. Copy the App ID and App Secret

  5. Concatenate them with a “|” in between as your token

appid <- appid  #I hid mine from you
appsecret <- appsecret #I hid mine from you
token2 <- paste0(appid, "|", appsecret)

The API limits us to a set amount of data. For very popular figures like Obama with many comments this means very few posts; while setting up the tutorial I was only able to get 37 posts from Obama.

For less popular figures we can get at more posts. I was able to get several hundred posts for a cleric.

Note: if your person's profile name is not in Latin characters, the URL will not copy well and will include a number at the end, like this one:

https://www.facebook.com/%D8%A7%D9%84%D8%B4%D9%8A%D8%AE-%D9%85%D8%AD%D9%85%D8%AF-%D8%AE%D9%8A%D8%B1-%D8%A7%D9%84%D8%B7%D8%B1%D8%B4%D8%A7%D9%86-199724386738517/

Just use the number at the end “199724386738517” as your page name.

Load the library, save a link, and pull a fixed number of posts.

library(Rfacebook)

profiles <- list("199724386738517")

page <- getPage(profiles[[1]], token2, n=50)
## 50 posts

Here are all the different kinds of information we can get from this data

str(page, width=90, strict.width="cut")
## 'data.frame':    50 obs. of  10 variables:
##  $ from_id       : chr  "199724386738517" "199724386738517" "199724386738517" "19972438"..
##  $ from_name     : chr  "الشيخ محمد خير الطرشان" "الشيخ محمد خير الطرشان" "الشيخ محمد خ"..
##  $ message       : chr  "ا<U+FEF9>قبال على الله \n\nقيل لمعروف الكرخي: كيف اصطلحت مع رب"..
##  $ created_time  : chr  "2016-03-31T15:34:05+0000" "2016-03-18T08:06:21+0000" "2016-03-"..
##  $ type          : chr  "photo" "photo" "photo" "photo" ...
##  $ link          : chr  "https://www.facebook.com/199724386738517/photos/a.202410079803"..
##  $ id            : chr  "199724386738517_1192213477489598" "199724386738517_11772698356"..
##  $ likes_count   : num  104 160 223 285 197 186 138 334 263 208 ...
##  $ comments_count: num  7 11 34 17 17 6 8 39 16 14 ...
##  $ shares_count  : num  6 8 17 37 4 7 2 6 17 10 ...

Let's get the counts of the most common words. Use str_extract_all() to extract text by pattern. For English, use "[A-Za-z]+"; other languages are more complicated, but you can copy the Arabic pattern from below.

#here's a good tutorial
#https://stat545-ubc.github.io/block027_regular-expressions.html
library(stringr)  #str_extract_all

words <- str_extract_all(page$message, pattern="[\u0600-\u06ff]+|[\u0750-\u077f]+|[\ufb50-\ufbc1]+|[\ufbd3-\ufd3f]+|[\ufd50-\ufd8f]+|[\ufd92-\ufdc7]+|[\ufe70-\ufefc]+|[\uFDF0-\uFDFD]+")
#I got the arabic unicode pattern from here, it mostly works but could be a little better
#http://stackoverflow.com/questions/11323596/regular-expression-for-arabic-language

repair_encoding(words[[1]][1:30], from="UTF-8")
## Warning in stringi::stri_conv(x, from = from): the Unicode codepoint
## \U0000fef9 cannot be converted to destination encoding
##  [1] "ا"        "\032"     "قبال"     "على"      "الله"     "قيل"     
##  [7] "لمعروف"   "الكرخي"   "كيف"      "اصطلحت"   "مع"       "ربك؟"    
## [13] "فقال"     "بقبولي"   "موعظة"    "ابن"      "السماك"   "رحمه"    
## [19] "الله"     "قيل"      "له"       "وكيف؟"    "قال"      "كنت"     
## [25] "ماراً"    "بالكوفة،" "فدخلت"    "مسجداً"   "أبتغي"    "صلاة"

Now turn that list of words into a table with counts and a word cloud.

library(plyr)  #arrange function
library(dplyr)  #top_n
library(wordcloud)

word_counts <- unlist(words) %>% table %>% data.frame  #unlist the words and make a freq table
names(word_counts) <- c("word", "count")  #add column names
word_counts %>% arrange(desc(count)) %>% top_n(5) %>% type_convert() #order the table
##   word count
## 1 الله    92
## 2   في    63
## 3   من    48
## 4 عليه    31
## 5    ،    28
## 6  قال    28

Now let's make an extraneous word cloud; it's actually kind of pretty.
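
The code for the cloud is hidden in the slides, but a minimal version might look something like this (colors and fonts omitted; the Arabic text may need the encoding fixes above to display correctly):

library(wordcloud)
top_words <- word_counts %>% arrange(desc(count)) %>% top_n(50)  #keep the 50 most common words
wordcloud(words=as.character(top_words$word), freq=top_words$count, min.freq=2, random.order=FALSE)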


Thanks for listening!


Any questions?

Additional slides

xpathApply()

The xpathApply() functions in the XML library are a little more complicated to use than the rvest functions (unnecessarily so) but they do deal with encoding better (avoiding repair_encoding() or type_convert()).

xpathApply() takes a parsed HTML document (produced by htmlTreeParse()) and a set of criteria for which nodes you want. It can return a couple of different properties of that chunk, including the text and links. To get the text use the xmlValue property; to get the links use xmlGetAttr with 'href'. I'll give an example below, and it will come up a lot in the tutorial.

Notice that xpathApply is an apply function, so it is vectorized and returns lists. Quick reminder for dealing with list elements: [i] returns a list, [[i]] returns a vector, which is usually what we want.

A generic version of the function looks like this: xpathApply(document, path, properties)

  • document is just a html document that has been parsed by htmlTreeParse()

  • path is the node that we want in the document. Just type two forward slashes and then the letters from the beginning of the section tag, i.e. "//p". Add further criteria using brackets and an at sign (@), i.e. "//p[@class='thisOne']". Finally, your criteria can include references to preceding sections, such as paragraphs that follow immediately after a heading; separate sections with a forward slash, i.e. "//h1/p".

  • properties specify what output you want. xmlValue gives you the text of the section; xmlGetAttr with "href" gives you the links.

Example of HTML and xpathApply()

First, parse the html code above with htmlTreeParse

document <- htmlTreeParse(link, useInternalNodes=TRUE)  #useInternalNodes is needed for xpathApply to work

Then, we can get the text from the heading using the following code which specifies the heading (“//h1”) and that we want the text (xmlValue)

xpathApply(document, "//h1", xmlValue)  #getting the value of the heading section
## [1] "This is a heading"

To get the text from the second paragraph we could either use its class.

xpathApply(document, "//p[@class='thisOne']", xmlValue)  #getting the value of the paragraph section we want
## [1] "But I only want this paragraph"

Or we could specify its position as the second paragraph following a heading. Obviously this is a worse strategy.

xpathApply(document, "//h1/p/p", xmlValue)  #getting the value of the paragraph section we want
## [1] "But I only want this paragraph"

Sometimes it can get slightly more complicated but those are really all the stepping stones you need.