Scraping forum pages

One application of web scraping is to write an algorithm that loops through a website and extracts particular pieces of information. For example, Rich Nielsen studies the ideology of Islamic clerics and has compiled a dataset of their writing on various topics posted to internet forums.

Islamic Clerics

Suppose we have a list of a few hundred of these clerics and their webpages on a particular forum, and we want to download their biographical information and their formal legal opinions (fatwas) on religious matters, which are all contained in links off their main pages. Each cleric has only one CV, but each also has between ten and a hundred fatwas, far too many web pages to extract by hand.

Here’s an example of one of the webpages we want to scrape: http://islamtoday.net/istesharat/schcv-3240.htm.

Here’s the translated page.

We’re interested in two things:

  1. Their CV, listed on this first page as part of a table.

  2. Their fatwas, listed on subsequent pages by following the
    “عرض الإستشارات”/“View Advice” link.

We’re going to have to dig a few levels deep into the website to get to the fatwas. An alternative approach would be to download everything on this website using wget, as I describe in the Introduction to Web Scraping tutorial, but it would not perform this task well because the fatwas are actually organized in a different directory of the website and would not be associated with their authors. These forums can also be very large. Here are the steps we’ll take through the website.

Inspecting web pages

Before we do anything, we want to learn how this webpage is put together.

  • In Chrome, right-click on the CV area and hit ‘Inspect’.

  • Move around in the sidebar until you find the element that highlights the CV and nothing more. Notice that this is a table and is given the id ‘table1’.

  • Right-click on the link to the fatwas and hit ‘Inspect’. Notice that this part is given the id ‘ctl00_ContentPlaceHolder1_…’.

  • Also notice that the link to the next page is not in a subdirectory of this page, so wget won’t work well.

We could also look through the HTML by right-clicking anywhere on the page and selecting “View Page Source”.
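We can also look at the raw HTML from inside R once a page has been parsed; a minimal sketch, using the link from the next section:

library(rvest)  #read_html() is re-exported from the xml2 package

doc.html <- read_html("http://islamtoday.net/istesharat/schcv-3240.htm")
substr(as.character(doc.html), 1, 1000)  #print the first 1000 characters of the page source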

Download the CV

First we want to download the CV table on the initial page. We could save all of the text on this first page but we actually just want the CV table and not any of the other words on the page. Also, it would be great if this CV was formatted as a table that we could work with. The html_table() function is perfect for this job.

Load in your link(s). In the real world you would be loading in a list of links and then iterating this process over all of them.

link.list <- c("http://islamtoday.net/istesharat/schcv-3240.htm")

#for(i in 1:length(link.list)){  #in the real scraper we'd loop over every cleric link
i <- 1; link <- link.list[i]  #store the link for this particular iteration
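In practice the list of links would live in a file rather than being typed out. A minimal sketch, assuming a hypothetical plain-text file cleric-links.txt with one URL per line:

link.list <- readLines("cleric-links.txt")  #hypothetical file: one cleric URL per line
length(link.list)  #how many clerics we're about to scrape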

Use the read_html() function to parse the link, then select the right table node with html_nodes(), and finally convert that table into a data frame with html_table().

library(rvest)  #for html parsing

doc.html <- read_html(link)  #parse the page; no need to specify the encoding here
html_nodes(doc.html,"table") #list all of the tables on this page
## {xml_nodeset (2)}
## [1] <table class="table borderless" cellspacing="0" cellpadding="0">\n<t ...
## [2] <table id="table1" dir="rtl" style="BORDER-TOP-WIDTH: 0px; BORDER-LE ...
table <- html_table(html_nodes(doc.html,"table")[[2]])
head(table)
##                       X1                                           X2
## 1 اللقب العلمي والوظيفة: مدرس متفرغ بأكاديمية المقطم للعلوم الحديثة .
## 2          جامعة التخرج:     جامعة الأقصى غزة وجامعية عين شمس القاهرة
## 3           كلية التخرج:             كلية التربية قسم العلوم الفلسفية
## 4          التخصص العام:                                    صحة نفسية
## 5          لتخصص الدقيق:                                 غير العاديين
## 6          مكان الميلاد:                                 الإسكندرية .
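Since we already found the id ‘table1’ when inspecting the page, we could equally well select the CV table by that id rather than by position; a quick sketch:

cv.node <- html_node(doc.html, "#table1")  #select by the id we found with Inspect
head(html_table(cv.node))  #same CV table as above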

For a PC you might need to correct the encoding:

library(readr)  #for type_convert
head(type_convert(table))

But the website isn’t quite set up as a normal table (there are a few long merged rows), so we cut and paste those rows into the right place.

#library(knitr)  #kable function

#the data in row 9 column 1 should actually be row 8 column 2
table[8,2] <- table[9,1]  #switch
table <- table[-9,]  #drop 

#kable(table)

Finally, save the table as a text file in the working directory. In the real world we’d set up a folder structure to organize these files (a sketch follows the code below). Notice that I’m using the fileEncoding="UTF-8" option and saving as a text file.

#save as txt because we want to be able to analyze it easily later
write.table(table, "table.txt", sep=",", quote=TRUE, fileEncoding="UTF-8")
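If we were looping over many clerics, a small folder structure would keep these files organized. A sketch, assuming we name each file by the cleric id pulled out of the URL (the folder and naming scheme here are my own, not part of the original workflow):

dir.create("cvs", showWarnings=FALSE)  #one folder for all of the CV tables
cleric.id <- gsub(".*schcv-([0-9]+)\\.htm$", "\\1", link)  #e.g. "3240" for this cleric
write.table(table, file.path("cvs", paste0("cv-", cleric.id, ".txt")),
            sep=",", quote=TRUE, fileEncoding="UTF-8")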

Downloading Fatwas

Next, we’re going to have to dig a few levels into the website to get the fatwas that this particular cleric has written. These fatwas are advice the cleric has given on religious matters to people who have written in. In the example I’m going to show, a young man asks whether he can communicate with a foreign Muslim girl he met over the internet, and the cleric answers that he should not continue the communication. Once we figure out the steps for one page, the same process will work for any other cleric and any other ruling as well.

These are the steps we’re going to follow:

The First Page

First we need to programmatically find, on the first page, the link to the second page. Here is the relevant part of the first page’s source, showing the id of that link.

    <tr>
        <td class="active">
         
         <h5 class="nopadding">المشاركات</h5>
         </td>
    </tr>
    <tr>
        <td class="FatwaLeftSideCell" style="text-align:right">
        <a id="ctl00_ContentPlaceHolder1_SchList1_Repeater1_ctl00_HyperLink1" href="schques-70-3240-1.htm">عرض الإستشارات</a>
        </td>
    </tr>
    <tr>
        <td colspan="2" class="active">
        <h5 class="nopadding"> السيرة الذاتية</h5>
         </td>
    </tr>

Use the html_nodes() function to select that link by the id we just identified.

follow0 <- html_nodes(doc.html,css="#ctl00_ContentPlaceHolder1_SchList1_Repeater1_ctl00_HyperLink1")

Get the link itself (the href attribute) with the xml_attr() function.

follow <- xml_attr(follow0, "href")
follow
## [1] "schques-70-3240-1.htm"

We could also get the text or other properties if we wanted.

html_text(follow0); rm(follow0)
## [1] "عرض الإستشارات"

For a PC, use the repair_encoding() function from rvest instead (in place of the plain html_text() call above) to display the text correctly.

library(rvest)
repair_encoding(html_text(follow0), from="utf-8"); rm(follow0)

The link it returns is relative to the base page, though, so we need to build an absolute link: replace the end of the base link with the new relative link.

follow1 <- gsub("([^/]+$)", follow, link)  #replace after the last slash with follow link

Here’s some more code to show how this works, step by step:

link #remember our original link
## [1] "http://islamtoday.net/istesharat/schcv-3240.htm"
gsub("([^/]+$)", "REPLACE", link)  #replace everything after the last / with "REPLACE"
## [1] "http://islamtoday.net/istesharat/REPLACE"
gsub("([^/]+$)", follow, link) #now instead of "REPLACE" use the relative link 
## [1] "http://islamtoday.net/istesharat/schques-70-3240-1.htm"

The Second Page

Now follow this new link and get the HTML of the second page. This brings us to a page with links to a number of fatwas; we’ll collect those links and iterate through them.

Parse the HTML for this second page, and then get a list of all of the links on this second page that lead to the third page.

#level 2

doc.html <- read_html(follow1)
follow2 <- html_nodes(doc.html, ".QuesListTitleTxt")
follow2 <- xml_attr(follow2, "href")
follow2
## [1] "/istesharat/quesshow-70-174239.htm"
## [2] "/istesharat/quesshow-70-174199.htm"
## [3] "/istesharat/quesshow-70-174113.htm"
## [4] "/istesharat/quesshow-70-150445.htm"

Now change these relative paths to absolute paths. Notice that this time we have several links instead of a single one, so I’m using lapply() to run the gsub() function on each of them. Since xml_attr() already returns a character vector, sapply() would work just as well and would return a vector directly, skipping the unlist() step (see the sketch after the output below).

follow2 <- lapply(follow2, FUN= function(x) gsub("(/[^/]+/[^/]+$)", x, link)) #replace everything after the second-to-last forward slash with the relative link
unlist(follow2)
## [1] "http://islamtoday.net/istesharat/quesshow-70-174239.htm"
## [2] "http://islamtoday.net/istesharat/quesshow-70-174199.htm"
## [3] "http://islamtoday.net/istesharat/quesshow-70-174113.htm"
## [4] "http://islamtoday.net/istesharat/quesshow-70-150445.htm"

The Third Page

Now, follow the links to the third page with the actual fatwas. The table with the fatwas has the class ‘table table-condensed’, but we want only the cleric’s words and not the question from the person asking for advice. So we pull down only the row that has his answer (the second table row with the class ‘article’).

Parse the third page and get the second row of the table with the ‘article’ class.

#for(i in 1:length(follow2)){
i <- 1; follow3 <- unlist(follow2[i])  #get a specific link
doc.html <- read_html(follow3)  #follow the link

html_nodes(doc.html, ".article")[[2]]  %>% #get the second row
        html_text() -> response #get the text and save it

print(substr(response,1,500))  #show the text
## [1] "\r\n                        \r\n                            الحمد لله والصلاة والسلام على رسول الله، وبعد:\r\nأخي الكريم أنت تعرف أن حديثك مع هذه الفتاه يعتبر حراما، وسوف أوضح لك لماذا يعتبر هذا الحديث حراما: إن الحديث بينكما صورة من صور الخلوة بين الرجل والمرأة الأجنبية، وهناك التحذير الشديد \"ما خلا رجل بامرأة أجنبية إلا وكان الشيطان ثالثهما\".\r\n- هل هذه التي تحادثها هل هي أختك أو زوجتك أو إحدى المحرمات عليك، أو هي فتاة أجنبية تعرفت عليها من خلال الشبكة العنكبوتية، بلا شك أنها أجنبية عنك يبدأ الحديث ف"

Finally, write this text to a text file in our working directory. In the real world we’d give it a more identifiable name based on, say, the cleric’s id and the topic (see the sketch at the end). The useBytes=TRUE option is necessary for the encoding to work right.

writeLines(response, paste0("response", i, ".txt"), useBytes=TRUE)
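Putting it all together, the real scraper would wrap the CV step and the fatwa steps inside the two loops we commented out above, assuming (as the tutorial does) that the link ids and classes are the same on every cleric’s page. Here is a sketch of that skeleton; the file names and the separate inner index j (so it doesn’t clobber the outer index i) are my own choices, not part of the original code:

for(i in 1:length(link.list)){
  link <- link.list[i]
  doc.html <- read_html(link)

  #... extract and save the CV table, as in the first section ...

  #follow the "View Advice" link to the list of fatwas
  follow0 <- html_nodes(doc.html, css="#ctl00_ContentPlaceHolder1_SchList1_Repeater1_ctl00_HyperLink1")
  follow1 <- gsub("([^/]+$)", xml_attr(follow0, "href"), link)

  #collect the links to the individual fatwa pages
  follow2 <- xml_attr(html_nodes(read_html(follow1), ".QuesListTitleTxt"), "href")
  follow2 <- lapply(follow2, FUN=function(x) gsub("(/[^/]+/[^/]+$)", x, link))

  #visit each fatwa page and save the cleric's answer
  for(j in 1:length(follow2)){
    follow3 <- unlist(follow2[j])
    response <- html_text(html_nodes(read_html(follow3), ".article")[[2]])
    writeLines(response, paste0("response-", i, "-", j, ".txt"), useBytes=TRUE)
    Sys.sleep(1)  #pause between requests to be polite to the server
  }
}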