Using the NYT API

The New York Times has an API (Application Programming Interface) that gives you access to their articles without having to scrape the text yourself. Presumably they built it after people kept scraping their website and generating a lot of traffic.

I put together some code that searches for the word “mcmaster” from January 1st, 2017 to February 27th, 2017 and returns the headlines and dates of the matching articles. You can add an additional loop on top of this that takes a list of terms and iterates through them.
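As a rough sketch, that outer loop over a list of terms could look like this. It assumes the search-and-collect steps shown below have been wrapped into a hypothetical helper called `get_headlines()` that returns a data frame of headlines and dates for one term:

```r
# Hypothetical wrapper: get_headlines(term) runs the search-and-collect
# code shown below for a single query term.
terms <- c("mcmaster", "tillerson", "mattis")

results <- list()
for (term in terms) {
  results[[term]] <- get_headlines(term)
}

# Stack everything into one data frame
dat_all <- do.call(rbind, results)
```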

The complication is that the NYT API only returns 10 hits per request, so you have to page through the results 10 at a time. There’s also a daily limit on how many requests you can make, but I don’t fully understand how it works and it doesn’t seem to count consistently.
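For example, with 25 total hits the page arithmetic works out like this (page numbers start at 0, which is why the code below uses `floor()`):

```r
hits <- 25
last_page <- floor(hits / 10)  # 2
pages <- 0:last_page           # pages 0, 1, 2 cover results 1-25
```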

Setting it up

First, go to their website to get a personal API key so that they can limit your daily usage. Use this link: https://developer.nytimes.com/signup

Searching

This page gives more details about the search function options: http://developer.nytimes.com/article_search_v2.json#/Documentation/GET/articlesearch.json

#https://cran.r-project.org/web/packages/rtimes/vignettes/rtimes_vignette.html
#http://developer.nytimes.com/article_search_v2.json#/Documentation/GET/articlesearch.json
library(rtimes)
library(beepr)

key <- "YOUR_NYT_API_KEY"  #paste the key from the signup page here

#need to run it once to get the total number of hits
res <- as_search(q="mcmaster", 
                 begin_date = "20170101", 
                 end_date = '20170227', 
                 fl = c("headline","pub_date"),
                 sort = "newest",
                 page = 0, key=key)

headline <- res$data$headline.main
date <- res$data$pub_date

tot <- floor(res$meta$hits/10)  #number of additional pages of 10 results each

bottom <- 1  #makes it easier to start again if there's an error

for(i in bottom:tot){
    if(i == 1) cat(paste0("Starting ",i," of ",tot, " additional iterations \n"))
    bottom <- i  #move up the loop
    Sys.sleep(.75)  #need to wait a little bit, .75 is as little as I can get it with the NYT complaining
    cat(paste0(i, "..."))
    res <- as_search(q="mcmaster", 
                     begin_date = "20170101", 
                     end_date = '20170227', 
                     fl = c("headline","pub_date"),
                     page = i, key=key)
    headline.tmp <- res$data$headline.main
    date.tmp <- res$data$pub_date
    
    headline <- c(headline, headline.tmp)  #bind them together
    date <- c(date, date.tmp)  #bind them together
    
}
## Starting 1 of 2 additional iterations 
## 1...2...
cat("\n Finished!")
## 
##  Finished!
beep()

Here’s what the output looks like:

dat <- cbind(headline, date)
head(dat)
##      headline                                                                     
## [1,] "First Big Test for Mattis: Pitch Plans to Fight ISIS and Not Alienate Trump"
## [2,] "Will Trump Take ‘Brutally Forthright’ Advice From McMaster?"                
## [3,] "Calling Secretary Tillerson"                                                
## [4,] "The Islamophobic Huckster in the White House"                               
## [5,] "H.R. McMaster Breaks With Administration on Views of Islam"                 
## [6,] "Friday Mailbag: Faulty Headlines, Insensitive Descriptions"                 
##      date                      
## [1,] "2017-02-26T20:01:27+0000"
## [2,] "2017-02-25T20:31:00+0000"
## [3,] "2017-02-25T03:27:56+0000"
## [4,] "2017-02-25T01:13:50+0000"
## [5,] "2017-02-25T01:12:50+0000"
## [6,] "2017-02-24T10:00:32+0000"

And if I wanted to see only the headlines that include the word “breaks”, as in disagreements over policy, I would throw those kinds of words into a regular expression:

rows <- grepl("breaks", dat[,"headline"], ignore.case = TRUE)
dat[rows,]
##      headline                                                    
## [1,] "H.R. McMaster Breaks With Administration on Views of Islam"
## [2,] "Joe Scarborough Breaks Down Trump for Stephen Colbert"     
##      date                      
## [1,] "2017-02-25T01:12:50+0000"
## [2,] "2017-02-22T10:02:40+0000"
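To match several of these disagreement-type words at once, you can join them with `|` in the pattern. The extra terms and the second headline below are just made-up examples to show the idea; in practice you would search `dat[, "headline"]`:

```r
# Example headlines (the second one is invented for illustration)
headlines <- c("H.R. McMaster Breaks With Administration on Views of Islam",
               "McMaster Splits With Trump on Key Issue",
               "Calling Secretary Tillerson")

terms <- c("breaks", "splits", "disagrees")
pattern <- paste(terms, collapse = "|")  # "breaks|splits|disagrees"

headlines[grepl(pattern, headlines, ignore.case = TRUE)]
```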

Checking usage

You can use a curl request to inspect the response headers and see how much of your daily limit remains:

#on my Mac this tells me how many requests I have left today; it won't work on Windows
check <- paste0("curl --head https://api.nytimes.com/svc/books/v3/lists/overview.json?api-key=", key,' 2>/dev/null | grep -i "X-RateLimit"')
system(check)
X-RateLimit-Limit-day: 1000
X-RateLimit-Limit-second: 5
X-RateLimit-Remaining-day: 996
X-RateLimit-Remaining-second: 4
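A rough cross-platform alternative is to make the HEAD request from R itself, for example with the `httr` package (this assumes `httr` is installed; the header names are taken from the output above, and `httr` lower-cases them):

```r
library(httr)

url <- paste0("https://api.nytimes.com/svc/books/v3/lists/overview.json?api-key=",
              key)
res <- HEAD(url)

# Header names are lower-cased by httr
headers(res)[["x-ratelimit-remaining-day"]]
headers(res)[["x-ratelimit-limit-day"]]
```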