Public Web APIs
Learning Goals
- Understand the difference between acquiring data through web scraping vs. a web API
- Set up an API key for a public API
- Develop comfort in using a wrapper package or url-method of calling a web API
- Recognize the structure in a url for a web API and adjust for your purposes
You can download a template .Rmd of this activity here.
APIs
In this lesson you’ll learn how to collect data from websites such as The New York Times, Zillow, and Google. While these sites are primarily known for the information they provide to humans browsing the web, they (along with most large websites) also provide information to computer programs.
Humans use browsers such as Firefox or Chrome to navigate the web. Behind the scenes, our browsers communicate with web servers using a technology called HTTP or Hypertext Transfer Protocol.
Programming languages such as R can also use HTTP to communicate with web servers. We have seen how it is possible for R to “scrape” data from almost any static web page. However, it’s easiest to interact with websites that are specifically designed to communicate with programs. These Web APIs, or Web Application Programming Interfaces, focus on transmitting data, rather than images, colors, or other appearance-related information.
A large variety of web APIs provide data accessible to programs written in R (and almost any other programming language!). Almost all reasonably large commercial websites offer APIs. Todd Motto has compiled an excellent list of Public Web APIs on GitHub. Browse the list to see what kind of information is available.
Wrapper Packages
Possible readings:
1. NY Times API
2. NY Times Blog post announcing the API
3. Working with the NY Times API in R
4. nytimes package for accessing the NY Times’ APIs from R
5. Video showing how to use the NY Times API
6. rOpenSci has a good collection of wrapper packages
In R, it is easiest to use Web APIs through a wrapper package, an R package written specifically for a particular Web API. The R development community has already contributed wrapper packages for most large Web APIs. To find a wrapper package, search the web for “R Package” and the name of the website. For example, a search for “R Reddit Package” returns RedditExtractor and a search for “R Weather.com Package” surfaces weatherData.
This activity will build on the New York Times Web API, which provides access to news articles, movie reviews, book reviews, and many other data. Our activity will specifically focus on the Article Search API, which finds information about news articles that contain a particular word or phrase.
We will use the nytimes package that provides functions for some (but not all) of the NYTimes APIs. First, install the package by copying the following two lines into your console (you just need to run these once):
install.packages("devtools")
devtools::install_github("mkearney/nytimes")
Next, take a look at the Article Search API example on the package website to get a sense of the syntax.
Exercise 14.13 What do you think the nytimes function below does? How does it communicate with the NY Times? Where is the data about articles stored?
res <- nyt_search(q = "gamergate", n = 20, end_date = "20150101")
To get started with the NY Times API, you must register and get an authentication key. Signup only takes a few seconds, and it lets the New York Times make sure nobody abuses their API for commercial purposes. It also rate limits their API and ensures programs don’t make too many requests per day. For the NY Times API, this limit is 1000 calls per day. Be aware that most APIs do have rate limits — especially for their free tiers.
Once you have signed up and verified your email, log back in to https://developer.nytimes.com. Under your email address, click on Apps and create a new app (call it First API), enable the Article Search API, then press Save. This creates an authentication key, which is a 32-character string of letters and digits.
Store this in a variable as follows (this is just an example ID, not an actual one):
times_key <- "c935b213b2dc1218050eec976283dbbd"
# Tell nytimes what our API key is
Sys.setenv(NYTIMES_KEY = times_key)
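A quick way to confirm that the key was stored is to read the environment variable back. The sketch below uses only base R (Sys.getenv and nzchar); the key is the same placeholder as above, not a working credential.

```r
# Placeholder key from the text above; substitute your own
times_key <- "c935b213b2dc1218050eec976283dbbd"
Sys.setenv(NYTIMES_KEY = times_key)

# Read the variable back; Sys.getenv() returns "" when it is unset
stored_key <- Sys.getenv("NYTIMES_KEY")
key_is_set <- nzchar(stored_key)
key_is_set
```

Variables set with Sys.setenv last only for the current R session, so the key needs to be set again after restarting R.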
Now, let’s use the key to issue our first API call. We’ll adapt the code we see in the vignette to do what we need.
library(nytimes)
# Issue our first API call
res <- nyt_search(q = "gamergate", n = 20, end_date = "20150101")
## Error: couldn't find NYTIMES_KEY environment variable
# Convert response object to data frame
res <- as.data.frame(res)
## Error in as.data.frame(res): object 'res' not found
Something magical just happened. Your computer sent a message to the New York Times and asked for information about 20 articles about Gamergate starting at January 1, 2015 and going backwards in time. Thousands of public Web APIs allow your computer to tap into almost any piece of public digital information on the web.
Let’s take a peek at the structure of the results. You can also look at the data in the “Environment” tab in one of the windows of RStudio:
colnames(res)
## Error in is.data.frame(x): object 'res' not found
head(res$web_url)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'res' not found
head(res$headline)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'res' not found
head(res$pub_date)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'res' not found
Accessing Web APIs
Wrapper packages such as nytimes provide a convenient way to interact with Web APIs. However, many Web APIs have incomplete wrapper packages, or no wrapper package at all. Fortunately, most Web APIs share a common structure that R can access relatively easily. There are two parts to each Web API: the request, which corresponds to a function call, and the response, which corresponds to the function’s return value.
As mentioned earlier, a Web API call differs from a regular function call in that the request is sent over the Internet to a webserver, which performs the computation and calculates the return result, which is sent back over the Internet to the original computer.
Web API Requests
Possible readings:
1. Understanding URLs
2. urltools Vignette
The request for a Web API call is usually encoded through the URL, the web address associated with the API’s webserver. Let’s look at the URL associated with the first nytimes nyt_search example we did. Open the following URL in your browser (you should replace MY_KEY with the API key you were given earlier).
http://api.nytimes.com/svc/search/v2/articlesearch.json?q=gamergate&api-key=MY_KEY
The text you see in the browser is the response data. We’ll talk more about that in a bit. Right now, let’s focus on the structure of the URL. You can see that it has a few parts:
- http:// is the scheme, which tells your browser or program how to communicate with the webserver. This will typically be either http: or https:.
- api.nytimes.com is the hostname, which is a name that identifies the webserver that will process the request.
- /svc/search/v2/articlesearch.json is the path, which tells the webserver what function you would like to call.
- ?q=gamergate&api-key=MY_KEY holds the query parameters, which provide the parameters for the function you would like to call. Note that the query can be thought of as a table, where each row has a key and a value (known as a key-value pair). In this case, the first row has key q and value gamergate, and the second row has key api-key and value MY_KEY. The query parameters are preceded by a ?. Rows in the key-value table are separated by &, and individual key-value pairs are separated by an =.
Typically, each of these URL components will be specified in the API documentation. Sometimes, the scheme, hostname, and path (http://api.nytimes.com/svc/search/v2/articlesearch.json) will be referred to as the endpoint for the API call.
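The URL anatomy described above can be checked by taking the example URL apart in code. This is a minimal sketch in base R, assuming a URL that has a scheme, hostname, path, and query and nothing else (no port or fragment); the regular expression and variable names are illustrative only.

```r
api_url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=gamergate&api-key=MY_KEY"

# Capture scheme, hostname, path, and query with one regular expression
pattern <- "^([a-z]+)://([^/]+)([^?]*)\\?(.*)$"
parts <- regmatches(api_url, regexec(pattern, api_url))[[1]]
scheme   <- parts[2]
hostname <- parts[3]
path     <- parts[4]
query    <- parts[5]

# Split the query into rows on "&", then into key and value on "="
pairs <- strsplit(strsplit(query, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
query_table <- data.frame(
  key   = vapply(pairs, `[`, character(1), 1),
  value = vapply(pairs, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# The endpoint is everything before the "?"
endpoint <- paste0(scheme, "://", hostname, path)
query_table
```

Printing query_table shows the two key-value rows of the query, which is the "table" view of the parameters described in the list.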
We will use the urltools package to build up a full URL from its parts. We start by creating a string with the endpoint and then add the parameters one by one using param_set and url_encode:
library(urltools)
url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json"
url <- param_set(url, "q", url_encode("marlon james"))
url <- param_set(url, "api-key", url_encode(times_key))
url
Copy and paste the resulting URL into your browser to see what the NY Times response looks like!
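To see what can go wrong when a query value is pasted into a URL unencoded, here is a small sketch. It uses base R's URLencode as a stand-in for url_encode (an assumption: the two functions may differ in some details), and the search phrase is a made-up example.

```r
phrase <- "rock & roll"  # hypothetical search phrase

endpoint <- "http://api.nytimes.com/svc/search/v2/articlesearch.json"

# Pasting the raw phrase in: the "&" now looks like a parameter separator
raw_query <- paste0("q=", phrase, "&api-key=MY_KEY")
raw_pieces <- strsplit(raw_query, "&", fixed = TRUE)[[1]]
length(raw_pieces)   # three pieces instead of the intended two

# Percent-encode reserved characters before building the query
encoded <- URLencode(phrase, reserved = TRUE)
safe_query <- paste0("q=", encoded, "&api-key=MY_KEY")
safe_pieces <- strsplit(safe_query, "&", fixed = TRUE)[[1]]
length(safe_pieces)  # two pieces, one per parameter

full_url <- paste0(endpoint, "?", safe_query)
full_url
```

A server splitting the raw query on & would see a stray third parameter, while the encoded version keeps the whole phrase inside the q value.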
Exercise 14.14 You may be wondering why we need to use param_set and url_encode instead of writing the full url by hand. This exercise will illustrate why we need to be careful.
- Repeat the above steps, but create a URL that finds articles related to Ferris Bueller's Day Off (note the apostrophe). What is interesting about how the title appears in the URL?
- Repeat the steps above for the phrase Nico & Vinz (make sure you use the punctuation mark &). What do you notice?
- Take a look at the Wikipedia page describing percent encoding. Explain how the process works in your own words.
Web API Responses
Possible readings:
1. A Non-Programmer’s Introduction to JSON
2. Getting Started With JSON and jsonlite
3. Fetching JSON data from REST APIs
We now discuss the structure of the web response, the return value of the Web API function. Web APIs generate string responses. If you visited the earlier New York Times API link in your browser, you would be shown the string response from the New York Times webserver:
{"status":"OK","copyright":"Copyright (c) 2021 The New York Times Company. All Rights Reserved.","response":{"docs":[{"abstract":"Here’s what you need to know.","web_url":"https://www.nytimes.com/2019/08/16/briefing/rashida-tlaib-gamergate-greenland.html","snippet":"Here’s what you need to know.","lead_paragraph":"(Want to get this briefing by email? Here’s the sign-up.)","source":"The New York Times","multimedia":[{"rank":0,"subtype":"xlarge","caption":null,"credit":null,"type":"image","url":"images/2019/08/16/world/16US-AMBRIEFING-TLAIB-amcore/merlin_158003643_c67928bc-e547-4a2e-9344-5f0209ca024d-articleLarge.jpg","height":400,"width":600,"legacy":{"xlarge":"images/2019/08/16/world/16US-AMBRIEFING-TLAIB-amcore/merlin_158003643_c67928bc-e547-4a2e-9344-5f0209ca024d-articleLarge.jpg","xlargewidth":600,"xlargeheight":400},"subType":"xlarge","crop_name":"articleLarge"},...
If you stare very hard at the above response, you may be able to interpret it. However, it would be much easier to interact with the response in some more structured, programmatic way. The vast majority of Web APIs, including the New York Times, use a standard called JSON (JavaScript Object Notation) to take data and encode it as a string. To understand the structure of JSON, take the NY Times web response in your browser, and copy and paste it into an online JSON formatter. The formatter will add newlines and tabs to make the data more human interpretable. You’ll see the following:
{
"status":"OK",
"copyright":"Copyright (c) 2021 The New York Times Company. All Rights Reserved.",
"response":{
"docs":[
# A HUGE piece of data, with one object for each of the result articles
],
"meta":{
"hits":128,
"offset":0,
"time":93
}
}
}
You’ll notice a few things in the JSON above:
- Strings are enclosed in double quotes, for example "status" and "OK".
- Numbers are written plainly, like 128 or 93.
- Some data is enclosed in square brackets [ and ]. These data containers can be thought of as R lists.
- Some data is enclosed in curly braces { and }. These data containers are called Objects. An object can be thought of as a single observation in a table. The columns or variables for the observation appear as keys on the left (hits, offset, etc.). The values appear after the specific key separated by a colon (128 and 0, respectively). Thus, we can think of the meta object above as:
hits | offset | time |
---|---|---|
128 | 0 | 93 |
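The mapping from JSON to R can be tried offline by parsing a small JSON string directly. The sketch below assumes the jsonlite package is installed; the string is a trimmed copy of the response shown above with the docs array left empty.

```r
library(jsonlite)

# Trimmed copy of the response above; "docs" is left empty for brevity
json_text <- '{"status":"OK","response":{"docs":[],"meta":{"hits":128,"offset":0,"time":93}}}'

parsed <- fromJSON(json_text)

# A JSON object becomes a named R list
parsed$status

# The meta object, viewed as a one-row table
meta_row <- as.data.frame(parsed$response$meta)
meta_row
```

Here fromJSON is given a JSON string rather than a URL, so no request is sent and no API key is needed.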
Let’s repeat the NY Times search for gamergate, but this time we will perform the Web API call by hand instead of using the nytimes wrapper package. We will use the jsonlite package to retrieve the response from the webserver and turn the string response into an R object. The fromJSON function sends our request over the web to the NY Times webserver, retrieves the response, and turns it from a JSON-formatted string into R data.
library(jsonlite)
# Rebuild the URL
url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json"
url <- param_set(url, "q", url_encode("gamergate"))
url <- param_set(url, "api-key", url_encode(times_key))
## Error in url_encode(times_key): object 'times_key' not found
# Send the request to the webserver over the Internet and
# retrieve the JSON response. Turn the JSON response into an
# R Object.
response_js <- fromJSON(url)
## Error in open.connection(con, "rb"): cannot open the connection to 'http://api.nytimes.com/svc/search/v2/articlesearch.json?q=gamergate'
The jsonlite package makes the keys and values of an object available as attributes. For example, we can fetch the status:
response_js$status
## Error in eval(expr, envir, enclos): object 'response_js' not found
While some keys in the object are associated with simple values, such as "status", others are associated with more complex data. For example, the key "response" is associated with an object that has two keys: "docs" and "meta". "meta" is another object: { "hits":128, "offset":0, "time":93 }. We can retrieve these nested attributes by sequentially accessing the object keys from the outside in. For example, the inner "hits" attribute would be accessed as follows:
response_js$response$meta$hits
## Error in eval(expr, envir, enclos): object 'response_js' not found
Exercise 14.15 Retrieve the data associated with
- the copyright key of the response_js object, and
- the time attribute nested within the meta object.
The majority of the data is stored under response, in docs. Notice that docs is a list, where each element of the list is a JSON object that looks like the following:
{
"web_url":"https://www.nytimes.com/2017/06/27/arts/milkshake-duck-meme.html",
"snippet":"Oxford Dictionaries is keeping a close eye on a term that describes someone who rapidly gains and inevitably loses the internet’s intense love.",
"blog":{ },
"source":"The New York Times",
"multimedia":[
... A LIST OF OBJECTS ...
],
"headline":{
"main":"How a Joke Becomes a Meme: The Birth of ‘Milkshake Duck’",
"print_headline":"How a Joke Becomes a Meme: The Birth of ‘Milkshake Duck’"
},
"keywords":[
... A LIST OF OBJECTS ...
],
"pub_date":"2017-06-27T12:24:20+0000",
"document_type":"article",
"new_desk":"Culture",
"byline":{
"original":"By JONAH ENGEL BROMWICH"
},
"type_of_material":"News",
"_id":"59524e7f7c459f257c1ac39f",
"word_count":1033,
"score":0.35532707,
"uri":"nyt://article/a3e5bf4a-6216-5dba-9983-73bc45a98e69"
},
jsonlite makes lists of objects available as a data frame, where the columns are the keys in the object (web_url, snippet, etc.):
docs_df <- response_js$response$docs
## Error in eval(expr, envir, enclos): object 'response_js' not found
class(docs_df)
## Error in eval(expr, envir, enclos): object 'docs_df' not found
colnames(docs_df)
## Error in is.data.frame(x): object 'docs_df' not found
dim(docs_df)
## Error in eval(expr, envir, enclos): object 'docs_df' not found
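The conversion from a list of JSON objects to a data frame can be seen without calling the API. The sketch below assumes jsonlite is installed and uses two made-up article records (the URLs and word counts are hypothetical).

```r
library(jsonlite)

# Two hypothetical article objects inside a JSON array
docs_json <- '[
  {"web_url": "https://example.com/a", "word_count": 1033},
  {"web_url": "https://example.com/b", "word_count": 512}
]'

docs_demo <- fromJSON(docs_json)

class(docs_demo)     # a data frame
colnames(docs_demo)  # one column per key
dim(docs_demo)       # one row per object
```

Each object in the array becomes a row, and each key shared by the objects becomes a column.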
Exercise 14.16 (Your own article search) Consider the following:
- Select your own article search query (any topic of interest to you). You may want to play with NY Times online search or the API web search console to find a query that is interesting, but not overly popular. You can change any part of the query you would like. Your query should have at least 30 matches.
- Retrieve data for the first three pages of search results from the article search API, and create a data frame that joins together the docs data frames for the three pages of results. Hint: The example in the section below shows how to get different pages of results and use bind_rows to combine them.
- Visualize the number of search results per day or month in your result set.
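For the last step, the pub_date strings need to be turned into dates before they can be counted. This is a minimal base R sketch, assuming the dates follow the format shown in the sample article object (for example 2017-06-27T12:24:20+0000); the second and third dates are made up for illustration.

```r
# First value is from the sample object above; the others are hypothetical
pub_date <- c("2017-06-27T12:24:20+0000",
              "2017-06-30T09:00:00+0000",
              "2017-07-02T18:45:10+0000")

# Keep the YYYY-MM-DD part and convert it to a Date
pub_day <- as.Date(substr(pub_date, 1, 10))

# Count articles per month
pub_month <- format(pub_day, "%Y-%m")
per_month <- table(pub_month)
per_month
```

The resulting counts can be passed to a plotting function of your choice, for example barplot(per_month).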
A Note on Nested Data Frames
Here is some code to generate queries on NY Times articles about the Red Sox. It fetches the first thirty entries in batches of 10.
url <- "http://api.nytimes.com/svc/search/v2/articlesearch.json"
url <- param_set(url, "q", url_encode("Red Sox"))
url <- param_set(url, "api-key", url_encode(times_key))
## Error in url_encode(times_key): object 'times_key' not found
url <- param_set(url, "page", 0)
res1 <- fromJSON(url)
## Error in open.connection(con, "rb"): cannot open the connection to 'http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Red%20Sox&page=0'
# This pauses for 1 second.
# It is required when knitting to prevent R from issuing too many requests to
# The NY Times API at a time. If you don't have it you will get an error that
# says "Too Many Requests (429)"
Sys.sleep(1)
url <- param_set(url, "page", 1)
res2 <- fromJSON(url)
## Error in open.connection(con, "rb"): cannot open the connection to 'http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Red%20Sox&page=1'
Sys.sleep(1)
url <- param_set(url, "page", 2)
res3 <- fromJSON(url)
## Error in open.connection(con, "rb"): cannot open the connection to 'http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Red%20Sox&page=2'
docs1 <- res1$response$docs
## Error in eval(expr, envir, enclos): object 'res1' not found
docs2 <- res2$response$docs
## Error in eval(expr, envir, enclos): object 'res2' not found
docs3 <- res3$response$docs
## Error in eval(expr, envir, enclos): object 'res3' not found
Each of these docs variables is a table with ten entries (articles) and the same 18 variables:
names(docs1)
## Error in eval(expr, envir, enclos): object 'docs1' not found
Now we want to stack the tables on top of each other to get a single table with 30 rows and 18 variables. If you try the following command:
bind_rows(docs1,docs2,docs3)
then you will get an error saying “Error in bind_rows_(x, .id) : Argument 4 can’t be a list containing data frames.”
What is happening???
Let’s check out the first column of the docs1 table:
docs1$web_url
## Error in eval(expr, envir, enclos): object 'docs1' not found
It lists the web addresses of the first ten sites returned in the search. It is a vector of ten character strings, which is just fine for one column of data in our table.
Now let’s check out the headline variable:
docs1$headline
## Error in eval(expr, envir, enclos): object 'docs1' not found
The headline variable is actually a data frame that contains three variables: main, kicker, and print_headline. That is, we have nested data frames. This is a common problem when scraping data from JSON files, and it is why we are not able to directly bind the rows of our three tables on top of each other.
We can check out the type of variable in each column with the class function:
sapply(docs1, class)
## Error in lapply(X = X, FUN = FUN, ...): object 'docs1' not found
We see that blog, headline, and byline are the three problem columns that each contain their own data frames.
The solution is to flatten these variables, which generates a new column in the outer table for each of the columns in the inner tables.
docs1_flat <- jsonlite::flatten(docs1)
## Error in is.data.frame(x): object 'docs1' not found
names(docs1_flat)
## Error in eval(expr, envir, enclos): object 'docs1_flat' not found
sapply(docs1_flat, class)
## Error in lapply(X = X, FUN = FUN, ...): object 'docs1_flat' not found
The headline variable is now replaced with seven separate columns for headline.main, headline.kicker, headline.content_kicker, headline.print_headline, headline.name, headline.seo, and headline.sub. The byline variable is replaced with three separate columns. The blog variable contained an empty data frame, so it has been removed. The overall result is a new flat table with 25 columns, and no more nested data frames.
Once the data is flattened, we can bind rows:
library(dplyr)  # provides bind_rows()
all_docs <- bind_rows(jsonlite::flatten(docs1), jsonlite::flatten(docs2), jsonlite::flatten(docs3))
## Error in is.data.frame(x): object 'docs1' not found
dim(all_docs)
## Error in eval(expr, envir, enclos): object 'all_docs' not found
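The three requests above repeat the same steps, so they can also be written as a loop. The sketch below is one possible arrangement, not the only one: fetch_pages and fake_fetch are hypothetical helpers introduced here, the example.com URL is a placeholder, and the fake fetcher stands in for the API so the code runs without a key. It assumes jsonlite is installed and uses base rbind, which works here because every flattened page has the same columns.

```r
library(jsonlite)

# Hypothetical helper: request each page, pause, and flatten its docs table.
# `fetch` defaults to fromJSON but can be swapped for a stub when offline.
fetch_pages <- function(base_url, pages, fetch = jsonlite::fromJSON, pause = 1) {
  lapply(pages, function(p) {
    page_url <- paste0(base_url, "&page=", p)
    res <- fetch(page_url)
    Sys.sleep(pause)  # stay under the API rate limit
    jsonlite::flatten(res$response$docs)
  })
}

# Offline stand-in for the API: two articles per page with a nested headline
fake_fetch <- function(page_url) {
  docs <- data.frame(web_url = paste0(page_url, "#", 1:2),
                     stringsAsFactors = FALSE)
  docs$headline <- data.frame(main = c("A", "B"), stringsAsFactors = FALSE)
  list(response = list(docs = docs))
}

pages_flat <- fetch_pages("http://example.com/search.json?q=Red%20Sox", 0:2,
                          fetch = fake_fetch, pause = 0)

# Stack the flattened pages into one table
all_pages <- do.call(rbind, pages_flat)
dim(all_pages)
```

With a real key, the same function could be called with the full API URL and the default fetch and pause arguments. If the pages come back with different columns, bind_rows is the safer way to stack them.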
Additional Practice
Exercise 14.17 (Choose-your-own public API visualization) Browse toddmotto’s list of Public APIs and abhishekbanthia’s list of Public APIs. Select one of the APIs from the list. Here are a few criteria you should consider:
- You must use the JSON approach we illustrated above; not all APIs support JSON.
- Stay away from APIs that require OAuth for Authorization unless you are prepared for extra work before you get data! Most of the large social APIs (Facebook, LinkedIn, Twitter, etc.) require OAuth. toddmotto’s page lists this explicitly, but you’ll need to dig a bit if the API is only on abhishekbanthia’s list.
- You will probably need to explore several different APIs before you find one that works well for your interests and this assignment.
- Beware of the rate limits associated with the API you choose. These determine the maximum number of API calls you can make per second, hour, or day. Though these are not always officially published, you can often find them with a web search (for example, GitHub API rate limit). If you need to slow your program down to meet the API's limits, insert calls to Sys.sleep(1) as is done in the example above.
- Sketch out one interesting visualization that relies on the public API you selected earlier. Make sure the exact data you need is available. If it’s not, try a new visualization or API.
- If a wrapper package is available, you may use it, but you should also try to create the request URL and retrieve the JSON data using the techniques we showed earlier, without the wrapper package.
- Visualize the data you collected and describe the results.