14  APIs - Part 2

Settling In

Sit with your project group.

You can download a template Quarto file to start from here. Put this file in a folder called data_acquisition within a folder for this course.

Announcements

  • When Data Acquisition / Cleaning takes a lot of computational power:
    • Put eval=FALSE in the computationally heavy code chunks in a qmd file (for a skills challenge)
    • OR Put the code in a cleaning.R file (for the project)
    • After running the computationally heavy cleaning code, save the data locally with write_csv(object_name, file = 'cleandata.csv') and read it back in with read_csv('cleandata.csv') for your analysis.
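
The save-then-reload pattern from the last bullet can be sketched as follows. This is a minimal illustration, not the required project setup: the small data frame stands in for the output of your heavy cleaning code, and the file is written to a temporary folder (an assumption made here so the sketch does not touch your project files).

```r
library(readr)

# Hypothetical location for the saved file; in a project this would be
# something like 'cleandata.csv' inside the project folder
clean_path <- file.path(tempdir(), "cleandata.csv")

if (!file.exists(clean_path)) {
    # Stand-in for the computationally heavy cleaning code
    clean_data <- data.frame(state = c("MN", "WI", "IA"), value = c(1, 2, 3))
    write_csv(clean_data, file = clean_path)
}

# The analysis always starts from the saved copy
clean_data <- read_csv(clean_path, show_col_types = FALSE)
```

With this structure, the expensive step only runs when the saved file does not exist yet.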

Data Storytelling Moment

Go to https://pudding.cool/2019/03/hype/

  • What is the data story?
  • What is effective?
  • What could be improved?

Learning goals

After this lesson, you should be able to:

  • Explain what an API is
  • Set up an API key for a public API
  • Develop comfort in using a URL-method of calling a web API
  • Recognize the structure in a URL for a web API and adjust for your purposes
  • Explore and subset complex nested lists






APIs

API stands for Application Programming Interface. The term describes a general class of tool that allows computer software, rather than humans, to interact with an organization’s data.

  • Application refers to software.
  • Interface can be thought of as a contract of service between two applications.
    • This contract defines how the two communicate with each other using requests and responses.

Every API has documentation for how software developers should structure requests for data / information and in what format to expect responses.

Web APIs

Web APIs, or Web Application Programming Interfaces, focus on transmitting requests and responses for raw data over the web.

  • Our browsers communicate with web servers using a technology called HTTP or Hypertext Transfer Protocol.
  • Programming languages such as R can also use HTTP to communicate with web servers.

URL

Every document on the Web has a unique address. This address is known as a Uniform Resource Locator (URL).

Every URL has the same general structure. Let’s look at this example:

https://api.census.gov/data/2019/acs/acs1?get=NAME,B02015_009E,B02015_009M&for=state:*

  • https://api.census.gov: This is the base URL.
    • https://: The scheme, which tells your browser or program how to communicate with the web server. This will typically be either http: or https:.
    • api.census.gov: The hostname or host address, which is a name that identifies the web server that will process the request.
  • data/2019/acs/acs1: The file path, which tells the web server how to get to the desired resource.
  • ?get=NAME,B02015_009E,B02015_009M&for=state:*: The query string or query parameters, which provide the parameters for the function you would like to call.
    • This is a string of key-value pairs separated by &. That is, the general structure of this part is key1=value1&key2=value2.
    key   value
    get   NAME,B02015_009E,B02015_009M
    for   state:*
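
To make the key1=value1&key2=value2 structure concrete, here is a small base R sketch that takes the example URL apart by hand. It is only an illustration (the variable names are made up here, and it assumes that no value contains = or &).

```r
url <- "https://api.census.gov/data/2019/acs/acs1?get=NAME,B02015_009E,B02015_009M&for=state:*"

# Everything before the "?" is the endpoint; everything after is the query string
parts <- strsplit(url, "?", fixed = TRUE)[[1]]
endpoint <- parts[1]
query_string <- parts[2]

# Key-value pairs are separated by "&"; each key is separated from its value by "="
pairs <- strsplit(strsplit(query_string, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
query <- vapply(pairs, function(p) p[2], character(1))
names(query) <- vapply(pairs, function(p) p[1], character(1))

endpoint
query
```

The named vector query holds one element per key, which mirrors the key and value columns listed above.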

Example Web APIs

A large variety of web APIs provide data. Almost all reasonably large commercial websites offer Web APIs.

Todd Motto has compiled an expansive list of Public Web APIs on GitHub. Browse this list to see what data sources are available.

Wrapper packages

In R, it is easiest to access Web APIs through a wrapper package, an R package written specifically for a particular Web API.

  • The R development community has already contributed wrapper packages for most large Web APIs.
  • To find a wrapper package, search the web for “R package” and the name of the website.
  • rOpenSci also has a good collection of wrapper packages.

In our work with maps, we’ve used the tidycensus package to obtain census data to display on maps. tidycensus is a wrapper package that makes it easy to obtain desired census information:

tidycensus::get_acs(
    year = 2020,
    state = "MN",
    geography = "tract",
    variables = c("B01003_001", "B19013_001"),
    output = "wide",
    geometry = TRUE
)

Extra resources:

What is going on behind the scenes with get_acs()? Let’s look at accessing web APIs directly with URLs.





Accessing web APIs directly

Getting a Census API key

Many APIs require users to obtain a key to use their services.

  • This lets organizations keep track of what data is being used.
  • It also lets them rate limit their API so that programs don’t make too many requests per minute, hour, or day. Be aware that most APIs have rate limits, especially for their free tiers.

Navigate to https://api.census.gov/data/key_signup.html to obtain a Census API key:

  • Organization: Macalester College
  • Email: Your Mac email address

You will get the message:

Your request for a new API key has been successfully submitted. Please check your email. In a few minutes you should receive a message with instructions on how to activate your new key.

Check your email. Copy and paste your key into a new text file:

  • File > New File > Text File (towards the bottom of the menu)
  • Save as census_api_key.txt in the same folder as this .qmd.

Read in the key with the following code:

census_api_key <- readLines("census_api_key.txt")

Building a URL with httr2

We will use the httr2 package to build up a full URL from its parts because URLs need to be percent-encoded.

  • request() creates an API request object using the base URL
  • req_url_path_append() builds up the URL by adding path components separated by /
  • req_url_query() adds the ? separating the endpoint from the query and sets the key-value pairs in the query
    • The .multi argument controls how multiple values for a given key are combined.
    • The I() function around "state:*" inhibits parsing of special characters like : and *. (It’s known as the “as-is” function.)
    • The backticks around for are needed because for is a reserved word in R (for for-loops). You’ll need backticks whenever the key name has special characters (like spaces, dashes).
    • We can see from here that providing an API key is achieved with key=YOUR_API_KEY.
library(tidyverse) # for %>% and, later, mutate()
library(httr2)
req <- request("https://api.census.gov") %>%
    req_url_path_append("data") %>%
    req_url_path_append("2019") %>%
    req_url_path_append("acs") %>%
    req_url_path_append("acs1") %>%
    req_url_query(
        get = c("NAME", "B02015_009E", "B02015_009M"), `for` = I("state:*"),
        key = census_api_key, .multi = "comma"
    )

Why would we ever use these steps instead of just using the full URL as a string?

  • To generalize this code with functions!
  • To handle special characters
    • e.g., query parameters might have spaces, which need to be represented in a particular way in a URL (URLs can’t contain spaces)
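
The first point (generalizing with functions) can be sketched as below. The helper name build_acs_request() and its arguments are hypothetical, and "MY_KEY" is a placeholder rather than a real key. The request is only built, never sent. The base pipe |> is used so that httr2 is the only package needed.

```r
library(httr2)

# Hypothetical helper: build (but do not send) a request for ACS 1-year data
build_acs_request <- function(year, variables, key) {
    request("https://api.census.gov") |>
        req_url_path_append("data", as.character(year), "acs", "acs1") |>
        req_url_query(
            get = c("NAME", variables),
            `for` = I("state:*"),
            key = key,
            .multi = "comma"
        )
}

# Same structure as the request above, but for a different year
req_2021 <- build_acs_request(2021, "B02015_009E", key = "MY_KEY")
req_2021$url
```

Changing the year or the variables is now a matter of changing function arguments instead of editing a long URL string.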

Sending a request with httr2

Once we’ve fully constructed our request, we can use req_perform() to send out the API request and get a response.

resp <- req_perform(req)

Getting a response with httr2

We see from Content-Type that the format of the response is something called JSON. We can navigate to the request URL to see the structure of this output.

  • We can use resp_body_json() in httr2 to parse the JSON into a nicer R format.
    • This function uses jsonlite::fromJSON() behind the scenes.
    • Without simplifyVector = TRUE, the JSON is read in as a list.
resp_json_df <- resp %>% resp_body_json(simplifyVector = TRUE)

# Data Cleaning
resp_json_df <- janitor::row_to_names(resp_json_df, 1) %>% # Move 1st row to Names
  as.data.frame() %>% # Convert Matrix to Data Frame
  mutate(across(starts_with('B'), as.numeric)) # Convert all variables that start with B to numeric

head(resp_json_df)
           NAME B02015_009E B02015_009M state
1   Mississippi          NA          NA    28
2      Missouri         953        1141    29
3       Montana          NA          NA    30
4      Nebraska         412         477    31
5        Nevada         863         745    32
6 New Hampshire          NA          NA    33
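
To see what simplifyVector changes without calling the API, here is a small offline sketch. The JSON string is hand-written to mimic the array-of-arrays shape of the Census response (using two rows from the output above), and the cleaning step is redone in base R purely for illustration.

```r
library(jsonlite)

# Hand-written JSON in the same shape as the Census response:
# a header row followed by one row per state, with all values as strings
json_txt <- '[["NAME","B02015_009E","state"],
              ["Missouri","953","29"],
              ["Nebraska","412","31"]]'

as_list <- fromJSON(json_txt, simplifyVector = FALSE)   # nested list of lists
as_matrix <- fromJSON(json_txt, simplifyVector = TRUE)  # character matrix

# Base R version of the cleaning step: the first row becomes the column names
acs <- as.data.frame(as_matrix[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(acs) <- as_matrix[1, ]
acs$B02015_009E <- as.numeric(acs$B02015_009E)
acs
```

The matrix form is much closer to a data frame, which is why the code above asks for simplifyVector = TRUE.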

To learn more about JSON, consult the following readings:

  1. A Non-Programmer’s Introduction to JSON
  2. Getting Started With JSON and jsonlite
  3. Fetching JSON data from REST APIs

Exercises

Go to HW7_Part2.qmd. Work on those exercises to get practice building URLs to work with APIs.

Don’t forget to come back to review this activity to see more API examples below.





More API Examples

Board Game Geek & XML Data

The Board Game Geek API is referenced in the Games & Comics section of toddmotto’s public API list.

Our goal is to use the search API at the bottom of the page.

Let’s start at the top of the API documentation page to see how to navigate this reference.

  • We can see from the XML references at the top that we should expect a new output format: XML, which stands for Extensible Markup Language.
  • The “Root Path” section tells us the base URL for the Board Game Geek API endpoints and related APIs: https://boardgamegeek.com/xmlapi2/
  • The “Search” section at the bottom of the page tells us:
    • the path for the search endpoint (/search)
    • what query parameters are possible
    • particular formatting instructions for query parameter values

The following request searches for board games, board game accessories, and board game expansions with the words “mystery” and “curse” in the title:

req <- request("https://boardgamegeek.com/xmlapi2") %>% 
    req_url_path_append("search") %>% 
    req_url_query(query = I("mystery+curse"), type = I("boardgame,boardgameaccessory,boardgameexpansion"))

When we use req_perform(), we see from Content-Type that the format of the response is something called XML. We can navigate to the request URL to see the structure of this output.

  • XML (Extensible Markup Language) is a tree structure of named nodes and attributes.
  • We can use resp_body_xml() to read in the XML as an R object.
resp <- req_perform(req)
resp
resp <- resp_body_xml(resp)
resp
{xml_document}
<items total="9" termsofuse="https://boardgamegeek.com/xmlapi/termsofuse">
[1] <item type="boardgame" id="63495">\n  <name type="primary" value="Murder  ...
[2] <item type="boardgame" id="40175">\n  <name type="primary" value="Murder  ...
[3] <item type="boardgame" id="256211">\n  <name type="primary" value="Robins ...
[4] <item type="boardgame" id="238391">\n  <name type="primary" value="Robins ...
[5] <item type="boardgameaccessory" id="289457">\n  <name type="primary" valu ...
[6] <item type="boardgameaccessory" id="291036">\n  <name type="primary" valu ...
[7] <item type="boardgameaccessory" id="309435">\n  <name type="alternate" va ...
[8] <item type="boardgameexpansion" id="256211">\n  <name type="primary" valu ...
[9] <item type="boardgameexpansion" id="238391">\n  <name type="primary" valu ...

The XML output is not packaged in a nice way. (We’d love to have a data frame.) We can use the xml2 package to explore and navigate the XML structure to extract the information we need.

Let’s first use the xml_structure() function to see how information is organized:

xml_structure(resp)
<items [total, termsofuse]>
  <item [type, id]>
    <name [type, value]>
    <yearpublished [value]>
  <item [type, id]>
    <name [type, value]>
    <yearpublished [value]>
  <item [type, id]>
    <name [type, value]>
    <yearpublished [value]>
  <item [type, id]>
    <name [type, value]>
    <yearpublished [value]>
  <item [type, id]>
    <name [type, value]>
    <yearpublished [value]>
  <item [type, id]>
    <name [type, value]>
  <item [type, id]>
    <name [type, value]>
    <yearpublished [value]>
  <item [type, id]>
    <name [type, value]>
    <yearpublished [value]>
  <item [type, id]>
    <name [type, value]>
    <yearpublished [value]>

The key navigation and extraction functions in xml2 are:

  • xml_children(): Get nodes that are nested inside
    • Like getting the first level bullet points inside a given bullet point
  • xml_find_all(): Finds nodes matching an XPath expression (XPath stands for XML path)
    • XPath expressions are like string regular expressions for XML trees
    • See here for a deeper dive into XPath
  • xml_attr(): Selects the value of an attribute (the information to the right of the = in quotes)
    • <node_name attribute_name1="attribute_value1" attribute_name2="attribute_value2">
# Get the item nodes in 2 different ways
resp %>% xml_find_all("item")
{xml_nodeset (9)}
[1] <item type="boardgame" id="63495">\n  <name type="primary" value="Murder  ...
[2] <item type="boardgame" id="40175">\n  <name type="primary" value="Murder  ...
[3] <item type="boardgame" id="256211">\n  <name type="primary" value="Robins ...
[4] <item type="boardgame" id="238391">\n  <name type="primary" value="Robins ...
[5] <item type="boardgameaccessory" id="289457">\n  <name type="primary" valu ...
[6] <item type="boardgameaccessory" id="291036">\n  <name type="primary" valu ...
[7] <item type="boardgameaccessory" id="309435">\n  <name type="alternate" va ...
[8] <item type="boardgameexpansion" id="256211">\n  <name type="primary" valu ...
[9] <item type="boardgameexpansion" id="238391">\n  <name type="primary" valu ...
resp %>% xml_children()
{xml_nodeset (9)}
[1] <item type="boardgame" id="63495">\n  <name type="primary" value="Murder  ...
[2] <item type="boardgame" id="40175">\n  <name type="primary" value="Murder  ...
[3] <item type="boardgame" id="256211">\n  <name type="primary" value="Robins ...
[4] <item type="boardgame" id="238391">\n  <name type="primary" value="Robins ...
[5] <item type="boardgameaccessory" id="289457">\n  <name type="primary" valu ...
[6] <item type="boardgameaccessory" id="291036">\n  <name type="primary" valu ...
[7] <item type="boardgameaccessory" id="309435">\n  <name type="alternate" va ...
[8] <item type="boardgameexpansion" id="256211">\n  <name type="primary" valu ...
[9] <item type="boardgameexpansion" id="238391">\n  <name type="primary" valu ...
# Get the item "type"
resp %>% xml_find_all("item") %>% xml_attr("type")
[1] "boardgame"          "boardgame"          "boardgame"         
[4] "boardgame"          "boardgameaccessory" "boardgameaccessory"
[7] "boardgameaccessory" "boardgameexpansion" "boardgameexpansion"
# The <name> and <yearpublished> nodes are nested within each <item>
resp %>% xml_find_all("item/name")
{xml_nodeset (9)}
[1] <name type="primary" value="Murder Mystery Dinner Party: The Curse of the ...
[2] <name type="primary" value="Murder Mystery Evening: The Curse of the Mumm ...
[3] <name type="primary" value="Robinson Crusoe: Adventures on the Cursed Isl ...
[4] <name type="primary" value="Robinson Crusoe: Adventures on the Cursed Isl ...
[5] <name type="primary" value="Robinson Crusoe: Adventures on the Cursed Isl ...
[6] <name type="primary" value="Robinson Crusoe: Adventures on the Cursed Isl ...
[7] <name type="alternate" value="Robinson Crusoe: Adventures on the Cursed I ...
[8] <name type="primary" value="Robinson Crusoe: Adventures on the Cursed Isl ...
[9] <name type="primary" value="Robinson Crusoe: Adventures on the Cursed Isl ...
resp %>% xml_find_all("item/yearpublished") # Notice that this is length 8 instead of 9!
{xml_nodeset (8)}
[1] <yearpublished value="2008"/>
[2] <yearpublished value="2007"/>
[3] <yearpublished value="2018"/>
[4] <yearpublished value="2019"/>
[5] <yearpublished value="2019"/>
[6] <yearpublished value="2019"/>
[7] <yearpublished value="2018"/>
[8] <yearpublished value="2019"/>
# Get the "primary" or "alternate" designation for each name
resp %>% xml_find_all("item/name") %>% xml_attr("type")
[1] "primary"   "primary"   "primary"   "primary"   "primary"   "primary"  
[7] "alternate" "primary"   "primary"  
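
XPath expressions can also filter nodes with predicates in square brackets. The sketch below uses a hand-written, simplified sample (the name nodes are left out) with the same item structure as the search results, so it runs without calling the API.

```r
library(xml2)

# Hand-written, simplified sample: the third item has no <yearpublished> child
doc <- read_xml('<items total="3">
  <item type="boardgame" id="63495"><yearpublished value="2008"/></item>
  <item type="boardgame" id="40175"><yearpublished value="2007"/></item>
  <item type="boardgameaccessory" id="291036"/>
</items>')

# Predicate on an attribute: only items whose type is "boardgame"
boardgames <- xml_find_all(doc, "item[@type='boardgame']")
length(boardgames)

# Predicate on a child node: only items that have a <yearpublished> child
dated <- xml_find_all(doc, "item[yearpublished]")
xml_attr(dated, "type")
```

The second expression is one way to find out which items are responsible for the shorter yearpublished nodeset seen above.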

Exercises: (not part of Homework)

  1. Get the board game name (e.g., “Murder Mystery…” and “Robinson Crusoe”)
  2. Get the board game ID number (e.g., 63495, 40175)





New York Times API

This example will build on the New York Times Web API, which provides access to news articles, movie reviews, book reviews, and many other data.

We will specifically focus on the Article Search API, which finds information about news articles that contain a particular word or phrase.

To get started with the NY Times API, you must register and get an authentication key. Signup only takes a few seconds, and it lets the New York Times make sure nobody abuses their API for commercial purposes. It also rate limits their API and ensures programs don’t make too many requests per day. For the NY Times API, this limit is 1000 calls per day.

Once you have signed up and verified your email, log back in to https://developer.nytimes.com. Under your email address, click on Apps, create a new app (call it First API), enable the Article Search API, and press Save. This creates an authentication key, which is a 32 digit string with numbers and the letters a-e.

As with your Census API key, save this key in a .txt file, then read it in and store it in a variable called times_key.

times_key <- read_lines("nyt_api_key.txt")

Open this URL in your browser (you should replace MY_KEY with the API key you were given).

http://api.nytimes.com/svc/search/v2/articlesearch.json?q=gamergate&api-key=MY_KEY

The text you see in the browser is the response data (in JSON format).

This URL has the same structure that we discussed above for the census API:

  • http:// — The scheme, which tells your browser or program how to communicate with the web server. This will typically be either http: or https:.
  • api.nytimes.com — The hostname, which is a name that identifies the web server that will process the request.
  • /svc/search/v2/articlesearch.json — The path, which tells the web server what function you would like to call (a function for searching articles).
  • ?q=gamergate&api-key=MY_KEY — The query parameters, which provide the parameters for the function you would like to call. The key value pairs are the following:
key value
q gamergate
api-key MY_KEY

The scheme, hostname, and path (http://api.nytimes.com/svc/search/v2/articlesearch.json) together form the endpoint for the API call.

We can use the httr2 package to build up a full URL from its parts:

req <- request("http://api.nytimes.com") %>% 
    req_url_path_append("svc") %>% 
    req_url_path_append("search") %>% 
    req_url_path_append("v2") %>% 
    req_url_path_append("articlesearch.json") %>% 
    req_url_query(q = "gamergate", `api-key` = times_key)

We can write a function to build the request (including its URL) for a user-specified query:

create_nyt_url <- function(query, key) {
    request("http://api.nytimes.com") %>% 
        req_url_path_append("svc") %>% 
        req_url_path_append("search") %>% 
        req_url_path_append("v2") %>% 
        req_url_path_append("articlesearch.json") %>% 
        req_url_query(q = query, `api-key` = key)
}

Let’s use this function to find articles related to:

  • Ferris Bueller's Day Off (note the spaces and the apostrophe)
  • Penn & Teller (note the spaces and the punctuation mark &)

Let’s build the requests for these queries. httr2 percent-encodes the spaces and punctuation when it constructs each URL:

req_fb <- create_nyt_url(query = "Ferris Bueller's Day Off", key = times_key)
req_pt <- create_nyt_url(query = "Penn & Teller", key = times_key)
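
To look at the encoded URL itself, we can inspect the url element of a request object. The sketch below rebuilds the Penn & Teller request with a placeholder key ("MY_KEY", so that no real key is printed) and uses the base pipe |> so that httr2 is the only package needed.

```r
library(httr2)

# "MY_KEY" is a placeholder, not a working key; nothing is sent to the server
req_demo <- request("http://api.nytimes.com") |>
    req_url_path_append("svc", "search", "v2", "articlesearch.json") |>
    req_url_query(q = "Penn & Teller", `api-key` = "MY_KEY")

# The full URL that would be requested
req_demo$url
```

In the printed URL the spaces and the & inside the query value should appear as percent-encoded sequences (%20 and %26), while the & that separates q from api-key is left alone.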

We can use req_perform() to send out the request and resp_body_json() to parse the resulting JSON:

resp_pt <- req_pt %>% req_perform() %>% resp_body_json(simplifyVector = TRUE)

Exploring complex lists

resp_pt is a list. A list is a useful structure for storing elements of different types. Data frames are special cases of lists where each list element (column) has the same length, though the elements can have different classes.

Lists are a very flexible data structure but can be very confusing because list elements can be lists themselves!

We can explore the structure of a list in two ways:

  • Entering View(list_object) in the Console. The triangle buttons on the left allow you to toggle dropdowns to explore list elements.
  • Using the str() (structure) function.

Using base R subsetting, we can access elements of a list in three ways:

  • By position with double square brackets [[:
# This gets the first element of the list
resp_pt[[1]]
  • By name with double square brackets [[: (note that list elements are not always named, so this won’t always be possible)
# Accessing by name directly
resp_pt[["status"]]

# Accessing via a variable
which_element <- "status"
resp_pt[[which_element]]
  • By name with a dollar sign $: (Helpful tip: For this mode of access, RStudio allows tab completion to fill in the full name)
resp_pt$status

When lists are nested, we can retrieve inner elements by sequentially accessing the names from the outside in. For example, the meta element would be accessed as follows:

resp_pt$response$meta
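
The same access patterns can be tried on a small list without an API key. The list below is made up purely for illustration (its names and values are not from any API).

```r
# Made-up nested list, only to illustrate the access patterns
course <- list(
    title = "APIs - Part 2",
    tools = c("httr2", "xml2"),
    location = list(building = "Library", room = list(number = 101, seats = 30))
)

# Show only the top level of the structure
str(course, max.level = 1)

# Chaining $ and chaining [[ ]] reach the same nested element
a <- course$location$room$seats
b <- course[["location"]][["room"]][["seats"]]
identical(a, b)

# A data frame is itself a list of equal-length columns
df <- data.frame(x = 1:2, y = c("a", "b"))
is.list(df)
df[["y"]]
```

The max.level argument of str() is handy for API responses, where printing the full structure can run to hundreds of lines.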

Exercise: In the resp_pt object, retrieve the data associated with:

  • the copyright key
  • the number of hits (number of search results) within the meta object
  • the abstracts and leading paragraphs of the articles found in the search

Solutions

Board Game Geek API

Solution
req <- request("https://boardgamegeek.com/xmlapi2") %>% 
    req_url_path_append("search") %>% 
    req_url_query(query = I("mystery+curse"), type = I("boardgame,boardgameaccessory,boardgameexpansion"))

resp <- req_perform(req)  %>% resp_body_xml()
  1. Get the board game name (e.g., “Murder Mystery…” and “Robinson Crusoe”)
resp %>% xml_find_all("item/name") %>% xml_attr("value")
[1] "Murder Mystery Dinner Party: The Curse of the Green Lady"                             
[2] "Murder Mystery Evening: The Curse of the Mummy"                                       
[3] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Cards I"                   
[4] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Tales"                     
[5] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Tales Upgrade Pack"        
[6] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Tales – Insert Here Insert"
[7] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Tales: e-Raptor Insert"    
[8] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Cards I"                   
[9] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Tales"                     
  2. Get the board game ID number (e.g., 63495, 40175)
resp %>% xml_find_all("item") %>% xml_attr("id")
[1] "63495"  "40175"  "256211" "238391" "289457" "291036" "309435" "256211"
[9] "238391"
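
One possible way to combine these pieces into a data frame is sketched below. Because one item has no yearpublished node, xml_find_all("item/yearpublished") would return fewer values than there are items. xml_find_first() applied to the item nodes instead returns one result per item, with NA where the node is missing. The XML here is a hand-written three-item sample based on the output above, so the sketch runs without calling the API.

```r
library(xml2)

# Hand-written sample with the same structure as the search results;
# &#8211; is the XML character reference for an en dash
doc <- read_xml('<items total="3">
  <item type="boardgame" id="63495">
    <name type="primary" value="Murder Mystery Dinner Party: The Curse of the Green Lady"/>
    <yearpublished value="2008"/>
  </item>
  <item type="boardgame" id="40175">
    <name type="primary" value="Murder Mystery Evening: The Curse of the Mummy"/>
    <yearpublished value="2007"/>
  </item>
  <item type="boardgameaccessory" id="291036">
    <name type="primary" value="Robinson Crusoe: Adventures on the Cursed Island &#8211; Mystery Tales &#8211; Insert Here Insert"/>
  </item>
</items>')

items <- xml_find_all(doc, "item")

# One row per item; a missing <yearpublished> becomes NA instead of being dropped
games <- data.frame(
    id = xml_attr(items, "id"),
    type = xml_attr(items, "type"),
    name = xml_attr(xml_find_first(items, "name"), "value"),
    year = as.integer(xml_attr(xml_find_first(items, "yearpublished"), "value")),
    stringsAsFactors = FALSE
)
games
```

Working item by item keeps the columns aligned, which is not guaranteed when each column is extracted with a separate xml_find_all() call.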

New York Times API

Solution
resp_pt$copyright
resp_pt$response$meta$hits

resp_pt$response$docs$abstract
resp_pt$response$docs$lead_paragraph

# Both (abstract and leading paragraph) at once
resp_pt$response$docs[c("abstract", "lead_paragraph")]
resp_pt$response$docs %>%
    select(abstract, lead_paragraph)

After Class