tidycensus::get_acs(
year = 2020,
state = "MN",
geography = "tract",
variables = c("B01003_001", "B19013_001"),
output = "wide",
geometry = TRUE
)
14 APIs - Part 2
Settling In
Sit with your project group.
You can download a template Quarto file to start from here. Put this file in a folder called data_acquisition within a folder for this course.
Announcements
- When Data Acquisition / Cleaning takes a lot of computational power:
- Put eval=FALSE in the computationally heavy code chunks in a qmd file (for a skills challenge)
- OR Put the code in a cleaning.R file (for the project)
- After running the computationally heavy cleaning code, save the data locally with write_csv(object_name, file = 'cleandata.csv') and read it back in with read_csv('cleandata.csv') for your analysis.
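One way to wire the save-and-reload step together is to run the expensive cleaning only when the saved file is missing. This is a minimal sketch, assuming the readr package is installed; clean_raw_data() is a hypothetical stand-in for your own slow cleaning code, its values are made up, and the file goes to a temporary folder just for this demo.

```r
library(readr)

# Hypothetical stand-in for a slow acquisition/cleaning step (toy values)
clean_raw_data <- function() {
  data.frame(state = c("MN", "WI"), value = c(1, 2))
}

cache_file <- file.path(tempdir(), "cleandata.csv")

if (!file.exists(cache_file)) {
  # Only pay the computational cost when no saved copy exists
  object_name <- clean_raw_data()
  write_csv(object_name, file = cache_file)
}

# Later analysis starts from the saved copy
clean_data <- read_csv(cache_file, show_col_types = FALSE)
```

In a project you would point cache_file at 'cleandata.csv' in your project folder, and delete that file whenever the cleaning code changes.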
Data Storytelling Moment
Learning goals
After this lesson, you should be able to:
- Explain what an API is
- Set up an API key for a public API
- Develop comfort in using a URL-method of calling a web API
- Recognize the structure in a URL for a web API and adjust for your purposes
- Explore and subset complex nested lists
APIs
API stands for Application Programming Interface. The term describes a general class of tools that allow computer software, rather than humans, to interact with an organization’s data.
- Application refers to software.
- Interface can be thought of as a contract of service between two applications.
- This contract defines how the two communicate with each other using requests and responses.
Every API has documentation for how software developers should structure requests for data / information and in what format to expect responses.
Web APIs
Web APIs, or Web Application Programming Interfaces, focus on transmitting requests and responses for raw data over the web.
- Our browsers communicate with web servers using a technology called HTTP or Hypertext Transfer Protocol.
- Programming languages such as R can also use HTTP to communicate with web servers.
URL
Every document on the Web has a unique address, known as a Uniform Resource Locator (URL).
Every URL has the same general structure. Let’s look at this example:
https://api.census.gov/data/2019/acs/acs1?get=NAME,B02015_009E,B02015_009M&for=state:*
- https://api.census.gov: This is the base URL.
  - https://: The scheme, which tells your browser or program how to communicate with the web server. This will typically be either http: or https:.
  - api.census.gov: The hostname or host address, which is a name that identifies the web server that will process the request.
- data/2019/acs/acs1: The file path, which tells the web server how to get to the desired resource.
- ?get=NAME,B02015_009E,B02015_009M&for=state:*: The query string or query parameters, which provide the parameters for the function you would like to call.
  - This is a string of key-value pairs separated by &. That is, the general structure of this part is key1=value1&key2=value2.
key | value |
---|---|
get | NAME,B02015_009E,B02015_009M |
for | state:* |
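To check this decomposition yourself, you can ask R to split the URL into the same pieces. This is a small sketch, assuming the httr2 package is installed; it uses httr2's url_parse() and does not contact the server.

```r
library(httr2)

url <- "https://api.census.gov/data/2019/acs/acs1?get=NAME,B02015_009E,B02015_009M&for=state:*"

# Split the URL into its components without sending any request
parts <- url_parse(url)

parts$scheme    # the scheme
parts$hostname  # the hostname
parts$path      # the file path
parts$query     # the query string as a named list of key-value pairs
```

The names of parts$query are the keys from the table above (get and for), and the list elements are the corresponding values.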
Example Web APIs
A large variety of web APIs provide data. Almost all reasonably large commercial websites offer Web APIs.
Todd Motto has compiled an expansive list of Public Web APIs on GitHub. Browse this list to see what data sources are available.
Wrapper packages
In R, it is easiest to access Web APIs through a wrapper package, an R package written specifically for a particular Web API.
- The R development community has already contributed wrapper packages for most large Web APIs.
- To find a wrapper package, search the web for “R package” and the name of the website. For example:
- Searching for “R Reddit package” returns RedditExtractor
- Searching for “R Weather.com package” returns weatherData
- rOpenSci also has a good collection of wrapper packages.
In our work with maps, we’ve used the tidycensus package to obtain census data to display on maps. tidycensus is a wrapper package that makes it easy to obtain desired census information through functions such as get_acs().
Extra resources:
- tidycensus: wrapper package that provides an interface to a few census datasets with map geometry included
  - Full documentation is available at https://walker-data.com/tidycensus/
- censusapi: wrapper package that offers an interface to all census datasets
  - Full documentation is available at https://www.hrecht.com/censusapi/
What is going on behind the scenes with get_acs()? Let’s look at accessing web APIs directly with URLs.
Accessing web APIs directly
Getting a Census API key
Many APIs require users to obtain a key to use their services.
- This lets organizations keep track of what data is being used.
- It also lets them rate limit their API so that programs don’t make too many requests per minute, hour, or day. Be aware that most APIs do have rate limits, especially for their free tiers.
Navigate to https://api.census.gov/data/key_signup.html to obtain a Census API key:
- Organization: Macalester College
- Email: Your Mac email address
You will get the message:
Your request for a new API key has been successfully submitted. Please check your email. In a few minutes you should receive a message with instructions on how to activate your new key.
Check your email. Copy and paste your key into a new text file:
- File > New File > Text File (towards the bottom of the menu)
- Save as census_api_key.txt in the same folder as this .qmd.
Read in the key with the following code:
census_api_key <- readLines("census_api_key.txt")
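A key pasted from an email can pick up stray spaces or a trailing blank line, which would then become part of every request. Here is a small defensive sketch in base R; read_api_key() is a hypothetical helper name, not part of any package, and the demo writes a fake key to a temporary file rather than using a real one.

```r
# Hypothetical helper: read an API key from a one-line text file
read_api_key <- function(path) {
  if (!file.exists(path)) {
    stop("Key file not found: ", path)
  }
  # warn = FALSE avoids a warning when the file lacks a final newline
  key <- trimws(readLines(path, warn = FALSE))
  key <- key[nzchar(key)]  # drop blank lines
  if (length(key) != 1) {
    stop("Expected exactly one non-empty line in ", path)
  }
  key
}

# Demo with a fake key in a temporary file
demo_file <- tempfile(fileext = ".txt")
writeLines(c("  not-a-real-key  ", ""), demo_file)
demo_key <- read_api_key(demo_file)
```

In this activity the call would be read_api_key("census_api_key.txt"), with the result stored in census_api_key.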
Building a URL with httr2
We will use the httr2 package to build up a full URL from its parts, because URLs need to be percent encoded.
- request() creates an API request object using the base URL
- req_url_path_append() builds up the URL by adding path components separated by /
- req_url_query() adds the ? separating the endpoint from the query and sets the key-value pairs in the query
  - The .multi argument controls how multiple values for a given key are combined.
  - The I() function around "state:*" inhibits parsing of special characters like : and *. (It’s known as the “as-is” function.)
  - The backticks around for are needed because for is a reserved word in R (for for-loops). You’ll need backticks whenever the key name has special characters (like spaces, dashes).
  - We can see from here that providing an API key is achieved with key=YOUR_API_KEY.
library(tidyverse) # for the %>% pipe and the data cleaning functions used later
library(httr2)
req <- request("https://api.census.gov") %>%
  req_url_path_append("data") %>%
  req_url_path_append("2019") %>%
  req_url_path_append("acs") %>%
  req_url_path_append("acs1") %>%
  req_url_query(get = c("NAME", "B02015_009E", "B02015_009M"), `for` = I("state:*"), key = census_api_key, .multi = "comma")
Why would we ever use these steps instead of just using the full URL as a string?
- To generalize this code with functions!
- To handle special characters
- e.g., query parameters might have spaces, which need to be represented in a particular way in a URL (URLs can’t contain spaces)
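The first point, generalizing with functions, could look like the sketch below. build_acs_request() is a hypothetical helper written for illustration (it is not part of httr2 or tidycensus), the key is a placeholder, and nothing is sent to the server. The base R pipe |> is used so that the block runs with only httr2 loaded.

```r
library(httr2)

# Hypothetical helper: build an ACS 1-year request for any year and variables
build_acs_request <- function(year, variables, key, geography = "state:*") {
  request("https://api.census.gov") |>
    req_url_path_append("data", as.character(year), "acs", "acs1") |>
    req_url_query(
      get = c("NAME", variables),  # several values for one key...
      `for` = I(geography),        # ...and an as-is value with : and *
      key = key,
      .multi = "comma"             # combine the get values with commas
    )
}

req_2019 <- build_acs_request(2019, c("B02015_009E", "B02015_009M"), key = "YOUR_API_KEY")

# Inspect the URL that would be requested (no request has been made yet)
req_2019$url
```

Changing the year or the variable codes now takes one function call instead of editing a long URL string by hand.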
Sending a request with httr2
Once we’ve fully constructed our request, we can use req_perform() to send out the API request and get a response.
resp <- req_perform(req)
Getting a response with httr2
We see from Content-Type that the format of the response is something called JSON. We can navigate to the request URL to see the structure of this output.
- We can use resp_body_json() in httr2 to parse the JSON into a nicer R format.
  - This function uses fromJSON() behind the scenes.
  - Without simplifyVector = TRUE, the JSON is read in as a list.
resp_json_df <- resp %>% resp_body_json(simplifyVector = TRUE)
# Data Cleaning
resp_json_df <- janitor::row_to_names(resp_json_df, 1) %>% # Move 1st row to Names
  as.data.frame() %>% # Convert Matrix to Data Frame
  mutate(across(starts_with('B'), as.numeric)) # Convert all variables that start with B to numeric
head(resp_json_df)
NAME B02015_009E B02015_009M state
1 Mississippi NA NA 28
2 Missouri 953 1141 29
3 Montana NA NA 30
4 Nebraska 412 477 31
5 Nevada 863 745 32
6 New Hampshire NA NA 33
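To see what simplifyVector changes, it can help to parse a tiny piece of JSON by hand. The sketch below calls jsonlite's fromJSON() directly on a hand-typed excerpt in the same shape as the Census response (two of the rows shown above); the base R cleaning at the end is an alternative to janitor::row_to_names(), not the code used in this activity.

```r
library(jsonlite)

# Hand-typed excerpt: an array of arrays whose first row holds the column names
json_txt <- '[["NAME","B02015_009E","B02015_009M","state"],
              ["Missouri","953","1141","29"],
              ["Nebraska","412","477","31"]]'

as_list   <- fromJSON(json_txt, simplifyVector = FALSE)  # nested list
as_matrix <- fromJSON(json_txt, simplifyVector = TRUE)   # character matrix

# Base R version of "move the first row to the names"
df <- as.data.frame(as_matrix[-1, , drop = FALSE])
names(df) <- as_matrix[1, ]

# Everything arrives as text, so the estimate columns need converting
df$B02015_009E <- as.numeric(df$B02015_009E)
df$B02015_009M <- as.numeric(df$B02015_009M)
df
```

Running str(as_list) and str(as_matrix) side by side shows why the simplified version is easier to turn into a data frame.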
To learn more about JSON, consult the following readings:
- A Non-Programmer’s Introduction to JSON
- Getting Started With JSON and jsonlite
- Fetching JSON data from REST APIs
Exercises
Go to HW7_Part2.qmd. Work on those exercises to get practice building URLs to work with APIs.
Don’t forget to come back to review this activity to see more API examples below.
More API Examples
Board Game Geek & XML Data
The Board Game Geek API is referenced in the Games & Comics section of toddmotto’s public API list.
Our goal is to use the search API at the bottom of the page.
Let’s start at the top of the API documentation page to see how to navigate this reference.
- We can see from the XML references at the top that we will be expecting a new output format: XML stands for Extensible Markup Language
- The “Root Path” section tells us the base URL for the Board Game Geeks API endpoints and related APIs: https://boardgamegeek.com/xmlapi2/
- The “Search” section at the bottom of the page tells us:
  - the path for the search endpoint (/search)
  - what query parameters are possible
  - particular formatting instructions for query parameter values
The following request searches for board games, board game accessories, and board game expansions with the words “mystery” and “curse” in the title:
req <- request("https://boardgamegeek.com/xmlapi2") %>%
  req_url_path_append("search") %>%
  req_url_query(query = I("mystery+curse"), type = I("boardgame,boardgameaccessory,boardgameexpansion"))
When we use req_perform(), we see from Content-Type that the format of the response is something called XML. We can navigate to the request URL to see the structure of this output.
- XML (Extensible Markup Language) is a tree structure of named nodes and attributes.
- We can use resp_body_xml() to read in the XML as an R object.
resp <- req_perform(req)
resp
resp <- resp_body_xml(resp)
resp
{xml_document}
<items total="9" termsofuse="https://boardgamegeek.com/xmlapi/termsofuse">
[1] <item type="boardgame" id="63495">\n <name type="primary" value="Murder ...
[2] <item type="boardgame" id="40175">\n <name type="primary" value="Murder ...
[3] <item type="boardgame" id="256211">\n <name type="primary" value="Robins ...
[4] <item type="boardgame" id="238391">\n <name type="primary" value="Robins ...
[5] <item type="boardgameaccessory" id="289457">\n <name type="primary" valu ...
[6] <item type="boardgameaccessory" id="291036">\n <name type="primary" valu ...
[7] <item type="boardgameaccessory" id="309435">\n <name type="alternate" va ...
[8] <item type="boardgameexpansion" id="256211">\n <name type="primary" valu ...
[9] <item type="boardgameexpansion" id="238391">\n <name type="primary" valu ...
The XML output is not packaged in a nice way. (We’d love to have a data frame.) We can use the xml2 package to explore and navigate the XML structure to extract the information we need.
Let’s first use the xml_structure() function to see how information is organized:
xml_structure(resp)
<items [total, termsofuse]>
<item [type, id]>
<name [type, value]>
<yearpublished [value]>
<item [type, id]>
<name [type, value]>
<yearpublished [value]>
<item [type, id]>
<name [type, value]>
<yearpublished [value]>
<item [type, id]>
<name [type, value]>
<yearpublished [value]>
<item [type, id]>
<name [type, value]>
<yearpublished [value]>
<item [type, id]>
<name [type, value]>
<item [type, id]>
<name [type, value]>
<yearpublished [value]>
<item [type, id]>
<name [type, value]>
<yearpublished [value]>
<item [type, id]>
<name [type, value]>
<yearpublished [value]>
The key navigation and extraction functions in xml2 are:
- xml_children(): Get nodes that are nested inside
  - Like getting the first level bullet points inside a given bullet point
- xml_find_all(): Finds nodes matching an XPath expression (XPath stands for XML Path Language)
  - XPath expressions are like string regular expressions for XML trees
  - See here for a deeper dive into XPath
- xml_attr(): Selects the value of an attribute (the information to the right of the = in quotes)
  - <node_name attribute_name1="attribute_value1" attribute_name2="attribute_value2">
# Get the item nodes in 2 different ways
resp %>% xml_find_all("item")
{xml_nodeset (9)}
[1] <item type="boardgame" id="63495">\n <name type="primary" value="Murder ...
[2] <item type="boardgame" id="40175">\n <name type="primary" value="Murder ...
[3] <item type="boardgame" id="256211">\n <name type="primary" value="Robins ...
[4] <item type="boardgame" id="238391">\n <name type="primary" value="Robins ...
[5] <item type="boardgameaccessory" id="289457">\n <name type="primary" valu ...
[6] <item type="boardgameaccessory" id="291036">\n <name type="primary" valu ...
[7] <item type="boardgameaccessory" id="309435">\n <name type="alternate" va ...
[8] <item type="boardgameexpansion" id="256211">\n <name type="primary" valu ...
[9] <item type="boardgameexpansion" id="238391">\n <name type="primary" valu ...
resp %>% xml_children()
{xml_nodeset (9)}
[1] <item type="boardgame" id="63495">\n <name type="primary" value="Murder ...
[2] <item type="boardgame" id="40175">\n <name type="primary" value="Murder ...
[3] <item type="boardgame" id="256211">\n <name type="primary" value="Robins ...
[4] <item type="boardgame" id="238391">\n <name type="primary" value="Robins ...
[5] <item type="boardgameaccessory" id="289457">\n <name type="primary" valu ...
[6] <item type="boardgameaccessory" id="291036">\n <name type="primary" valu ...
[7] <item type="boardgameaccessory" id="309435">\n <name type="alternate" va ...
[8] <item type="boardgameexpansion" id="256211">\n <name type="primary" valu ...
[9] <item type="boardgameexpansion" id="238391">\n <name type="primary" valu ...
# Get the item "type"
resp %>% xml_find_all("item") %>% xml_attr("type")
[1] "boardgame" "boardgame" "boardgame"
[4] "boardgame" "boardgameaccessory" "boardgameaccessory"
[7] "boardgameaccessory" "boardgameexpansion" "boardgameexpansion"
# The <name> and <yearpublished> nodes are nested within each <item>
resp %>% xml_find_all("item/name")
{xml_nodeset (9)}
[1] <name type="primary" value="Murder Mystery Dinner Party: The Curse of the ...
[2] <name type="primary" value="Murder Mystery Evening: The Curse of the Mumm ...
[3] <name type="primary" value="Robinson Crusoe: Adventures on the Cursed Isl ...
[4] <name type="primary" value="Robinson Crusoe: Adventures on the Cursed Isl ...
[5] <name type="primary" value="Robinson Crusoe: Adventures on the Cursed Isl ...
[6] <name type="primary" value="Robinson Crusoe: Adventures on the Cursed Isl ...
[7] <name type="alternate" value="Robinson Crusoe: Adventures on the Cursed I ...
[8] <name type="primary" value="Robinson Crusoe: Adventures on the Cursed Isl ...
[9] <name type="primary" value="Robinson Crusoe: Adventures on the Cursed Isl ...
resp %>% xml_find_all("item/yearpublished") # Notice that this is length 8 instead of 9!
{xml_nodeset (8)}
[1] <yearpublished value="2008"/>
[2] <yearpublished value="2007"/>
[3] <yearpublished value="2018"/>
[4] <yearpublished value="2019"/>
[5] <yearpublished value="2019"/>
[6] <yearpublished value="2019"/>
[7] <yearpublished value="2018"/>
[8] <yearpublished value="2019"/>
# Get the "primary" or "alternate" designation for each name
resp %>% xml_find_all("item/name") %>% xml_attr("type")
[1] "primary" "primary" "primary" "primary" "primary" "primary"
[7] "alternate" "primary" "primary"
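The length 8 versus 9 mismatch matters as soon as we try to put these pieces into one data frame. One possible way around it is to search within each item using xml2's xml_find_first(), which returns one result per item and a missing node where nothing matches. The sketch below runs on a hand-typed, trimmed-down excerpt of the search result (name nodes omitted), not on a live response.

```r
library(xml2)

# Hand-typed excerpt: the third item has no <yearpublished> node
xml_txt <- '<items total="3">
  <item type="boardgame" id="63495">
    <yearpublished value="2008"/>
  </item>
  <item type="boardgame" id="40175">
    <yearpublished value="2007"/>
  </item>
  <item type="boardgameaccessory" id="291036">
  </item>
</items>'

doc <- read_xml(xml_txt)
items <- xml_find_all(doc, "item")

# One year per item, NA where the node is absent, so the columns stay aligned
games <- data.frame(
  type = xml_attr(items, "type"),
  year = as.integer(xml_attr(xml_find_first(items, "yearpublished"), "value"))
)
games
```

The same pattern would apply to resp from above: find the item nodes once, then pull each attribute relative to those nodes.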
Exercises: (not part of Homework)
- Get the board game name (e.g., “Murder Mystery…” and “Robinson Crusoe”)
- Get the board game ID number (e.g., 63495, 40175)
New York Times API
This example will build on the New York Times Web API, which provides access to news articles, movie reviews, book reviews, and many other data.
We will specifically focus on the Article Search API, which finds information about news articles that contain a particular word or phrase.
To get started with the NY Times API, you must register and get an authentication key. Signup only takes a few seconds, and it lets the New York Times make sure nobody abuses their API for commercial purposes. It also rate limits their API and ensures programs don’t make too many requests per day. For the NY Times API, this limit is 1000 calls per day.
Once you have signed up and verified your email, log back in to https://developer.nytimes.com. Under your email address, click on Apps and Create a new App (call it First API) and enable Article Search API, then press Save. This creates an authentication key, which is a 32 digit string with numbers and the letters a-e.
As with your census API key, save this key in a .txt file, read it in, and store it in a variable called times_key.
times_key <- read_lines("nyt_api_key.txt")
Open this URL in your browser (you should replace MY_KEY with the API key you were given).
http://api.nytimes.com/svc/search/v2/articlesearch.json?q=gamergate&api-key=MY_KEY
The text you see in the browser is the response data (in JSON format).
This URL has the same structure that we discussed above for the census API:
- http://: The scheme, which tells your browser or program how to communicate with the web server. This will typically be either http: or https:.
- api.nytimes.com: The hostname, which is a name that identifies the web server that will process the request.
- /svc/search/v2/articlesearch.json: The path, which tells the web server what function you would like to call (a function for searching articles).
- ?q=gamergate&api-key=MY_KEY: The query parameters, which provide the parameters for the function you would like to call. The key-value pairs are the following:
key | value |
---|---|
q | gamergate |
api-key | MY_KEY |
The scheme, hostname, and path (http://api.nytimes.com/svc/search/v2/articlesearch.json) together form the endpoint for the API call.
We can use the httr2 package to build up a full URL from its parts:
req <- request("http://api.nytimes.com") %>%
  req_url_path_append("svc") %>%
  req_url_path_append("search") %>%
  req_url_path_append("v2") %>%
  req_url_path_append("articlesearch.json") %>%
  req_url_query(q = "gamergate", `api-key` = times_key)
We can write a function to generate the URL for a user-specified query:
create_nyt_url <- function(query, key) {
  request("http://api.nytimes.com") %>%
    req_url_path_append("svc") %>%
    req_url_path_append("search") %>%
    req_url_path_append("v2") %>%
    req_url_path_append("articlesearch.json") %>%
    req_url_query(q = query, `api-key` = key)
}
Let’s use this function to find articles related to:
- Ferris Bueller's Day Off (note the spaces and the apostrophe)
- Penn & Teller (note the spaces and the punctuation mark &)
Let’s see how these queries are translated into the URLs:
req_fb <- create_nyt_url(query = "Ferris Bueller's Day Off", key = times_key)
req_pt <- create_nyt_url(query = "Penn & Teller", key = times_key)
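To make the translation visible, the sketch below percent encodes the two search phrases with base R's URLencode() and then inspects the URL stored in an httr2 request. It uses the placeholder MY_KEY rather than a real key and sends nothing; the base R pipe |> is used so that only httr2 needs to be loaded.

```r
library(httr2)

# Base R view of percent encoding: spaces, apostrophes, and & are replaced
enc_fb <- utils::URLencode("Ferris Bueller's Day Off", reserved = TRUE)
enc_pt <- utils::URLencode("Penn & Teller", reserved = TRUE)
enc_fb
enc_pt

# The same idea through httr2, with a placeholder key
req_demo <- request("http://api.nytimes.com") |>
  req_url_path_append("svc", "search", "v2", "articlesearch.json") |>
  req_url_query(q = "Penn & Teller", `api-key` = "MY_KEY")

# The & inside the search phrase is encoded, so it cannot be mistaken
# for the & that separates key-value pairs
req_demo$url
```

With your own key loaded, req_fb$url and req_pt$url show the same thing for the requests created above.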
We can use req_perform() to send out the request and resp_body_json() to parse the resulting JSON:
resp_pt <- req_pt %>% req_perform() %>% resp_body_json(simplifyVector = TRUE)
Exploring complex lists
resp_pt is a list. A list is a useful structure for storing elements of different types. Data frames are special cases of lists where each list element has the same length (but where the list elements may have different classes).
Lists are a very flexible data structure but can be very confusing because list elements can be lists themselves!
We can explore the structure of a list in two ways:
- Entering View(list_object) in the Console. The triangle buttons on the left allow you to toggle dropdowns to explore list elements.
- Using the str() (structure) function.
Using base R subsetting, we can access elements of a list in three ways:
- By position with double square brackets [[:
# This gets the first element of the list
resp_pt[[1]]
- By name with double square brackets [[: (note that list elements are not always named, so this won’t always be possible)
# Accessing by name directly
resp_pt[["status"]]
# Accessing via a variable
which_element <- "status"
resp_pt[[which_element]]
- By name with a dollar sign $: (Helpful tip: For this mode of access, RStudio allows tab completion to fill in the full name)
resp_pt$status
We can retrieve nested elements by sequentially accessing the object keys from the outside in. For example, the meta element would be accessed as follows:
resp_pt$response$meta
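If you don’t have an API key handy, you can practice these tools on a small hand-built list. The object below is a made-up stand-in that only mimics the nesting of resp_pt; every value in it is a placeholder, not real NY Times data.

```r
# Made-up stand-in with nesting similar to resp_pt (placeholder values)
mock_resp <- list(
  status = "OK",
  copyright = "placeholder copyright text",
  response = list(
    docs = data.frame(
      abstract = c("first abstract", "second abstract"),
      lead_paragraph = c("first paragraph", "second paragraph")
    ),
    meta = list(hits = 2L)
  )
)

# Overview of the structure, limited to two levels deep
str(mock_resp, max.level = 2)

# Element names at the top level and one level down
names(mock_resp)
names(mock_resp$response)

# The three access styles reach the same element
by_position <- mock_resp[[1]]
by_name     <- mock_resp[["status"]]
by_dollar   <- mock_resp$status
```

Calling names() at each level is a quick way to decide which key to type next when working from the outside in.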
Exercise: In the resp_pt object, retrieve the data associated with:
- the copyright key
- the number of hits (number of search results) within the meta object
- the abstracts and leading paragraphs of the articles found in the search
Solutions
Board Game Geek API
Solution
req <- request("https://boardgamegeek.com/xmlapi2") %>%
  req_url_path_append("search") %>%
  req_url_query(query = I("mystery+curse"), type = I("boardgame,boardgameaccessory,boardgameexpansion"))
resp <- req_perform(req) %>% resp_body_xml()
- Get the board game name (e.g., “Murder Mystery…” and “Robinson Crusoe”)
resp %>% xml_find_all("item/name") %>% xml_attr("value")
[1] "Murder Mystery Dinner Party: The Curse of the Green Lady"
[2] "Murder Mystery Evening: The Curse of the Mummy"
[3] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Cards I"
[4] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Tales"
[5] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Tales Upgrade Pack"
[6] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Tales – Insert Here Insert"
[7] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Tales: e-Raptor Insert"
[8] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Cards I"
[9] "Robinson Crusoe: Adventures on the Cursed Island – Mystery Tales"
- Get the board game ID number (e.g., 63495, 40175)
resp %>% xml_find_all("item") %>% xml_attr("id")
[1] "63495" "40175" "256211" "238391" "289457" "291036" "309435" "256211"
[9] "238391"
New York Times API
Solution
resp_pt$copyright
resp_pt$response$meta$hits
resp_pt$response$docs$abstract
resp_pt$response$docs$lead_paragraph
# Both (abstract and leading paragraph) at once
resp_pt$response$docs[c("abstract", "lead_paragraph")]
resp_pt$response$docs %>%
  select(abstract, lead_paragraph)
After Class
- Take a look at the Schedule page to see how to prepare for the next class
- Work on finishing Homework 7.