Homework 6: Wrangling + More TidyTuesday

NOTE: Exercises 1–6 are required. There are 2 additional optional exercises at the end. These will not be graded but are highly recommended practice.


Kiva

Exercise 1: Kiva partners

Kiva is a non-profit that allows people from around the world to lend small amounts to others to start or grow a business, go to school, access clean energy, etc. Since its founding in 2005, more than $1.2 billion in loans to over 3 million borrowers have been funded. In the next few exercises, we’ll examine some Kiva data from 2005-2012. To begin, let’s explore data on Kiva’s field partners. These partners act as intermediaries between Kiva (the lenders) and borrowers. They evaluate borrower risk, post loan requests on Kiva, and process payments. Load data on the field partners below. A codebook with variable descriptions is here

# Load the tidyverse
library(tidyverse)

# Load the data
partners <- read_csv("https://mac-stat.github.io/data/kiva_partners2.csv")

Part a

Let’s get to know the data.

# Calculate the lowest, median, and highest total amount raised by any partner


# Identify the 6 partners that have raised the highest total amount
# Show just the partner names, countries, and total amount raised


# Show the names of the partners in Bolivia
# (Don't include any other variables, just the partner names)
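
The three tasks above rely on three common patterns: summarizing a column, sorting and keeping the top rows, and filtering rows while selecting columns. Here is a minimal sketch of those patterns on made-up data. The table `toy` and its columns (`name`, `country`, `raised`) are hypothetical and are not the variable names in partners, so check the codebook before adapting it.

```r
library(dplyr)

# Hypothetical toy data (NOT the Kiva partners data)
toy <- tibble(
  name    = c("P1", "P2", "P3", "P4"),
  country = c("X", "Y", "X", "Z"),
  raised  = c(500, 1200, 300, 900)
)

# Pattern 1: lowest, median, and highest value of one column
toy_range <- toy %>%
  summarize(
    lowest  = min(raised),
    middle  = median(raised),
    highest = max(raised)
  )

# Pattern 2: sort from high to low, keep a few columns, keep the top 2 rows
toy_top <- toy %>%
  arrange(desc(raised)) %>%
  select(name, raised) %>%
  head(2)

# Pattern 3: keep rows that match a condition, then keep one column
toy_x <- toy %>%
  filter(country == "X") %>%
  select(name)

toy_range
toy_top
toy_x
```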

Part b

Create a new table with only five columns:

  • countries.region
  • total_partners = total number of partners per region
  • total_loans = total number of loans posted per region
  • total_amount = total amount raised per region
  • average_loan = average loan size per loan posted in each region (calculated as total amount raised per region / total number of loans posted per region)

Print the entire table, sorted from high to low with respect to total_amount raised. NOTE:

  • Your table should have 7 rows and 5 columns.
  • Your first row should have countries.region = Asia, total_partners = 40, total_loans = 133060, total_amount = 84816225, average_loan = 637.
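
A table like this is typically built by grouping, summarizing, and then computing a ratio of the group totals. Below is a minimal sketch on made-up data. The table `toy` and its columns (`region`, `loans`, `amount`) are hypothetical stand-ins, not the actual columns in partners.

```r
library(dplyr)

# Hypothetical toy data: one row per partner (NOT the Kiva partners data)
toy <- tibble(
  region = c("A", "A", "B"),
  loans  = c(10, 30, 5),
  amount = c(1000, 5000, 400)
)

toy_summary <- toy %>%
  group_by(region) %>%
  summarize(
    total_partners = n(),          # number of rows (partners) per region
    total_loans    = sum(loans),
    total_amount   = sum(amount)
  ) %>%
  # Ratio of the group totals, computed after summarizing
  mutate(average_loan = total_amount / total_loans) %>%
  arrange(desc(total_amount))

toy_summary
```

Note that dividing the group totals is generally not the same as averaging each partner's own average loan size: the first weights every loan equally, while the second weights every partner equally.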

Part c

Identify two things that you learned from the table in Part b. Just pick whatever you found most interesting.

Part d

Draw a map that includes a dot for each of Kiva’s partners. Color each dot according to the total amount raised by that partner. NOTE: It’s easier to do this on a static map than with leaflet.

library(rnaturalearth)
library(mosaic)

# Get a background map of the entire world
# world_boundaries <- ___

# Plot the partner locations on the background map
# ggplot(___) + 
#   geom___() + 
#   geom_point(
#     data = ___,
#     aes(x = ___, y = ___, color = ___)
#   ) +
#   theme_map()






Exercise 2: Kiva loans (Part 1)

The loans data contains information on a sample of 10,000 individual loans to borrowers:

# a random sample of 10,000 loans
loans <- read_csv("https://mac-stat.github.io/data/kiva_loans_small.csv")

View the loans table, browse through some of the data, and check out the codebook. Before working with this data, we have to do some pre-processing / wrangling. Take the following steps and store the results as loans_2.

  1. Only keep the loans that have a positive funded_amount.

  2. The information about when a loan request was posted is separated out into different fields for year, month, day, hour, etc. Combine some of this information into a single variable that records the exact posting time. Do this in three steps. NOTE: You’ll get some warning messages, but these are not errors.

    • Define a new variable which records the exact date a loan request was posted by pasting together the year, month, and day of the request, separated by hyphens: post_dt = paste(posted_yr, posted_mo, posted_day, sep = '-')
    • Define a new variable which records the exact time of day a loan request was posted by pasting together the hour, minute, and second of the request, separated by colons:
      post_time = paste(posted_hr, posted_min, posted_sec, sep = ':')
    • Define a new variable which combines the exact date and time of day a loan request was posted:
      post_date = ymd_hms(paste(post_dt, post_time, sep = ' '))
  3. Similar to post_date, define a new variable called fund_date that reports the exact date and time at which each loan was funded (not posted).

  4. Define a new variable called days_to_fund = difftime(fund_date, post_date, units = "days") which records the number of days between the time a loan was posted and the time it was funded.

  5. Get information about the countries.region for the partner of each loan from the partners dataset. NOTE: partners and loans have many variable names in common. To avoid conflicts, simplify the partners dataset to just 2 columns before connecting it to loans.

  6. Keep only the following columns: loan_id, status, funded_amount, paid_amount, sector, countries.region, location.country, lat, lon, partner_id, post_date, fund_date, days_to_fund
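
The date-building, time-difference, and join steps above can be sketched on made-up data as follows. Every table and column name here (`toy_loans`, `toy_shops`, `shop_id`, `funded_text`, and so on) is hypothetical. In particular, the real fund_date has to be pasted together from its parts just like post_date, and the key that links loans to partners must be looked up in the codebook.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data (NOT the Kiva loans): the posting time is split into
# parts, and the funding time is already stored as text for brevity
toy_loans <- tibble(
  id      = c(1, 2),
  shop_id = c(10, 20),
  yr = c(2010, 2011), mo = c(3, 12), dy = c(7, 25),
  hr = c(9, 23), mi = c(5, 59), sc = c(0, 30),
  funded_text = c("2010-03-09 21:05:00", "2011-12-26 23:59:30")
)

# Hypothetical lookup table with an extra column we do not want to carry over
toy_shops <- tibble(
  shop_id = c(10, 20),
  region  = c("North", "South"),
  notes   = c("x", "y")
)

toy_loans_2 <- toy_loans %>%
  mutate(
    dt      = paste(yr, mo, dy, sep = "-"),        # e.g. "2010-3-7"
    tm      = paste(hr, mi, sc, sep = ":"),        # e.g. "9:5:0"
    posted  = ymd_hms(paste(dt, tm, sep = " ")),   # parse into a date-time
    funded  = ymd_hms(funded_text),
    waiting = difftime(funded, posted, units = "days")
  ) %>%
  # Shrink the lookup table to the key plus the one column we need
  left_join(toy_shops %>% select(shop_id, region), by = "shop_id") %>%
  select(id, region, posted, funded, waiting)

toy_loans_2
dim(toy_loans_2)
```

Trimming the lookup table before joining keeps same-named columns in the two tables from colliding and being given suffixes in the result.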

# Define loans_2


# Confirm that loans_2 has 9884 rows and 13 columns





Exercise 3: Kiva loans (Part 2)

Part a

# Show the top 5 countries by number of loans


# Show the top 5 countries by total funded loan amount

Part b

Plot the mean loan size in each sector (y-axis) vs the number of loans in each sector (x-axis). Represent each sector by its name (text), not a point. HINT: You’ll have to calculate the mean loan size and number of loans in each sector before you can plot them.
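
As a rough illustration of the hint, the sketch below first summarizes made-up data by group and then plots each group as a text label with geom_text(). The table `toy` and its columns (`group`, `size`) are hypothetical, not the columns in loans_2.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per item (NOT the Kiva loans)
toy <- tibble(
  group = c("a", "a", "b", "c", "c", "c"),
  size  = c(100, 300, 50, 10, 20, 30)
)

# Step 1: one row per group, with a count and a mean
toy_summary <- toy %>%
  group_by(group) %>%
  summarize(n_items = n(), mean_size = mean(size))

# Step 2: plot the summaries, using the group name as the plotting symbol
p <- ggplot(toy_summary, aes(x = n_items, y = mean_size, label = group)) +
  geom_text()

p
```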





TidyTuesday

As with Homework 3, you will pick a TidyTuesday dataset and do a quick analysis. There are several goals:

  1. Practice generating your own research questions.
  2. Practice identifying what viz and wrangling tools are useful for addressing your questions.
  3. Hone your visualization and wrangling skills. Be creative while also maintaining the integrity of the graph.
  4. Get a sense of the broader data science community. Check out what people share out on X / Twitter using the #TidyTuesday hashtag. Maybe even share your own #TidyTuesday work on social media.

NOTE: Though you’re encouraged to work with others, all code and words must be your own.



Exercise 4: Data

Part a

Go to TidyTuesday. Pick a dataset that was posted in October 2024 and that you did not use in Homework 3. Here, include:

  • A short (2–4 sentence) written description of your data. This should include:
    • the original data source (where did TidyTuesday get the data from?)
    • units of observation (what are you analyzing?)
    • data size (how many data points do you have? how many variables are measured on each data point?)
    • a discussion of any wrangling you needed to do before analyzing this data
  • Code to import the data and support the facts you cited in your short written description.

Part b

Write a clear research question related to your data here. This question must be nuanced enough to require analysis of the data.

Research Question:





Exercise 5: Viz

Part a

Construct 2 visualizations that address the research question you identified above. Directions:

  • Do not include more than 2 viz – editing is a skill! Though you can only include 2 viz here, you should / will need to make several viz before finalizing your selection.
  • At least 1 of your visualizations must incorporate 3 or more variables.
  • Each viz must…
    • have meaningful axis labels and legend titles
    • have a figure caption (fig.cap)
    • use alt text (fig.alt)
    • use a color-blind friendly color palette
  • Challenge yourself! Remember that growth is a learning goal in this course.

Part b

Write a brief (2-4 sentence) summary of what you learn from the viz. This should connect back to your research question and be “professional” – pay attention to spelling, punctuation, grammar, capitalization, etc.

Discussion:





Exercise 6: Wrangling

Part a

Use your wrangling tools to help address 3 follow-up questions from your viz that further address your research question. Directions:

  • You should have 3 sets of wrangling code. Each set should be preceded by a comment (#) which includes the follow-up question it addresses.
  • At least 2 sets of wrangling code should demonstrate the combination of multiple wrangling verbs, hence span multiple lines.
  • All sets of wrangling code must incorporate tools we’ve learned in the data wrangling unit (activities 8 and up).

Part b

Write a brief (2-4 sentence) summary of what you learn from the wrangling results. This should connect back to your research question and be “professional” – pay attention to spelling, punctuation, grammar, capitalization, etc.





Optional exercises

NOTE: The exercises below won’t be graded, but they are strongly recommended as additional practice for the quiz.

OPTIONAL exercise: funding time

How many days does it take borrowers to get a loan funded? Let’s explore the days_to_fund variable in loans_2.

Part a

Construct a univariate visualization of days_to_fund.

Part b

Construct a plot of days_to_fund (y-axis) vs funded_amount (x-axis) for each loan in loans_2.

Part c

Construct a plot of days_to_fund (y-axis) vs funded_amount (x-axis) in each countries.region. Don’t represent each loan on this plot. Rather, include curves to represent the trends in each region.

Part d

Summarize, in words, some takeaway messages from parts a-c about how long it takes to get a loan funded.





OPTIONAL exercise: loan status

The status variable in loans_2 indicates the status of a loan:

# Check out a table summary after you've defined loans_2
# loans_2 %>% 
#   count(status)

Let’s focus here on the patterns in loans that were paid back and those that were defaulted, i.e. not paid back.

Part a

Define a new dataset, defaults, that only includes loans that were either defaulted or paid.

# Define defaults

# Confirm that defaults has 7010 rows and 13 columns

Part b

Using defaults, construct a visualization of the relationship between the funded amount and status of a loan.

Part c

Define a new dataset with four columns:

  • partner_id
  • number of defaulted loans through that partner
  • number of loans completely paid back through that partner
  • percentage of loans defaulted

Sort your table from highest default percentage to lowest, and print out only those with at least a 50% default percentage. HINT: You’ll have to reshape the data in this process.
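
One possible shape for the reshaping step in the hint is to count the rows in each group-by-outcome combination and then spread the outcomes into their own columns with pivot_wider(). The sketch below does this on made-up data. The table `toy` and its columns (`shop`, `outcome`) are hypothetical and are not the Kiva variables.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per order (NOT the Kiva loans)
toy <- tibble(
  shop    = c(1, 1, 1, 2, 2, 3),
  outcome = c("late", "late", "on_time", "late", "late", "on_time")
)

toy_wide <- toy %>%
  count(shop, outcome) %>%
  # One column per outcome; combinations that never occur become 0
  pivot_wider(names_from = outcome, values_from = n, values_fill = 0) %>%
  mutate(pct_late = 100 * late / (late + on_time)) %>%
  arrange(desc(pct_late)) %>%
  filter(pct_late >= 50)

toy_wide
```

Without values_fill = 0, a shop with no rows for one of the outcomes would get NA in that column, and its percentage would be NA as well.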

Part d

Provide some take-away messages about loan defaults using your results in parts a-c.





Render this file and submit the .html file to Moodle. This must include all code you used, as well as the output from that code.