Topic 11 Mini-Project

Learning Goals

Apply data wrangling and visualization skills to a new data set

You can download a template .Rmd of this activity here.

Data: Kiva

Kiva is a non-profit that allows people from around the world to lend small amounts to others to start or grow a business, go to school, access clean energy, etc. Since its founding in 2005, more than $1.2 billion in loans to over 3 million borrowers have been funded. In this activity, we’ll examine some lending data from 2005-2012.

Investigate the `partners` Data Table

Kiva has field partners who generally act as intermediaries between borrowers and Kiva (lenders). They evaluate borrower risk, post loan requests on Kiva, and process payments. The following command loads a table with data on Kiva’s field partners.

partners <- read_csv("https://bcheggeseth.github.io/112_fall_2022/data/kiva_partners2.csv")

Examine the codebook for the partners table.

Exercise 11.1 (Regional distribution) Make a summary table with only five columns: region (countries.region), total number of partners, total number of loans posted, total amount raised, and average loan size per loan posted.

The four columns after region should all be on a per region basis; for example, the row for Central America should include average loan size per loan posted in Central America. Sort your table by total amount raised.

Exercise 11.2 (Partners in Africa) Draw a map of all of the partners in Africa, with the size of the dot corresponding to the amount raised.

Investigate the `loans` Data Table

The loans table below contains information about individual loans to borrowers.

# a random sample of 10,000
loans <- read_csv("https://bcheggeseth.github.io/112_fall_2022/data/kiva_loans_small.csv")

Examine the codebook for the loans table. View the loans table and browse through some of the data (e.g., different sectors, uses, countries)

Exercise 11.3 (Preprocessing) Consider the loans data and perform the following preprocessing steps and save the new data as loans2.

Remove all rows where the funded amount is 0.
Note that the date information about when a loan request was posted is separated out into different fields for year, month, day, hour, etc. It would be more helpful to have a single field for the exact time at which each loan was posted. We’ll do this in three steps. First, create (mutate) a new column by pasting together the year, month, and date, separated by hyphens: post_dt=paste(posted_yr, posted_mo, posted_day, sep='-'). Second, create a new column with the time: post_time=paste(posted_hr,posted_min,posted_sec, sep=':'). Third, using the ymd_hms command from lubridate, add a new column with the full date (including time of day):
post_date=ymd_hms(paste(post_dt,post_time,sep=' ')).
Repeat the previous part to add a column funded_date with the exact time at which each loan was funded.
Create a new table called loans2 that only contains the following columns from loans: loan_id,status,funded_amount,paid_amount,sector,location.country,lat,lon,partner_id,post_date,funded_date

Exercise 11.4 (Loans by country) Consider the loans2 table.

Find the top 5 countries by number of loans.
Find the top 5 countries by total funded loan amount.

Exercise 11.5 (Sector analysis) Make a scatterplot with the number of loans in each sector on the x-axis and the average loan size in each sector on the y-axis. Instead of using points, use text with each sector’s name as the glyph.

Hint: start by wrangling the data into glyph-ready form, with one row corresponding to one glyph.

Putting them together

Exercise 11.6 (Join practice) Join the countries.region variable from the partners table onto the loans2 table, in order to have a region associated with each loan. Save the output table as loans3.

Exercise 11.7 (Defaults) Consider the following:

Starting with loans3, create a new table DefaultData that only includes those loans whose status is either “defaulted” or “paid”.
Make a density plot of the funded amount, split by laon status (“defaulted” or “paid”). That is, your plot should have two separate density curves on top of each other. Interpret the plot.
Make a bar chart with 10 bars, such that each bar is a $1000 range of funding amount; that is, the first bar should be for $0-$1000, the second for $1001-$2000, and so forth. The bars should all have height equal to 1, but be filled with two colors: one for the percentage of loans in that interval bin that defaulted and one for the percentage that were paid back in full. Interpret your graphic.
Starting with the data in DefaultData from part (a), make a table with four columns: partner_id, number of defaulted loans through that partner, number of loans completely paid back through that partner, and percentage of loans defaulted (the second column divided by the sum of the second and third columns). Sort your table from highest default percentage to lowest, and print out only those with at least a 10% default percentage. Hint: start by filtering out partners that have not had any defaulted loans.

Exercise 11.8 (Funding time) Consider the following:

Use the command days_to_fund = difftime(funded_date, post_date, units="days") within mutate() to add a column to the loans3 data that has the number of days between the time a loan was posted and the time it was funded. Save your new table as loans4.
The days_to_fund should always be positive, but there are some negative values, most likely due to data entry errors. Filter these out of the data table loans4.
Make a bar chart with days to fund by region. Reorder the x-axis so that the regions are in order of ascending days to fund.
Make a bar chart with days to fund by sector. Reorder the x-axis so that the sectors are in order of ascending days to fund.
Make a scatterplot with funded_amount on the x-axis and days_to_fund on the y-axis. Color the points by region and set the transparency to 0.2. Add linear trendlines for each region. What are the trends?

Exercise 11.9 (Your Own Question) Now that you are more familiar with the data, investigate your own research question and address it with a summary table or a data visualization.