Topic 11 Mini-Project
Learning Goals
- Apply data wrangling and visualization skills to a new data set
You can download a template .Rmd of this activity here.
Data: Kiva
Kiva is a non-profit that allows people from around the world to lend small amounts to others to start or grow a business, go to school, access clean energy, etc. Since its founding in 2005, more than $1.2 billion in loans to over 3 million borrowers have been funded. In this activity, we’ll examine some lending data from 2005-2012.
Investigate the partners
Data Table
Kiva has field partners who generally act as intermediaries between borrowers and Kiva (lenders). They evaluate borrower risk, post loan requests on Kiva, and process payments. The following command loads a table with data on Kiva’s field partners.
<- read_csv("https://bcheggeseth.github.io/112_fall_2022/data/kiva_partners2.csv") partners
Examine the codebook for the partners
table.
Exercise 11.1 (Regional distribution) Make a summary table with only five columns: region (countries.region
), total number of partners, total number of loans posted, total amount raised, and average loan size per loan posted.
The four columns after region should all be on a per region basis; for example, the row for Central America should include average loan size per loan posted in Central America. Sort your table by total amount raised.
Exercise 11.2 (Partners in Africa) Draw a map of all of the partners in Africa, with the size of the dot corresponding to the amount raised.
Investigate the loans
Data Table
The loans
table below contains information about individual loans to borrowers.
# a random sample of 10,000
<- read_csv("https://bcheggeseth.github.io/112_fall_2022/data/kiva_loans_small.csv") loans
Examine the codebook for the loans
table. View the loans
table and browse through some of the data (e.g., different sectors, uses, countries)
Exercise 11.3 (Preprocessing) Consider the loans
data and perform the following preprocessing steps and save the new data as loans2
.
- Remove all rows where the funded amount is 0.
- Note that the date information about when a loan request was posted is separated out into different fields for year, month, day, hour, etc. It would be more helpful to have a single field for the exact time at which each loan was posted. We’ll do this in three steps. First, create (mutate) a new column by pasting together the year, month, and date, separated by hyphens:
post_dt=paste(posted_yr, posted_mo, posted_day, sep='-')
. Second, create a new column with the time:post_time=paste(posted_hr,posted_min,posted_sec, sep=':')
. Third, using theymd_hms
command fromlubridate
, add a new column with the full date (including time of day):
post_date=ymd_hms(paste(post_dt,post_time,sep=' '))
.
- Repeat the previous part to add a column
funded_date
with the exact time at which each loan was funded.
- Create a new table called
loans2
that only contains the following columns fromloans
: loan_id,status,funded_amount,paid_amount,sector,location.country,lat,lon,partner_id,post_date,funded_date
Exercise 11.4 (Loans by country) Consider the loans2
table.
- Find the top 5 countries by number of loans.
- Find the top 5 countries by total funded loan amount.
Exercise 11.5 (Sector analysis) Make a scatterplot with the number of loans in each sector on the x-axis and the average loan size in each sector on the y-axis. Instead of using points, use text with each sector’s name as the glyph.
Hint: start by wrangling the data into glyph-ready form, with one row corresponding to one glyph.
Putting them together
Exercise 11.6 (Join practice) Join the countries.region
variable from the partners
table onto the loans2
table, in order to have a region associated with each loan. Save the output table as loans3
.
Exercise 11.7 (Defaults) Consider the following:
- Starting with
loans3
, create a new tableDefaultData
that only includes those loans whose status is either “defaulted” or “paid”.
- Make a density plot of the funded amount, split by laon status (“defaulted” or “paid”). That is, your plot should have two separate density curves on top of each other. Interpret the plot.
- Make a bar chart with 10 bars, such that each bar is a $1000 range of funding amount; that is, the first bar should be for $0-$1000, the second for $1001-$2000, and so forth. The bars should all have height equal to 1, but be filled with two colors: one for the percentage of loans in that interval bin that defaulted and one for the percentage that were paid back in full. Interpret your graphic.
- Starting with the data in
DefaultData
from part (a), make a table with four columns: partner_id, number of defaulted loans through that partner, number of loans completely paid back through that partner, and percentage of loans defaulted (the second column divided by the sum of the second and third columns). Sort your table from highest default percentage to lowest, and print out only those with at least a 10% default percentage. Hint: start by filtering out partners that have not had any defaulted loans.
Exercise 11.8 (Funding time) Consider the following:
- Use the command
days_to_fund = difftime(funded_date, post_date, units="days")
withinmutate()
to add a column to theloans3
data that has the number of days between the time a loan was posted and the time it was funded. Save your new table asloans4
.
- The
days_to_fund
should always be positive, but there are some negative values, most likely due to data entry errors. Filter these out of the data tableloans4
.
- Make a bar chart with days to fund by region. Reorder the x-axis so that the regions are in order of ascending days to fund.
- Make a bar chart with days to fund by sector. Reorder the x-axis so that the sectors are in order of ascending days to fund.
- Make a scatterplot with
funded_amount
on the x-axis anddays_to_fund
on the y-axis. Color the points by region and set the transparency to 0.2. Add linear trendlines for each region. What are the trends?
Exercise 11.9 (Your Own Question) Now that you are more familiar with the data, investigate your own research question and address it with a summary table or a data visualization.