# Example 1: For loop with no storage container; append new value to existing vector
<- integer()
x <- bench::mark(
results1 for (i in 1:1e5) {
<- c(x, i)
x
}
)
# Example 2: For loop with a storage container!
<- vector('integer', length = 1e5)
x <- bench::mark(
results2 for (i in 1:1e5) {
<- i
x[i]
}
)
# Example 3: purrr::map()
<- bench::mark(purrr::map_int(1:1e5, ~.x))
results3
# Example 4: Standard R Function/Operator
<- bench::mark(seq(1,1e5)) results4
12 Loops and iteration - Part 2
Settling In
Data Storytelling Moment
Learning goals
After this lesson, you should be able to:
- Use the
map()
family of functions in thepurrr
package to handle repeated tasks that create output, instead offor
loops - Use the
walk()
family of functions in thepurrr
package to handle repeated tasks that need side-effects, instead offor
loops
Iteration review
Let’s write a for
loop that stores the class()
(type) of every column in the mtcars
data frame.
. . .
Yes, str()
does this plus more for you. Let’s consider replicating this task through low-level iteration.
. . .
Recall that:
vector("numeric/logical/character/list", length)
creates a storage container.mtcars[[1]]
accesses the first element/column of the data frame.
. . .
- First, write out the steps in words.
- Then translate each step to code.
Iteration with purrr
purrr
is a tidyverse
package that provides several useful functions for iteration.
Open up the purrr
cheatsheet.
. . .
The main advantages of purrr
include:
- Improved readability of R code
- Reduction in the “overhead” in writing a
for
loop- These functions create storage containers
- These functions do the
for (i in ...)
efficiently
. . .
Computational Time and Memory Allocation
Compare the four ways to create a vector of integers from 1 to 100,000.
# Compile Results
bind_rows(results1, results2, results3, results4) %>%
mutate(type = c("for_loop_no_storage", "for_loop_with_storage", "purrr_map", "standard_func")) %>%
select(type, mem_alloc, total_time) %>% print()
# A tibble: 4 × 3
type mem_alloc total_time
<chr> <bch:byt> <bch:tm>
1 for_loop_no_storage 18.6GB 20.8s
2 for_loop_with_storage 402.2KB 484.3ms
3 purrr_map 397.5KB 537.9ms
4 standard_func 0B 23ms
Iteration with map
In purrr
, we can use the family of map()
functions to iteratively apply a function to each element of a list or vector and save its output.
. . .
- Reminder: Data frames/tibbles are named lists so the elements are the “columns” so
map()
can iterate over columns of a dataset.
. . .
Let’s look at the purrr
cheatsheet to look at graphical representations of how these functions work.
map()
returns a listmap_chr()
returns a character vectormap_lgl()
returns a logical vectormap_int()
returns an integer vectormap_dbl()
returns a numeric vectormap_vec()
returns a vector of a different (non-atomic) type (like dates)
. . .
Let’s return to our original task to replicate str()
.
To get the class()
of each data frame column, map_chr()
is sensible because classes are strings:
map_chr(mtcars, class)
mpg cyl disp hp drat wt qsec vs
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
am gear carb
"numeric" "numeric" "numeric"
. . .
Let’s get the class of each variable in diamonds
:
map_chr(diamonds, class)
Error in `map_chr()`:
ℹ In index: 2.
ℹ With name: cut.
Caused by error:
! Result must be length 1, not 2.
. . .
What happened!?
map_chr()
was expecting to create a character vector with one element per element (column) in diamonds
. But something happened in column 2 with the cut
variable.
. . .
Let’s figure out what happened:
class(diamonds$cut)
[1] "ordered" "factor"
Ah! cut
has 2 classes. In this case, map()
(which returns a list) is the best option because some variables have multiple classes:
map(diamonds, class)
$carat
[1] "numeric"
$cut
[1] "ordered" "factor"
$color
[1] "ordered" "factor"
$clarity
[1] "ordered" "factor"
$depth
[1] "numeric"
$table
[1] "numeric"
$price
[1] "integer"
$x
[1] "numeric"
$y
[1] "numeric"
$z
[1] "numeric"
The error we encountered with map_chr()
is a nice feature of purrr
because it allows us to be very sure of the type of output we are getting.
. . .
Failing loudly is vastly preferable to getting unexpected outputs silently
. . .
Recall that we explored syntax and functions for selecting variables in a data frame via the tidy-select documentation for across
. (?dplyr_tidy_select
in the Console.)
We can also combine map_*()
functions with tidy selection for some powerful variable summaries that require much less code than for
loops.
# Mean of each numeric variable
map_dbl(diamonds %>% select(where(is.numeric)), mean)
carat depth table price x y
0.7979397 61.7494049 57.4571839 3932.7997219 5.7311572 5.7345260
z
3.5387338
# Number of unique values in each non-numeric variable
map_int(diamonds %>% select(!where(is.numeric)), n_distinct)
cut color clarity
5 7 8
Exercises
- We want to compute several summary statistics on each quantitative variable in a data frame and organize the results in a new data frame (rows = variables, columns = summary statistics).
- Write a function called
summ_stats()
that takes a numeric vectorx
as input and returns the mean, median, standard deviation, and IQR as a data frame. You can usetibble()
to create the data frame.- Example:
tibble(a = 1:2, b = 2:3)
creates a data frame with variablesa
andb
.
- Example:
- Use a
map*()
function frompurrr
to get the summary statistics for each quantitative variable indiamonds
. - Look up the
bind_rows()
documentation fromdplyr
to combine summary statistics for all quantitative variables into one data frame.- Note: You’ll notice that the variable names are not present in the output. Try to figure out a way to add variable names back in with
mutate()
andcolnames()
.
- Note: You’ll notice that the variable names are not present in the output. Try to figure out a way to add variable names back in with
Once you’ve done that with map()
, write a for
loop to achieve the same result. Which do you prefer in terms of ease of code writing and readability?
Iteration with map2
or pmap
If you need to iterate over the rows of a data frame, we’ll need other versions of map
().
. . .
purrr
also offers map2()
and the pmap()
family of functions that take multiple inputs and loops over them simultaneously.
Let’s look at the purrr
cheatsheet to look at graphical representations of how these functions work.
. . .
Let’s look at this code to randomly simulate data from Normal distributions.
<- tibble(
args n = c(100, 100, 100, 100), # this is n
mean = c(0, 1, 2, 3), # this is mean
sd = c(4, 3, 2, 1) # this is sd
)
# save randomly generated data in tibble with simulation parameters
<- args %>%
sim_data mutate(rand_data = pmap(args, rnorm))
sim_data
# A tibble: 4 × 4
n mean sd rand_data
<dbl> <dbl> <dbl> <list>
1 100 0 4 <dbl [100]>
2 100 1 3 <dbl [100]>
3 100 2 2 <dbl [100]>
4 100 3 1 <dbl [100]>
Note how the column names in args
exactly match the argument names in rnorm()
.
The iteration that is happening is across rows, and the multiple arguments in rnorm()
are being matched by name.
. . .
To access the information in that rand_data
list, we need to unnest()
this list! It converts/unfolds the information into a numeric vector (repeating the information in mean
, sd
, n
)
<- sim_data %>%
unnested_sim unnest(rand_data) # convert the rand_data list to numeric vector
%>%
unnested_sim ggplot(aes(x = rand_data, group = mean, fill = factor(mean))) +
geom_density(alpha = .5)
. . .
We can also use pmap()
to specify variations in some arguments but leave some arguments constant across the iterations:
<- tibble(
string_data pattern = c("p", "n", "h"),
replacement = c("P", "N", "H")
)
pmap_chr(string_data, str_replace_all, string = "ppp nnn hhh")
[1] "PPP nnn hhh" "ppp NNN hhh" "ppp nnn HHH"
Exercises
- Create 2 small examples that show how
pmap()
works withstr_sub()
. Your examples should:
- Use different arguments for
string
,start
, andend
- Use different arguments for
start
andend
but a fixedstring
- Last class we worked on an extended exercise where our goal was to write a series of functions and a
for
loop to repeat linear model fitting under different “settings” (removal of outliers, model formula choice).
Repeat this exercise using pmap()
.
- You’ll need to use the
df_arg_combos
object, yourremove_outliers()
function, and yourfit_model()
function. - Review the solutions below for the necessary parts and make sure you understand each part.
- Then, try to put it all together using
pmap()
.
<- diamonds %>% select(carat, cut, clarity, color) %>% names()
xs
<- crossing(
df_arg_combos mod_formula = str_c("price ~ ", map_chr(seq_along(xs), ~ str_c(xs[1:.x], collapse = " + ")) ),
remove_outliers = c(TRUE, FALSE)
) df_arg_combos
# A tibble: 8 × 2
mod_formula remove_outliers
<chr> <lgl>
1 price ~ carat FALSE
2 price ~ carat TRUE
3 price ~ carat + cut FALSE
4 price ~ carat + cut TRUE
5 price ~ carat + cut + clarity FALSE
6 price ~ carat + cut + clarity TRUE
7 price ~ carat + cut + clarity + color FALSE
8 price ~ carat + cut + clarity + color TRUE
<- function(data, what_var, sd_thresh) {
remove_outliers %>%
data mutate(zscore = ({{ what_var }} - mean({{ what_var }}, na.rm = TRUE))/sd({{ what_var }}, na.rm = TRUE)) %>%
filter(zscore <= sd_thresh)
}
<- function(data, mod_formula, remove_outliers) {
fit_model if (remove_outliers) {
<- remove_outliers(data, what_var = carat, sd_thresh = 3)
data_clean else {
} <- data
data_clean
}
lm(mod_formula, data = data_clean)
}
# use pmap()
Iteration with walk
If we don’t care about saving an output for each iteration, we can use the family of walk()
functions (walk
, walk2
, and pwalk
) to apply a function to each element of a list or vector with no output.
Let’s say, we wanted to visualize the relationship between carat
and price
by clarity.
If we wanted a fully stratified (fully separated) analysis, we could create one visualization per clarity category and save it separately.
First, let’s create separate datasets that are still contained in one tibble.
# Let's create separate datasets within one tibble
<- diamonds %>%
by_clarity group_nest(clarity)
by_clarity
# A tibble: 8 × 2
clarity data
<ord> <list<tibble[,9]>>
1 I1 [741 × 9]
2 SI2 [9,194 × 9]
3 SI1 [13,065 × 9]
4 VS2 [12,258 × 9]
5 VS1 [8,171 × 9]
6 VVS2 [5,066 × 9]
7 VVS1 [3,655 × 9]
8 IF [1,790 × 9]
# Create a variable with a image path
<- by_clarity %>%
by_clarity mutate(path = str_glue("diamonds-{clarity}.png"))
by_clarity
# A tibble: 8 × 3
clarity data path
<ord> <list<tibble[,9]>> <glue>
1 I1 [741 × 9] diamonds-I1.png
2 SI2 [9,194 × 9] diamonds-SI2.png
3 SI1 [13,065 × 9] diamonds-SI1.png
4 VS2 [12,258 × 9] diamonds-VS2.png
5 VS1 [8,171 × 9] diamonds-VS1.png
6 VVS2 [5,066 × 9] diamonds-VVS2.png
7 VVS1 [3,655 × 9] diamonds-VVS1.png
8 IF [1,790 × 9] diamonds-IF.png
# plot for making a scatterplot
<- function(df) {
carat_scatter ggplot(df, aes(x = carat, y = price)) + geom_point()
}
# Create a "variable" with each plots
<- by_clarity %>%
by_clarity mutate(plot = map(data, carat_scatter)) # iterate over data to create ggplot output
# Iterate over paths and plots to save the export images
walk2(
$path,
by_clarity$plot,
by_clarityfunction(path, plot){ ggsave(path, plot, width = 6, height = 6)}
)
Additional Resources
If you are interested in having a deeper understanding of the purrr
package, I recommend you check out:
- Functionals Chapter in Advanced R by Hadley Wickham
Solutions
Iteration Review
- .
Solution
<- vector("character", ncol(mtcars))
col_classes
# Data frames are **lists** of columns, so this loop iterates over the columns
for (i in seq_along(mtcars)) {
<- class(mtcars[[i]])
col_classes[i] }
Iteration with map
- .
Solution
# Function to summarize
<- function(x) {
summ_stats tibble(
mean = mean(x, na.rm = TRUE),
median = median(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE),
iqr = IQR(x, na.rm = TRUE)
)
}# Data Subset: Only Quantitative Variables
<- diamonds %>% select(where(is.numeric))
diamonds_num
# Option 1: map + bind_rows() + add variable names
map(diamonds_num, summ_stats) %>%
bind_rows() %>%
mutate(variable = colnames(diamonds_num))
# A tibble: 7 × 5
mean median sd iqr variable
<dbl> <dbl> <dbl> <dbl> <chr>
1 0.798 0.7 0.474 0.64 carat
2 61.7 61.8 1.43 1.5 depth
3 57.5 57 2.23 3 table
4 3933. 2401 3989. 4374. price
5 5.73 5.7 1.12 1.83 x
6 5.73 5.71 1.14 1.82 y
7 3.54 3.53 0.706 1.13 z
# Option 2: map + set variable names + bind_rows()
map(diamonds_num, summ_stats) %>%
set_names(colnames(diamonds_num)) %>%
bind_rows(.id = "variable")
# A tibble: 7 × 5
variable mean median sd iqr
<chr> <dbl> <dbl> <dbl> <dbl>
1 carat 0.798 0.7 0.474 0.64
2 depth 61.7 61.8 1.43 1.5
3 table 57.5 57 2.23 3
4 price 3933. 2401 3989. 4374.
5 x 5.73 5.7 1.12 1.83
6 y 5.73 5.71 1.14 1.82
7 z 3.54 3.53 0.706 1.13
# Compare with across():
%>%
diamonds summarize(across(where(is.numeric),
list(mean = ~ mean(.x, na.rm = TRUE), # list of functions
median = ~ median(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE),
iqr = ~ IQR(.x, na.rm = TRUE)
%>% pivot_longer(everything(), names_to = 'var',values_to = 'value') %>%
))) separate_wider_delim(var,delim = '_', names = c('variable','summary')) %>%
pivot_wider(names_from = 'summary',values_from = 'value')
# A tibble: 7 × 5
variable mean median sd iqr
<chr> <dbl> <dbl> <dbl> <dbl>
1 carat 0.798 0.7 0.474 0.64
2 depth 61.7 61.8 1.43 1.5
3 table 57.5 57 2.23 3
4 price 3933. 2401 3989. 4374.
5 x 5.73 5.7 1.12 1.83
6 y 5.73 5.71 1.14 1.82
7 z 3.54 3.53 0.706 1.13
# with a for loop
<- vector("list", length = ncol(diamonds_num))
diamonds_summ_stats2
for (i in seq_along(diamonds_num)) {
<- summ_stats(diamonds_num[[i]])
diamonds_summ_stats2[[i]]
}%>%
diamonds_summ_stats2 set_names(colnames(diamonds_num)) %>%
bind_rows(.id = "variable")
# A tibble: 7 × 5
variable mean median sd iqr
<chr> <dbl> <dbl> <dbl> <dbl>
1 carat 0.798 0.7 0.474 0.64
2 depth 61.7 61.8 1.43 1.5
3 table 57.5 57 2.23 3
4 price 3933. 2401 3989. 4374.
5 x 5.73 5.7 1.12 1.83
6 y 5.73 5.71 1.14 1.82
7 z 3.54 3.53 0.706 1.13
Iteration with pmap
- .
Solution
<- tibble(
string_data string = c("apple", "banana", "cherry"),
start = c(1, 2, 4),
end = c(2, 3, 5)
)
pmap_chr(string_data, str_sub)
[1] "ap" "an" "rr"
<- tibble(
string_data start = c(1, 2, 4),
end = c(2, 3, 5)
)pmap_chr(string_data, str_sub, string = "abcde")
[1] "ab" "bc" "de"
- .
Solution
%>% mutate(fit_mod = pmap(df_arg_combos, fit_model, data = diamonds)) df_arg_combos
After Class
- Take a look at the Schedule page to see how to prepare for the next class.
- Quiz 2 in class on Wednesday.
- Focused on Wrangling, Functions, Base R, base R Iteration (not today’s content…)
- Work on Homework 6.