1 Introduction
Learning Goals
- Explain the similarities and differences between time series, longitudinal, and spatial data.
- Explain why and how standard methods such as linear regression (estimated by ordinary least squares, OLS) fail on correlated data.
- Be comfortable working with time/date data and the lubridate R package.
- Start to become comfortable playing with and manipulating data in R.
Slides from today are available here.
Introductions
Directions:
You’ll spend about 10 minutes in small groups. Then we’ll come back together as a larger group.
When we come back together, I’m going to ask everyone to take a turn and do the following:
- Introduce one of your group mates.
- Share one thing from your discussion about learning experiences.
So take a moment to:
Introduce yourselves in whatever way you feel appropriate (e.g. name, pronouns, majors/minors, how you’re feeling at the moment, things you’re looking forward to, why you are motivated to take this class). You can generate a random “ice breaker” at https://mackenziekbrooks.github.io/icebreaker-generator/.
Discuss the following:
- What are some things that have contributed to positive learning experiences in your courses that you would like to have in place for this course? What has contributed to negative experiences that you would like to prevent?
Group Activity: Playing with Data
In this course, we will spend time with real data as well as a bit of the theory of the statistical models and methods that we use for correlated data.
Next week will be more theory-heavy so that we can establish some important theoretical threads that we will weave through the different types of correlated data.
So today, we will work with real, correlated data and start thinking about how to plot temporal data (data collected over time), both with standard methods and with more creative ways of exploring the data.
Directions
- Download a template RMarkdown file to start from here.
- Each of you should work on your own RMarkdown file, but you need to check in with each other on each question. Help each other. Share ideas and insight.
- Open RStudio and the template file. Make sure you have installed the R packages you need for today.
We will be using the tidyverse (dplyr, ggplot2) today. If you have not seen these packages yet, I recommend checking out the data transformation and visualization cheatsheets (https://www.rstudio.com/resources/cheatsheets/). The symbol %>% is a pipe: it passes whatever is on its left as the first argument to the function on its right.
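As a small illustration of how the pipe works, the sketch below uses a made-up vector of temperatures (`temps` is hypothetical, not part of the nest data) and shows that the piped and unpiped calls give the same result.

```r
library(dplyr)

# Toy values for illustration only
temps <- c(68, 70, 65)

# These two calls are equivalent: the pipe supplies temps as the first argument
plain <- mean(temps)
piped <- temps %>% mean()

# Additional arguments follow the piped one, so this is round(mean(temps), digits = 1)
rounded <- temps %>% mean() %>% round(digits = 1)
```

Reading a pipeline left to right ("take temps, then average, then round") is often easier than reading nested function calls from the inside out.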
#Run this R chunk first!
library(dplyr)
library(ggplot2)
library(googlesheets4) #install.packages('googlesheets4') if needed
library(lubridate) #install.packages('lubridate') if needed
#options(gargle_oob_default = TRUE) #run if you are working on Mac RStudio Server
Data Context: Temperatures & Energy Use
Minnesota winters are cold. In St. Paul, homes and apartments are typically 50 to 100 years old or older. This means they may not be energy efficient in that they may be drafty and without modern insulation.
To learn about energy use in a St. Paul home, Prof. Heggeseth’s family installed a Nest thermostat in the Dining Room in Jan 2019 to control the heat and record the temperature and energy use in their home (built in 1914). They installed another Nest thermostat Upstairs in July 2019 to control the air conditioning (cooling). Then in Fall 2019, they redid their heating system to have three zones and added a Basement thermostat.
Exercises
Today, we are going to explore real data that are continuously being collected, every five to ten minutes, and stored in a Google spreadsheet. With the code below, we read in the data stored in Google Sheets.
nest <- read_sheet("https://docs.google.com/spreadsheets/d/1UwySCYheUYWRyG3Vsc5e2l1NH2skN-qaUcnUohu3lRo/edit#gid=1236040373", col_types = "Tcdddccdddccdddcdd")
Check out the variables in the data set. Note that Location1, Location2, and Location3 are not going to be useful variables for us; they are constant, so I remove them with the code below.
nest <- nest %>%
  select(-Location1, -Location2, -Location3)
head(nest) #prints first 6 rows of data frame
- List the variables and their types (categorical, continuous) and guess what they measure.
ANSWER:
Before we go any further, I want to introduce you to a package that will be very useful, especially when we have date or time variables: the lubridate package! The documentation for the package is here: https://lubridate.tidyverse.org/
nest <- nest %>%
  mutate(DateTime = force_tz(Date, "America/Chicago")) %>% #force the time zone to be Central Time
  mutate(DateOnly = date(DateTime)) #pull out the date only
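To see what force_tz() does, it can help to compare it with with_tz() on a single toy timestamp. The value below is made up for illustration and assumes the timestamp was read in as UTC even though the clock time was recorded in Central Time.

```r
library(lubridate)

# Toy timestamp labeled as UTC (an assumption for this sketch)
x <- ymd_hms("2019-01-19 08:30:00", tz = "UTC")

# force_tz keeps the clock time and changes the time zone label
forced <- force_tz(x, "America/Chicago")

# with_tz keeps the instant in time and changes the clock time
shifted <- with_tz(x, "America/Chicago")

hour(forced)   # 8
hour(shifted)  # 2, since Central Standard Time is 6 hours behind UTC in January
date(forced)   # the calendar date only
```

force_tz() is the right choice when the recorded clock times are correct but carry the wrong time zone label; with_tz() is for converting a correctly labeled time to another zone.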
- Use the functions hour() and minute() to create new variables for the hour in a day and minutes within the hour. Then create a quantitative variable in nest, called Time, that combines the hour and minute such that 9:30am is 9.5.
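As a quick check of the arithmetic, here is a sketch on a single made-up timestamp (not the nest data): dividing the minutes by 60 converts them to a fraction of an hour.

```r
library(lubridate)

# Toy timestamp for illustration only
x <- ymd_hm("2019-01-19 09:30", tz = "America/Chicago")

# Hours plus minutes expressed as a fraction of an hour
decimal_time <- hour(x) + minute(x) / 60
decimal_time  # 9.5
```

The same expression works inside mutate() on a whole column of date-times.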
ANSWER:
#mutate adds variables to the data set
nest <- nest %>%
  mutate() %>% #start with Hour and Minute
  mutate() #then do Time
- Use the functions day() and month() to create new variables for day within the month and month within the year. Then create a quantitative variable in nest, called MonthDay, that combines the month and day such that April 15 is 4.5. Hint: days_in_month() will be useful to do this.
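One way to read the hint, sketched on two made-up dates: dividing the day by the number of days in that month gives the fraction of the month that has passed, so April 15 (15 of 30 days) becomes 4.5. This is one possible construction, not necessarily the only one that fits the exercise.

```r
library(lubridate)

# Toy dates for illustration only
d <- ymd(c("2019-04-15", "2019-02-14"))

# days_in_month() returns a named vector, so drop the names for a clean result
month_day <- unname(month(d) + day(d) / days_in_month(d))
month_day  # 4.5 2.5
```

Because months differ in length, the same fraction (0.5) corresponds to day 15 in April but day 14 in February 2019.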
ANSWER:
- Remove the first day of measurements, Jan 18th, that has only the evening hours with filter().
ANSWER:
- Create a categorical variable of weekday labels, called DayofWeek, with wday().
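A brief sketch of the two forms wday() can return, using a made-up date. The numeric coding shown assumes lubridate's default week start (Sunday = 1), and the label text depends on your locale.

```r
library(lubridate)

# Toy date for illustration only; January 19, 2019 was a Saturday
x <- ymd("2019-01-19")

# Numeric day of the week, with Sunday = 1 under the default settings
day_num <- wday(x)

# label = TRUE returns an ordered factor of weekday names instead
day_label <- wday(x, label = TRUE)
```

The labeled version is usually the more convenient one for coloring or faceting plots.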
ANSWER:
- Create a line plot (geom_line) with DateTime and DiningTemp. Comment on what patterns you notice, what you learn and questions you have.
ANSWER:
- Create a line plot (geom_line) with Time and DiningTemp, grouped by MonthDay, colored by factor(DayofWeek), and faceted by Month. Comment on what patterns you notice, what you learn and questions you have.
ANSWER:
- Create a scatterplot (geom_jitter) of DiningTemp against DiningPrevTemp (the temperature 5 to 10 minutes earlier). Do the same for the observation 10 to 20 minutes earlier. Comment on what patterns you notice, what you learn and questions you have.
ANSWER:
nest <- nest %>%
  mutate(DiningPrevTemp = lag(DiningTemp), DiningPrevTemp2 = lag(DiningTemp, 2), DiningPrevTemp3 = lag(DiningTemp, 3))
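To see what lag() produces, here is a sketch on a short made-up vector (not the nest data). Each value is shifted down by the given number of positions, and the positions with no earlier observation are filled with NA.

```r
library(dplyr)

# Toy temperature readings in time order, for illustration only
temp <- c(68, 69, 71, 70)

prev1 <- dplyr::lag(temp)     # NA 68 69 71
prev2 <- dplyr::lag(temp, 2)  # NA NA 68 69
```

Note that lag() shifts by rows, not by clock time, so the lagged value is "5 or 10 minutes earlier" only if the rows are sorted by time and no readings are missing.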
- Create a tile plot (geom_tile) of Time against MonthDay filled by factor(DiningMode), but only for the observations before May. Comment on what patterns you notice, what you learn and questions you have.
ANSWER:
- Explore the relationship between outside temperature and the heating and cooling systems. Create plots with short summaries of the insights you gain about the energy efficiency of our home, the schedule of our thermostat, etc. Be creative. Think outside the box. Come up with what you want to plot and ask for help with coding questions.
Some ideas to get you started: summarize each day by the proportion of time the heater was on (or longest/shortest stretch of time with no heat or the max/min difference between Target Temp and Actual Temp) as well as summaries of temperature (min, mean, median, max).
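A daily summary of this kind might be sketched with group_by() and summarize(). The toy data frame below is made up, and the mode labels "heating" and "off" are assumptions for illustration; check the actual values of DiningMode in nest before adapting it.

```r
library(dplyr)

# Toy data: values and mode labels are hypothetical, not the real coding
toy <- tibble(
  DateOnly = as.Date(c("2019-01-19", "2019-01-19", "2019-01-19",
                       "2019-01-20", "2019-01-20")),
  DiningMode = c("heating", "off", "heating", "off", "off"),
  DiningTemp = c(66, 68, 67, 69, 70)
)

# One row per day: share of readings with the heat on, plus temperature range
daily <- toy %>%
  group_by(DateOnly) %>%
  summarize(
    PropHeating = mean(DiningMode == "heating"),
    MinTemp = min(DiningTemp),
    MaxTemp = max(DiningTemp)
  )
```

Taking the mean of a TRUE/FALSE condition gives the proportion of readings where it holds, which approximates the proportion of time only if the readings are roughly evenly spaced.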
ANSWER: