1 Introduction

Learning Goals

  • Explain the similarities and differences between time series, longitudinal, and spatial data.
  • Explain why and how standard methods such as linear regression (estimated by ordinary least squares, OLS) fail on correlated data.
  • Be comfortable working with time/date data and the lubridate R package.
  • Start to become comfortable playing with and manipulating data in R.


Slides from today are available here.




Introductions

Directions:

You’ll spend about 10 minutes in small groups. Then we’ll come back together as a larger group.

When we come back together, I’m going to ask everyone to take a turn and do the following:

  • Introduce one of your group mates.
  • Share one thing from your discussion about learning experiences.

So take a moment to:

  • Introduce yourselves in whatever way you feel appropriate (e.g. name, pronouns, majors/minors, how you’re feeling at the moment, things you’re looking forward to, why you are motivated to take this class). You can generate a random “ice breaker” at https://mackenziekbrooks.github.io/icebreaker-generator/.

  • Discuss the following:

    • What are some things that have contributed to positive learning experiences in your courses that you would like to have in place for this course? What has contributed to negative experiences that you would like to prevent?

Group Activity: Playing with Data

In this course, we will spend time with real data as well as a bit of the theory of the statistical models and methods that we use for correlated data.

Next week will be more theory-heavy so that we can establish some important theoretical threads that we will weave through the different types of correlated data.

So today, we will work with real, correlated data and start thinking about how to plot temporal data (data collected over time), both with standard methods and with more creative ways of exploring the data.

Directions

  1. Download a template RMarkdown file to start from here.
  2. Each of you should work on your own RMarkdown file, but you need to check in with each other on each question. Help each other. Share ideas and insight.
  3. Open RStudio and the template file. Make sure you have installed the R packages you need for today.

We will be using the tidyverse (dplyr, ggplot2) today. If you have not seen these packages yet, I recommend checking out the data transformation and visualization cheatsheets (https://www.rstudio.com/resources/cheatsheets/). The symbol %>% is a pipe and it passes whatever is to the left as the first argument to the function on the right.
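
As a quick illustration of the pipe, the sketch below runs the same computation with and without %>%. The temps data frame and its values are made up for this example; they are not the Nest data.

```r
library(dplyr)

# Toy data frame (hypothetical values, not the Nest data)
temps <- tibble(
  room = c("Dining", "Dining", "Upstairs"),
  temp = c(68, 70, 72)
)

# Without the pipe: each data frame is passed as the first argument
nested <- summarize(filter(temps, room == "Dining"), mean_temp = mean(temp))

# With the pipe: the left-hand side becomes the first argument on the right
piped <- temps %>%
  filter(room == "Dining") %>%
  summarize(mean_temp = mean(temp))

identical(nested, piped)
```

Both versions return the same one-row result, so the pipe changes how the code reads, not what it computes.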

#Run this R chunk first!
library(dplyr)
library(ggplot2)
library(googlesheets4) #install.packages('googlesheets4') if needed
library(lubridate) #install.packages('lubridate') if needed

#options(gargle_oob_default = TRUE) #run if you are working on Mac RStudio Server

Data Context: Temperatures & Energy Use

Minnesota winters are cold. In St. Paul, homes and apartments are typically 50 to 100 years old or older. This means they may be drafty, lack modern insulation, and not be energy efficient.

To learn about energy use in a St. Paul home, Prof. Heggeseth’s family installed a Nest thermostat in the Dining Room in Jan 2019 to control the heat and record the temperature and energy use in their home (built in 1914). They installed another Nest thermostat Upstairs in July 2019 to control the air conditioning (cooling). Then in Fall 2019, they redid their heating system to have three zones and added a Basement thermostat.

Exercises

Today, we are going to explore real data that are continuously being collected, every five to ten minutes, and stored in a Google spreadsheet. With the code below, we read in the data stored in Google Sheets.

nest <- read_sheet("https://docs.google.com/spreadsheets/d/1UwySCYheUYWRyG3Vsc5e2l1NH2skN-qaUcnUohu3lRo/edit#gid=1236040373", col_types = "Tcdddccdddccdddcdd")

Check out the variables in the data set. Note that Location1, Location2, Location3 are not going to be useful variables for us; they are constant, so I remove them with the code below.

nest <- nest %>%
  select(-Location1,-Location2,-Location3)

head(nest) #prints first 6 rows of data frame

  1. List the variables and their types (categorical or continuous), and guess what each one measures.

ANSWER:

Before we go any further, I want to introduce you to a package that is especially useful when the data include date or time variables: the lubridate package. The documentation for the package is here: https://lubridate.tidyverse.org/

nest <- nest %>%
  mutate(DateTime = force_tz(Date, "America/Chicago")) %>% #force the time zone to be Central Time
  mutate(DateOnly = date(DateTime)) #pull out the date only
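
To see what force_tz() does, here is a minimal sketch on a single made-up timestamp. It assumes the timestamp was read in labeled as UTC even though the clock time was recorded locally; whether that matches how your copy of the data was read in is worth checking. It also contrasts force_tz() with with_tz(), which is not used above.

```r
library(lubridate)

# A hypothetical timestamp labeled as UTC
stamp_utc <- ymd_hms("2019-01-18 18:30:00", tz = "UTC")

# force_tz() keeps the clock time and relabels the time zone
forced <- force_tz(stamp_utc, "America/Chicago")

# with_tz() keeps the instant in time and converts the clock time
converted <- with_tz(stamp_utc, "America/Chicago")

hour(forced)     # 18: same clock time, now labeled Central
hour(converted)  # 12: same instant, shown in Central (UTC-6 in January)
date(forced)     # the calendar date only
```
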
  1. Use the functions hour() and minute() to create new variables for the hour in a day and minutes within the hour. Then create a quantitative variable in nest, called Time, that combines the hour and minute such that 9:30am is 9.5.

ANSWER:

#mutate adds variables to the data set
nest <- nest %>%
  mutate() %>%  #start with Hour and Minute
  mutate() #then do Time
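
If you get stuck, here is one possible approach, sketched on a toy data frame rather than on nest. The two timestamps are made up, and dividing the minutes by 60 is just one way to get a decimal hour.

```r
library(dplyr)
library(lubridate)

# Two hypothetical timestamps in Central Time
toy <- tibble(
  DateTime = ymd_hm(c("2019-02-01 09:30", "2019-02-01 17:45"),
                    tz = "America/Chicago")
)

toy <- toy %>%
  mutate(Hour = hour(DateTime), Minute = minute(DateTime)) %>% # pull out the pieces
  mutate(Time = Hour + Minute / 60)                            # 9:30am becomes 9.5

toy$Time
```
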
  1. Use the functions day() and month() to create new variables for day within the month and month within the year. Then create a quantitative variable in nest, called MonthDay, that combines the month and day such that April 15 is 4.5. Hint: days_in_month() will be useful to do this.

ANSWER:
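
One possible approach, again sketched on made-up dates rather than on nest: add the day, as a fraction of the number of days in that month, to the month number. This convention is an assumption on my part; it makes April 15 equal to 4.5 because April has 30 days.

```r
library(dplyr)
library(lubridate)

# Two hypothetical dates
toy <- tibble(DateOnly = ymd(c("2019-04-15", "2019-02-14")))

toy <- toy %>%
  mutate(Day = day(DateOnly), Month = month(DateOnly)) %>%
  # as.numeric() drops the month names that days_in_month() attaches
  mutate(MonthDay = Month + Day / as.numeric(days_in_month(DateOnly)))

toy$MonthDay
```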

  1. Use filter() to remove the first day of measurements, Jan 18th, which has only the evening hours.

ANSWER:
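
A minimal sketch of dropping the earliest day with filter(), using made-up dates and temperatures. It assumes the partial day is the earliest date in the data, so it compares each date to min(DateOnly) instead of typing the date out.

```r
library(dplyr)
library(lubridate)

# Hypothetical daily readings; the first date stands in for the partial day
toy <- tibble(
  DateOnly   = ymd(c("2019-01-18", "2019-01-19", "2019-01-20")),
  DiningTemp = c(66, 67, 68)
)

# Keep only rows that fall after the earliest date
kept <- toy %>%
  filter(DateOnly > min(DateOnly))

nrow(kept)
```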

  1. Use wday() to create a categorical variable of weekday labels, called DayofWeek.

ANSWER:
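
By default wday() returns a number (1 for Sunday through 7 for Saturday); setting label = TRUE returns an ordered factor of day names instead. A small sketch on made-up dates:

```r
library(dplyr)
library(lubridate)

# Two hypothetical dates: a Saturday and a Monday
toy <- tibble(DateOnly = ymd(c("2019-01-19", "2019-01-21")))

toy <- toy %>%
  mutate(DayNumber = wday(DateOnly),               # numeric day of the week
         DayofWeek = wday(DateOnly, label = TRUE)) # ordered factor of labels

toy
```

The exact label text depends on your locale settings.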

  1. Create a line plot (geom_line) with DateTime and DiningTemp. Comment on what patterns you notice, what you learn and questions you have.

ANSWER:

  1. Create a line plot (geom_line) with Time and DiningTemp, grouped by MonthDay, colored by factor(DayofWeek), and faceted by Month. Comment on what patterns you notice, what you learn and questions you have.

ANSWER:
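
Here is a sketch of how the group, color, and facet pieces fit together, using a small made-up data frame with the same variable names. The values are hypothetical, and this is only one way to set up the plot.

```r
library(dplyr)
library(ggplot2)

# Hypothetical readings: two days in each of two months, three times per day
toy <- tibble(
  Month      = rep(c(1, 2), each = 6),
  MonthDay   = rep(c(1.5, 1.6, 2.5, 2.6), each = 3),
  DayofWeek  = rep(c("Mon", "Tue", "Mon", "Tue"), each = 3),
  Time       = rep(c(6, 12, 18), times = 4),
  DiningTemp = c(64, 68, 66, 65, 69, 67, 63, 67, 65, 64, 68, 66)
)

p <- ggplot(toy, aes(x = Time, y = DiningTemp,
                     group = MonthDay,               # one line per day
                     color = factor(DayofWeek))) +   # color by weekday
  geom_line() +
  facet_wrap(~ Month)                                # one panel per month

p
```

Without group = MonthDay, all observations that share a color would be connected into a single line, which would join different days together.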

  1. Create a scatterplot (geom_jitter) of DiningTemp against DiningPrevTemp (the temperature 5 to 10 minutes earlier). Do the same for the observation 10 to 20 minutes earlier. Comment on what patterns you notice, what you learn and questions you have.

ANSWER:

nest <- nest %>%
  mutate(
    DiningPrevTemp = lag(DiningTemp),     #1 observation earlier
    DiningPrevTemp2 = lag(DiningTemp, 2), #2 observations earlier
    DiningPrevTemp3 = lag(DiningTemp, 3)  #3 observations earlier
  )
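
To see what lag() is doing, here is a small sketch on a made-up temperature vector. It assumes the rows are already sorted by time; lag() shifts by row position, not by clock time, so gaps in the measurements are not accounted for.

```r
library(dplyr)

# Hypothetical temperatures in time order
toy <- tibble(DiningTemp = c(68, 69, 71, 70, 72))

toy <- toy %>%
  mutate(DiningPrevTemp  = dplyr::lag(DiningTemp),    # shift down by 1 row
         DiningPrevTemp2 = dplyr::lag(DiningTemp, 2)) # shift down by 2 rows

toy$DiningPrevTemp  # NA 68 69 71 70

# Correlation between each observation and the one before it,
# ignoring the first row, which has no previous value
cor(toy$DiningTemp, toy$DiningPrevTemp, use = "complete.obs")
```
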
  1. Create a tile plot (geom_tile) of Time against MonthDay filled by factor(DiningMode), but only for the observations before May. Comment on what patterns you notice, what you learn and questions you have.

ANSWER:

  1. Explore the relationship between outside temperature and the heating and cooling systems. Create plots with short summaries of the insights you gain about the energy efficiency of our home, the schedule of our thermostat, etc. Be creative. Think outside the box. Come up with what you want to plot and ask for help with coding questions.

Some ideas to get you started: summarize each day by the proportion of time the heater was on (or longest/shortest stretch of time with no heat or the max/min difference between Target Temp and Actual Temp) as well as summaries of temperature (min, mean, median, max).
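
As a starting point, daily summaries like these can be sketched with group_by() and summarize(). The data below are made up, and the mode value "heating" is an assumption; check which values DiningMode actually takes in nest. The proportion of rows approximates the proportion of time only if the observations are roughly equally spaced.

```r
library(dplyr)
library(lubridate)

# Hypothetical readings over two days
toy <- tibble(
  DateOnly   = ymd(c("2019-02-01", "2019-02-01", "2019-02-01", "2019-02-01",
                     "2019-02-02", "2019-02-02")),
  DiningMode = c("heating", "off", "heating", "heating", "off", "heating"),
  DiningTemp = c(66, 68, 67, 69, 70, 68)
)

daily <- toy %>%
  group_by(DateOnly) %>%
  summarize(
    PropHeat = mean(DiningMode == "heating"), # share of rows with the heat on
    MinTemp  = min(DiningTemp),
    MeanTemp = mean(DiningTemp),
    MaxTemp  = max(DiningTemp)
  )

daily
```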

ANSWER: