1 Introduction

Learning Goals

  • Explain the similarities and differences between time series, longitudinal, and spatial data.
  • Explain why and how standard methods such as linear regression (estimated by ordinary least squares, abbreviated OLS) fail on correlated data.
  • Be comfortable working with time/date data using the lubridate R package.
  • Further develop your comfort playing with and manipulating data in R.




Introductions

Directions:

You’ll spend about 10 minutes in small groups.

When we come back together, I’m going to ask everyone to take a turn and do the following:

  • Introduce one of your small group mates.
  • Share one insight/idea from your discussion about learning experiences / grades.

So take a moment to:

  • Introduce yourselves in whatever way you feel appropriate (e.g. name, pronouns, majors/minors, how you’re feeling at the moment, things you’re looking forward to, why you are motivated to take this class). You can generate a random “ice breaker” at https://mackenziekbrooks.github.io/icebreaker-generator/.

  • Discuss the following:

    • What are some things that have contributed to positive learning experiences in your courses that you would like to have in place for this course? What has contributed to negative experiences that you would like to prevent?
    • What is the purpose of grades? Have your grades captured your learning in the past? If not, what do they reflect?

Group Activity: Playing with Data

In this course, we will spend time with real data as well as the theory of the statistical models and methods that we use for correlated data.

The first half of the semester will be theory-heavy so that we can establish some important theoretical threads that we will weave through the different types of correlated data.

So today, we will work with real, correlated data. We will start thinking about how to plot temporal data (data collected over time) using standard methods, and we will try to come up with creative ways to explore these data.

Directions

  1. Download a template RMarkdown file to start from here.
  2. Each of you should work on your own RMarkdown file, but you need to check in with each other on each question. Help each other. Share ideas and insight.
  3. Open RStudio and the template file. Make sure you have installed the R packages you need for today.

We will be using the tidyverse and lubridate packages today. If you have not seen these packages yet, I recommend checking out the data transformation and visualization cheatsheets (https://www.rstudio.com/resources/cheatsheets/). The symbol %>% is a pipe; it passes whatever is on its left as the first argument to the function on its right.

#Run this R chunk first!
library(tidyverse)
library(lubridate) #install.packages('lubridate') if needed

#options(gargle_oob_default = TRUE) #run if you are working on Mac RStudio Server
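
To make the pipe concrete, here is a minimal sketch on a small made-up vector (temps is an illustration only, not part of the nest data). It shows that a piped expression gives the same result as the equivalent nested function calls.

```r
library(dplyr)  # provides the %>% pipe

temps <- c(68, 70, 72)  # made-up temperatures for illustration

# Without the pipe: read from the inside out
nested <- round(mean(temps), 1)

# With the pipe: read from left to right
piped <- temps %>% mean() %>% round(1)

identical(nested, piped)
```

The piped version reads in the order the steps happen: take temps, compute the mean, then round.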

Data Context: Temperatures & Energy Use

Minnesota winters are cold. In St. Paul, homes and apartments are typically 50 to 100 years old, or older. Many are drafty and lack modern insulation, so they may not be energy efficient.

To learn about energy use in a St. Paul home, Prof. Heggeseth’s family installed a Nest thermostat in the Dining Room in Jan 2019 to control the heat and record the temperature and energy use in their home (built in 1914). They installed another Nest thermostat Upstairs in July 2019 to control the air conditioning (cooling). Then in Fall 2019, they redid their heating system to have three zones and added a Basement thermostat. You have access to almost a full year of data in 2019.

Exercises

Today, we are going to explore real data that are continuously being collected, every five to ten minutes, and stored in a Google spreadsheet. With the code below, we read in the data stored in Google Sheets.

nest  <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQnSHPcthKoRmgCOgoW-Sl_vRRm2mUTKhKIYW04OQqOAste8lGLI71fq3WQIvtC4C8PBdSNfnw21PB1/pub?output=csv")

Check out the variables in the data set. Note that Location1, Location2, Location3 are not going to be useful variables for us; they are constant, so I remove them with the code below.

nest <- nest %>%
  select(-Location1,-Location2,-Location3)

head(nest) #prints first 6 rows of data frame

  1. List the variables and their types (categorical, continuous) and guess what they measure.

ANSWER:

Before we go any further, I want to introduce you to a package that will be very useful, especially when we have date or time variables: the lubridate package! The documentation for the package is here: https://lubridate.tidyverse.org/
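
Before applying lubridate to the nest data, it may help to see what its parsing and time zone functions do to a single value. The sketch below uses a made-up timestamp (raw is hypothetical, chosen only to match the month/day/year format that mdy_hms() expects). It contrasts force_tz(), which keeps the clock time and changes the time zone label, with with_tz(), which keeps the instant and changes the clock time.

```r
library(lubridate)

# A made-up timestamp in month/day/year hour:minute:second form
raw <- "1/15/2019 9:30:00"

parsed  <- mdy_hms(raw)                         # parsed as UTC by default
central <- force_tz(parsed, "America/Chicago")  # same clock time, new time zone
shifted <- with_tz(parsed, "America/Chicago")   # same instant, new clock time

tz(parsed)     # time zone attached to the parsed value
central        # still reads 9:30, now in Central Time
shifted        # the UTC instant expressed in Central Time
date(central)  # the calendar date only
```

If the recorded times are already local clock times that were merely parsed as UTC, force_tz() is the appropriate choice; that is the assumption the code for the nest data makes.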

nest <- nest %>%
  mutate(Date = mdy_hms(Date)) %>% # convert character string to date/time object
  mutate(DateTime = force_tz(Date, "America/Chicago")) %>% #force the time zone to be Central Time
  mutate(DateOnly = date(DateTime)) #pull out the date only

  1. Use the lubridate functions hour() and minute() within mutate() to create new variables for the hour in a day and minutes within the hour. Then create a quantitative, continuous variable in the nest dataset called Time that combines the hour and minute as a decimal, so that, for example, 9.5 represents 9:30am and 21.5 represents 9:30pm.

ANSWER:

#mutate adds variables to the data set
nest <- nest %>%
  mutate() %>%  #start with Hour and Minute
  mutate() #then do Time

  1. Use the lubridate functions day() and month() to create new variables for day within the month and the month within the year. Create a quantitative variable in the nest dataset, called MonthDay, that combines the month and day such that April 15 is 4.5, for example. Hint: days_in_month() will be useful to do this.

ANSWER:

  1. Create a categorical variable called DayofWeek with weekday labels using wday().

ANSWER:

  1. Create a line plot (geom_line) with DateTime and DiningTemp. Comment on what patterns you notice, what you learn and questions you have.

ANSWER:

  1. Create a line plot (geom_line) with Time and DiningTemp, grouped by MonthDay, colored by factor(DayofWeek), and faceted by Month. Comment on what patterns you notice, what you learn, and questions you have.

ANSWER:

  1. Create a scatterplot (geom_jitter) of DiningTemp against DiningPrevTemp (the temperature 5 or 10 minutes earlier). Do the same for the observation 10 to 20 minutes earlier. Comment on what patterns you notice, what you learn, and questions you have.

ANSWER:

nest <- nest %>%
  mutate(
    DiningPrevTemp  = lag(DiningTemp),
    DiningPrevTemp2 = lag(DiningTemp, 2),
    DiningPrevTemp3 = lag(DiningTemp, 3)
  )
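
To see what lag() does in the code above, here is a small sketch on a made-up vector (temps is hypothetical, not taken from the nest data). It assumes the values are already sorted in time order, which lag() does not check for you.

```r
library(dplyr)  # dplyr's lag() shifts a vector; stats::lag() behaves differently

temps <- c(70, 71, 72, 73)  # made-up readings, in time order

prev1 <- lag(temps)     # previous reading: NA 70 71 72
prev2 <- lag(temps, 2)  # reading two steps back: NA NA 70 71

data.frame(temps, prev1, prev2)
```

The leading NA values appear because the first observations have no earlier reading to pair with; plotting functions will typically drop those rows with a warning.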

  1. Create a tile plot (geom_tile) of Time against MonthDay filled by factor(DiningMode), but only for the observations before May. Comment on what patterns you notice, what you learn, and questions you have.

ANSWER:

  1. Explore the relationship between outside temperature and the heating and cooling systems. Create plots with short summaries of the insights you gain about the energy efficiency of our home, the schedule of our thermostat, etc. Be creative. Think outside the box. Come up with an idea of what you want to plot and ask for help with coding questions.

Some ideas to get you started: summarize each day by the proportion of time the heater was on (or longest/shortest stretch of time with no heat or the max/min difference between Target Temp and Actual Temp) as well as summaries of temperature (min, mean, median, max).
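
As one possible starting point for the daily summaries suggested above, here is a minimal sketch on made-up data. The toy data frame and its Temp and HeatOn columns are hypothetical stand-ins; you would need to build a similar on/off indicator from the actual variables in the nest data.

```r
library(dplyr)

# Made-up readings for two days; Temp and HeatOn are hypothetical columns,
# not variables from the nest data
toy <- tibble(
  DateOnly = as.Date(c("2019-01-15", "2019-01-15", "2019-01-15", "2019-01-15",
                       "2019-01-16", "2019-01-16", "2019-01-16", "2019-01-16")),
  Temp     = c(66, 67, 68, 67, 65, 66, 66, 67),
  HeatOn   = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

# One row per day: proportion of readings with heat on, plus temperature summaries
daily <- toy %>%
  group_by(DateOnly) %>%
  summarize(
    PropHeatOn = mean(HeatOn),  # the mean of TRUE/FALSE is the proportion TRUE
    MinTemp    = min(Temp),
    MeanTemp   = mean(Temp),
    MaxTemp    = max(Temp)
  )

daily
```

Note that the proportion of readings is only a rough proxy for the proportion of time when readings are unevenly spaced (every five to ten minutes here).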

ANSWER: