---
title: "Simple linear regression: Transformations (Notes)"
subtitle: "STAT 155"
author: "Your Name"
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
---


```{r setup}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE, 
  warning = FALSE,
  message = FALSE,
  error = TRUE,
  fig.height = 2.75, 
  fig.width = 4.25,
  fig.env = 'figure',
  fig.pos = 'h',
  fig.align = 'center')
```


::: {.callout-note title = "Organize your files"}

This **qmd file** is where you'll type notes, code, etc.
Directions:

- Save this file in the `inclass_activities` sub-folder of the `stat155` folder you created before today's class. Use a file name related to the activity number and/or today's date (e.g. "activity 6" or "6 transformations").
:::



## Warm-up {-}

### Example 1: Modeling intuition

Since 1986, *The Economist* has published "The Big Mac Index" as a metric for comparing purchasing power between cities, giving rise to the phrase **Burgernomics**.
It was developed (sort of jokingly) as a way to explain exchange rates in digestible terms.
OPTIONAL: You can [read more about the Big Mac index](https://www.economist.com/big-mac-index) in *The Economist*.

The `bigmac` data below, collected in 2006, includes various information on 70 cities:

```{r}
# Load necessary packages
library(tidyverse)

# Import data
bigmac <- read_csv("https://mac-stat.github.io/data/bigmac.csv") %>% 
  rename(income = gross_annual_teacher_income) %>% 
  select(city, bigmac_mins, income)

# Check it out
head(bigmac)
dim(bigmac)
```

Included are 3 variables:

- `city`
- `bigmac_mins`: average number of minutes of work it takes to afford 1 Big Mac
- `income`: the average gross *teacher* salary in 1 year (USD)


a. Our goal will be to explore the relationship of `bigmac_mins` with teacher `income`, i.e. the extent to which the work time needed to afford a Big Mac in a city might be explained by the average teacher income in that city. What do you *expect* this to look like? For example, will `bigmac_mins` increase or decrease as `income` increases? Will the relationship be linear? Will it be strong?


b. Check your intuition! Construct and discuss a visualization of `bigmac_mins` vs `income`, including a representation of a simple linear regression model of their relationship.

```{r}
___ %>% 
  ___(___(y = ___, x = ___)) + 
  geom___() + 
  geom___(method = ___, se = FALSE)
```



c. What might we do to fix things here?!?




::: {.callout-note title = "Discussing plots of relationships"}

When discussing a visualization of the relationship between 2+ variables, remember to comment on:

1. direction
2. strength
3. shape / form
4. any outliers

:::




### Example 2: Transformations

In the image below, each row contains an example of a transformation:

- far left plot = y vs x
- middle plot = y vs transformed x
- right plot = models of y vs x and y vs transformed x on same frame


![](https://bcheggeseth.github.io/155_spring_2026/images/transformation_examples.png)


For each row, indicate how the transformation impacted the point cloud & model.


Row 1 (location transformation): X to X - 32


Row 2 (scale transformation): X to 5/9 X


Row 3 (location & scale transformation): X to 5/9 (X - 32)


Row 4 (log transformation): X to log(X)
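
To make the four transformations concrete, here is a minimal sketch applied to a few made-up values. The vector `temp_f` is hypothetical, and reading X as a temperature in degrees Fahrenheit (so that 5/9 (X - 32) is degrees Celsius) is an assumption suggested by the constants, not something stated above.

```{r}
# Hypothetical X values (ASSUMED here to be temperatures in degrees Fahrenheit)
temp_f <- c(32, 50, 212)

shifted <- temp_f - 32           # location: every value moves by the same amount
scaled  <- 5/9 * temp_f          # scale: every value is multiplied by the same factor
celsius <- 5/9 * (temp_f - 32)   # location & scale: shift, then rescale
logged  <- log(temp_f)           # log: natural log of every value

# Compare the original and transformed values side by side
data.frame(temp_f, shifted, scaled, celsius, logged)
```

This only shows what each transformation does to the X values themselves. How the point cloud and model change is still for you to read off the image.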



\
\
\
\


## Exercises {-}


**Goal**

- Use *visualizations* to explore the impact of transforming a predictor variable.
- Explore how transformations of a predictor variable may change our regression models and their interpretations.



### Exercise 1: mutate()

If we want to work with a *transformed* version of a variable in our dataset, we must *define and store* this variable using the `mutate()` function in the `tidyverse`.
You learned about `mutate()` in PS 1.
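
As a reminder of the pattern, here is a minimal sketch on a small made-up dataset. The `toy_trees` data and its `height_cm` column are hypothetical and only for illustration.

```{r}
library(dplyr)

# Hypothetical data, for illustration only
toy_trees <- tibble(
  tree = c("A", "B", "C"),
  height_cm = c(150, 320, 80)
)

# mutate() adds a new column computed from existing ones.
# Assigning the result back to toy_trees stores the new column.
toy_trees <- toy_trees %>% 
  mutate(height_m = height_cm / 100)

toy_trees
```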

```{r}
# Define a variable called bigmac_hrs that records the Big Mac time in hours, not minutes
bigmac %>% 
  mutate(bigmac_hrs = ___) %>% 
  head()
```

```{r}
# If we want to use it later, we should store the bigmac_hrs variable in the bigmac dataset
# DO NOT INCLUDE head() OR WE'LL JUST SAVE 6 ROWS!
bigmac <- bigmac %>% 
  mutate(bigmac_hrs = ___)
```

```{r}
# Check our work!
head(bigmac)
dim(bigmac)
```



\
\
\
\




### Exercise 2: Transformations can help make our model less wrong

Let's return to our analysis of `bigmac_mins` vs teacher `income`.
We observed above that this relationship is non-linear, thus a linear regression model of `bigmac_mins` by `income` would be *wrong*.
Since changes in `income` are often thought of in terms of *percentages* rather than raw *units* (dollars), we might be able to fix this using a *log* transform of `income`:

```{r}
# Create the log_income and log_bigmac variables
bigmac <- bigmac %>%
  mutate(log_income = log(income), log_bigmac = log(bigmac_mins))

head(bigmac)
```

Check out a series of plots:

```{r}
# bigmac_mins vs income
bigmac %>% 
  ggplot(aes(y = bigmac_mins, x = income)) + 
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_smooth(color = 'red', se = FALSE) 
```

```{r}
# bigmac_mins vs log_income
bigmac %>% 
  ggplot(aes(y = bigmac_mins, x = log_income)) + 
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_smooth(color = 'red', se = FALSE) 
```

```{r}
# log_bigmac vs log_income
bigmac %>% 
  ggplot(aes(y = log_bigmac, x = log_income)) + 
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_smooth(color = 'red', se = FALSE) 
```

```{r}
# log_bigmac vs log_income
# BUT with axis labels on the scale of the original units (not logged units)
# Note that we use the original variables, then log the axes with the 2 scale_*_continuous() lines!
bigmac %>% 
  ggplot(aes(y = bigmac_mins, x = income)) + 
  geom_point() +
  scale_x_continuous(trans = "log") +
  scale_y_continuous(trans = "log") +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_smooth(color = 'red', se = FALSE) 
```


**FOLLOW-UPS**

a. Which of the following models of Big Mac time by income would be the least "wrong":

    - E[Big Mac time | income] = $\beta_0$ + $\beta_1$ income
    - E[Big Mac time | log(income)] = $\beta_0$ + $\beta_1$ log(income)
    - E[log(Big Mac time) | log(income)] = $\beta_0$ + $\beta_1$ log(income)

b. Which of the models above would be the toughest to interpret? (What are the trade-offs between interpretability and "correctness"?)





\
\
\
\


### Exercise 3: Transformations impact the meaning of our coefficients

In the previous exercise, a log transformation helped make our model less wrong, but we have to be careful when interpreting the new model!
Let's dig into the relationship of `bigmac_mins` vs `log_income`.
This relationship is *linear*, though it still has *unequal variance*:

```{r}
bigmac %>% 
  ggplot(aes(y = bigmac_mins, x = log_income)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
```

Let's build and interpret this model:

```{r}
log_model <- lm(bigmac_mins ~ log_income, data = bigmac)
summary(log_model)
```

The estimated model formula is:

E[`bigmac_mins` | `log_income`] = 210.875 - 18.142 `log_income`

Interpreting these on the *logged* average income scale isn't very helpful:

- $210.875$: In cities with a *logged* average income of 0, the average time it takes to afford 1 Big Mac is 210.875 minutes.

- $-18.142$: A 1-unit increase in *logged* income is associated with an 18.142 minute decrease in the average time it takes to afford 1 Big Mac.


CHALLENGE: Utilizing the summary box below, translate these interpretation sentences to the *income* not *logged* income scale.


- $210.875$:


- $log(1.1)*-18.142$:


```{r}
log(1.1)*-18.142
```


::: {.callout-note title = "Interpreting coefficients for Y vs log(X)"}

$E[ Y | X ] =  \beta_0 + \beta_1 log(X)$

- Intercept
    - $\beta_0$ = expected value of Y when log(X) = 0
    - $\beta_0$ = expected value of Y when X = 1 (i.e. when log(X) = 0)

- Slope
    - $\beta_1$ = change in the expected value of Y associated with a 1 (logged) unit increase in log(X)
    - $log(1.1) * \beta_1$ = change in the expected value of Y if we increase X by 10%.

:::
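
The 10% rule in the summary box can be checked numerically. The sketch below plugs the estimated coefficients reported above into the model formula. The three income values are arbitrary choices for illustration, not values taken from the data.

```{r}
# Estimated coefficients reported above for bigmac_mins vs log_income
b0 <- 210.875
b1 <- -18.142

# Expected Big Mac minutes at a given income (R's log() is the natural log)
expected_mins <- function(income) b0 + b1 * log(income)

# Arbitrary incomes, for illustration only
incomes <- c(5000, 20000, 60000)

# Change in expected minutes when each income increases by 10%
change_10pct <- expected_mins(1.1 * incomes) - expected_mins(incomes)
change_10pct

# The same change, computed directly from the slope
log(1.1) * b1

# The intercept is the expected value at an income of 1 dollar, where log(1) = 0
expected_mins(1)
```

The change is the same at every starting income because log(1.1 X) - log(X) = log(1.1), whatever X is.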



\
\
\
\




### Exercise 4 (OPTIONAL): logging Y

NOTE: This is technically optional for Stat 155 (it will not be on a quiz), but if you plan to continue taking courses in Statistics, Data Science, or Economics, take the time to go through this after class!!

\
\
\

Above we learned about the impact of logging X.
Consider what happens if we log Y (but not X):

$E[ log(Y) | X ] =  \beta_0 + \beta_1 X$

- Intercept
    - $\beta_0$ = average log(Y) outcome when X = 0
    - $e^{\beta_0}$ = geometric average of Y when X = 0

- Slope
    - $\beta_1$ = change in the average log(Y) outcome associated with a 1-unit increase in X
    - $e^{\beta_1}$ = multiplicative change in the geometric average of Y associated with a 1-unit increase in X
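
To see why $e^{\beta_1}$ acts as a multiplier, here is a small sketch. The coefficient values are hypothetical, chosen only for illustration. They are not estimates from the `bigmac` data.

```{r}
# HYPOTHETICAL coefficients for E[log(Y) | X] = beta0 + beta1 * X
beta0 <- 3
beta1 <- -0.05

# Geometric average of Y at a given X: undo the log with exp()
geo_avg_y <- function(x) exp(beta0 + beta1 * x)

# Ratio of geometric averages for a 1-unit increase in X, from several starting points
x_start <- c(0, 10, 25)
ratio <- geo_avg_y(x_start + 1) / geo_avg_y(x_start)
ratio

# The ratio equals exp(beta1) no matter where X starts
exp(beta1)
```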

\
\

NOTE: 
The *arithmetic* average of log(Y) (left) is equivalent to the log of the *geometric* average of Y (right): 

$$
\frac{1}{n}(log(Y_1) + log(Y_2) + \cdots + log(Y_n)) = log\left[\left(Y_1*Y_2* \cdots *Y_n \right)^{1/n} \right]
$$

Or in shorthand notation:

$$
\frac{1}{n}\sum_{i=1}^n log(Y_i) = log\left[\left( \prod_{i=1}^n Y_i\right)^{1/n} \right]
$$
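
This identity is easy to check numerically. The values in `y` below are made up for illustration.

```{r}
# Made-up positive values, for illustration only
y <- c(2, 4, 8)
n <- length(y)

# Left side: arithmetic average of the logged values
mean(log(y))

# Right side: log of the geometric average
geo_avg <- prod(y)^(1/n)
log(geo_avg)

# The two sides agree (up to floating point error)
all.equal(mean(log(y)), log(geo_avg))
```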



Your turn:

- Build and discuss a plot of `log_bigmac` vs `income` (don't use the logged income).
- Fit a linear regression model of `log_bigmac` by `income` and interpret the coefficient estimates.



\
\
\
\


### Exercise 5 (OPTIONAL): Proving the impacts of logs

NOTE: This is technically optional for Stat 155 (it will not be on a quiz), but if you plan to continue taking courses in Statistics, Data Science, or Economics, take the time to go through this after class!!


Above, you practiced interpreting coefficients on the logged and unlogged scales when our model includes a log transformation (either for Y or X):

$E[Y | log(X)] = \beta_0 + \beta_1 log(X)$

$E[log(Y) | X] = \beta_0 + \beta_1 X$

You did so using provided definitions.

If you're curious to *prove* these definitions, and to explore what happens if we log *both* Y and X, check out this [free resource on the topic](https://stats.libretexts.org/Bookshelves/Advanced_Statistics/Intermediate_Statistics_with_R_(Greenwood)/07%3A_Simple_linear_regression_inference/7.06%3A_Transformations_part_II_-_Impacts_on_SLR_interpretations_-_log(y)_log(x)_and_both_log(y)_and_log(x)).


\
\
\
\



### Exercise 6: Transformations can help make our model easier to interpret

The log exercises above illustrate how transformations can help make our model less wrong.
They can also make our model easier to interpret!!
Let's revisit the High Peaks hiking data with a goal of exploring the relationship of average hiking `time` in hours with distance (length):

```{r}
# Import data & check it out
peaks <- read_csv("https://mac-stat.github.io/data/high_peaks.csv")
head(peaks)
```


#### Part a

Currently, hike `length` is measured in *miles*.
But suppose we're more comfortable with, hence prefer to work with, hike length in *kilometers*, not *miles*.
Since 1 mile is roughly 1.60934 km, this would be a *scale transformation*.
Fill in the code below to define & store `length_km` in the `peaks` dataset:

```{r}
# Define length_km
peaks <- ___ %>%
  ___(length_km = 1.60934 * length)

# Check it out
peaks %>% 
  select(time, length, length_km) %>% 
  head()
```

#### Part b

Check out some plots of hiking time by distance:

```{r}
# hiking time vs length in miles
peaks %>% 
  ggplot(aes(y = time, x = length)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

```{r}
# hiking time vs length in km
peaks %>% 
  ggplot(aes(y = time, x = length_km)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

```{r}
# both plots on the same axes
# Don't worry about the code!!
peaks %>% 
  select(time, length, length_km) %>% 
  pivot_longer(cols = -time, names_to = "Predictor", values_to = "length") %>% 
  ggplot(aes(y = time, x = length)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) + 
  facet_wrap(~ Predictor) + 
  lims(y = c(0, 20), x = c(0, 30))
```


**FOLLOW-UP:**

- What impact did the scale transformation have? Specifically, do these 2 models have the same intercepts? The same slopes?

- From an interpretation perspective, you might prefer one model over the other depending on whether you're more comfortable with miles or kilometers. Mathematically, is one model better than the other? For example, is one less wrong? Is one stronger, i.e. does it have a higher R-squared?



\
\
\
\
\



### Exercise 7: Digging into scale transformations

Let's explore how a predictor *scale* transformation, like changing length from miles to kilometers, impacts our coefficients (hence their interpretations).
First, let's model hiking `time` by `length` in *MILES*:

```{r}
peaks_model_1 <- lm(time ~ length, data = peaks)
summary(peaks_model_1)
```

The resulting estimated model formula is below:

E[time | length] = 2.04817 + 0.68427 length

where the `length` coefficient indicates that a *1 mile* increase in hike length is associated with a 0.68427 *hour* increase in the expected hiking time.



#### Part a

In the previous exercise we performed a *scale transformation* to define length in *kilometers*, not *miles*:

`length_km = 1.60934 * length`

Suppose then that we modeled `time` by `length_km` instead of `length` (in miles):

E[time | length_km] = $\beta_0$ + $\beta_1$ length_km

We can interpret the model coefficients as follows:

- $\beta_0$ = the expected hiking time when `length_km` is 0, hence when `length` is 0 (since `length_km = 1.60934 * length`)
- $\beta_1$ = the change in the expected hiking time associated with a 1 *km* increase in length, i.e. a `1/1.60934` *mile* increase in length since `length_km = 1.60934 * length`

Use these interpretations and the original `peaks_model_1` (summarized below) to determine what the new coefficients should be:

- E[time | length] = 2.04817 + 0.68427 length
- E[time | length_km] = ??? + ??? length_km



#### Part b

Check your intuition!

```{r}
# Fit a model of time vs length_km
peaks_model_2 <- lm(time ~ length_km, data = peaks)

# Display the model summary
summary(peaks_model_2)
```


#### Part c

So we now have 2 models of average hiking `time` by hike length, as measured by `length` and `length_km`:

- E[time | length] = 2.04817 + 0.68427 length
- E[time | length_km] = 2.04817 + 0.42519 length_km

As indicated by the equal intercepts but differing slopes, these models have the same "location" but differing "scales" hence rates of change:

```{r}
# Don't worry about the code right now!!
peaks %>% 
  select(time, length, length_km) %>% 
  pivot_longer(cols = -time, names_to = "Predictor", values_to = "length") %>% 
  ggplot(aes(y = time, x = length)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) + 
  facet_wrap(~ Predictor) + 
  lims(y = c(0, 20), x = c(0, 30))
```
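
The link between the two slopes can be checked with the numbers already reported above. This is only an arithmetic sketch using the rounded estimates, not a refit of either model.

```{r}
# Rounded estimates reported above
slope_miles <- 0.68427   # hours per mile
km_per_mile <- 1.60934

# A 1 km increase is a 1/1.60934 mile increase, so divide the per-mile slope
slope_km <- slope_miles / km_per_mile

# Compare with the length_km slope reported above
round(slope_km, 5)
```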

Importantly, they produce the same predictions!
For example, use both models to predict the average hiking time of a 10 mile hike.
These two predictions should be the same, within rounding.

```{r}
# Predicting time with the peaks_model_1
2.04817 + 0.68427*___

# Predicting time with the peaks_model_2
2.04817 + 0.42519*___
```

#### Part d (CHALLENGE)

Suppose we start with the model:

$E[Y | X] = \beta_0 + \beta_1 X$

Reflecting on your work above, summarize "rules" for how the intercept and slope are impacted if we perform a scale transformation, bX.
(Do these change? If so, *how* do they change? How does this change depend upon "b"?)
Either answer this in words, or by filling in the formula below:

$E[Y | bX] = \_\_\_ + \_\_\_ (bX)$


\
\
\
\






### Exercise 8: Location transformations (Part 1)

Another type of transformation that can improve the interpretability of our model is a *location transformation*.
The `homes` data includes 2006 data on homes in Saratoga County, New York:

```{r}
# Import data
homes <- read_csv("https://mac-stat.github.io/data/homes.csv")
head(homes)
```

Our goal is to understand the relationship of home `Price` ($) with `Living.Area`, the size of a home in square feet:

```{r}
homes %>% 
  ggplot(aes(y = Price, x = Living.Area)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```


#### Part a

Fit a linear regression model of `Price` by `Living.Area`:

```{r}
# Fit the model
home_mod <- ___(Price ~ Living.Area, ___)

# Display model summary output
___(home_mod)
```


#### Part b

In context, the *intercept* indicates that the average price of a 0 square foot home is \$13,439.394.
Confirm that you agree with this statement *and* explain why it isn't meaningful / sensible.


#### Part c

The issue here is that the "baseline" of `home_mod` is 0 square foot homes, but the *smallest* house is 616 square feet (far from 0):

```{r}
homes %>% 
  summarize(min(Living.Area))
```

If we want a more meaningful baseline, we can use a *location transformation* to "start" the `Living.Area` predictor at a more reasonable value (not 0).
Specifically, we can *center* this predictor at *600* square feet (a more meaningful number than 616 in this context) by defining a new predictor:

`Living.Area.Shifted = Living.Area - 600`

Fill in the code below to define this predictor:

```{r}
# Define Living.Area.Shifted
homes <- homes %>%
  ___(Living.Area.Shifted = Living.Area - 600)

# Check it out
homes %>% 
  select(Price, Living.Area, Living.Area.Shifted) %>% 
  head()
```



\
\
\
\



### Exercise 9: Location transformations (Part 2)

Consider a new model that uses the new `Living.Area.Shifted` predictor:

E[Price | Living.Area.Shifted] = $\beta_0$ + $\beta_1$ Living.Area.Shifted

We can interpret the model coefficients as follows:

- $\beta_0$ = the expected home price when (living area - 600) is 0, i.e. when living area is 600
- $\beta_1$ = the change in the expected home price associated with a 1 square foot increase in (living area - 600), hence a 1 square foot increase in living area


#### Part a

Use the above coefficient interpretations and the original `home_mod` (summarized below) to determine what the new coefficients should be:

- E[Price | Living.Area] = 13439.394 + 113.123 Living.Area
- E[Price | Living.Area.Shifted] = ??? + ??? Living.Area.Shifted



#### Part b

Check your intuition!

```{r}
# Fit a model of Price vs. Living.Area.Shifted
# Save this as home_mod_2
home_mod_2 <- lm(Price ~ Living.Area.Shifted, data = homes)

# Display the model summary
summary(home_mod_2)
```


#### Part c

So we now have 2 models of `Price` by the size of a home, as measured by `Living.Area` and `Living.Area.Shifted`:

- E[Price | Living.Area] = 13439.394 + 113.123 Living.Area
- E[Price | Living.Area.Shifted] = 81312.919 + 113.123 Living.Area.Shifted

As indicated by the equal slopes but differing intercepts, these models simply have different locations:

```{r}
# Don't worry about the code!!
homes %>% 
  select(Price, Living.Area, Living.Area.Shifted) %>% 
  pivot_longer(cols = -Price, names_to = "Predictor", values_to = "Size") %>% 
  ggplot(aes(y = Price, x = Size)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) + 
  facet_wrap(~ Predictor)
```
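
The new intercept can be recovered from the original model with the numbers reported above. This is an arithmetic sketch using the rounded estimates, so it agrees with the fitted value only up to rounding.

```{r}
# Rounded estimates reported above for Price vs Living.Area
intercept_orig <- 13439.394
slope <- 113.123
shift <- 600

# Expected price of a 600 square foot home under the original model,
# which is the baseline (intercept) of the shifted model
intercept_shifted <- intercept_orig + slope * shift
intercept_shifted

# Difference from the fitted intercept reported above (81312.919)
intercept_shifted - 81312.919
```

The difference is well under a dollar, which is consistent with the slope having been rounded to three decimal places.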

Importantly, they produce the same predictions!
For example, use both models to predict the price of a 1000 square foot home (without using the `predict()` function).
These two predictions should be the same, within rounding.

```{r}
# Predicting price with the home_mod
13439.394 + 113.123*___

# Predicting price with the home_mod_2
81312.919 + 113.123*___
```


#### Part d: CHALLENGE

Suppose we start with the model:

$E[Y | X] = \beta_0 + \beta_1 X$

Reflecting on your work above, summarize "rules" for how the intercept and slope are impacted if we perform a location transformation, X - a.
(Do these change? If so, *how* do they change? How does this change depend upon "a"?)
Either answer this in words, or by filling in the formula below:

$E[Y | X - a] = \_\_\_ + \_\_\_ (X - a)$



\
\
\
\



### Reflection

Two of the main motivations for transforming variables in our regression models are to (1) intentionally change the interpretation of regression coefficients, and (2) better satisfy linear regression assumptions (e.g. remove "patterns" from our residual plots). The first is nearly always justified by the scientific context of the research questions you are trying to answer, while the second is a bit muddier.

Think about the pros and cons of transforming your variables to satisfy linear regression assumptions. Is there a limit to how much you would be willing to transform your variables? Would transforming **too** much leave you with un-interpretable regression coefficients?

> **Response:** Put your response here.


\
\
\
\




## Extra exercises {-}


### Exercise 10: Univariate review

Recall our `bigmac` data:

```{r}
head(bigmac)
```

Included are 3 variables:

- `city`
- `bigmac_mins`: average number of minutes of work it takes to afford 1 Big Mac
- `income`: the average gross *teacher* salary in 1 year (USD)

Let's explore.

```{r}
# What were the bigmac_mins in New York in 2006?
# Show just these 2 variables.


# Show data on the city that had the smallest bigmac_mins in 2006

```

```{r}
# Construct and discuss a visualization of bigmac_mins
# Choose a visualization that helps you answer the below question:
# In roughly how many cities did it take between 50 and 75 minutes to afford a Big Mac?

```

```{r}
# Calculate the typical bigmac_mins across all cities in 2006
# (What's an appropriate measure here?)

```



\
\
\
\



::: {.callout-note title = "Discussing univariate visualizations"}

When discussing a visualization for a single variable, remember to comment on:

1. central tendency (what's typical?)
2. spread (how much variability is there?)
3. shape of the distribution (normal? right-skewed? left-skewed? something else?)
4. any outliers

:::






\
\
\
\




### Exercise 11: Modeling review

Fit the linear regression model of `bigmac_mins` vs `income`:

```{r}
# Fit the model
bigmac_model_1 <- ___(___, ___)
```

```{r}
# Get a model summary table

```

We already know this is a bad model!
Let's just use it to practice some concepts...

a. Write out the estimated model formula.


b. Interpret the `income` coefficient. Remember: context, units, averages (not individuals), and association (not causation).


c. Predict the number of minutes it takes to afford a Big Mac in a city with an average teacher income of \$4800.


d. Riga had an average teacher income of \$4800 and a "Big Mac time" of 28 minutes. Calculate its residual, i.e. prediction error.

```{r}

```


\
\
\
\


### Exercise 12: Model evaluation review

*1. Is it wrong?*

Construct an appropriate evaluative plot of `bigmac_model_1`.
Use it to discuss which of our LINE assumptions (linearity, independence, normality, equal variance) this model appears to violate.

```{r}
# Residual plot
___ %>% 
  ggplot(aes(y = ___, x = ___)) + 
  geom___() + 
  geom_hline(yintercept = ___)
```

\

*2. Is it strong?*

Answer this question using an appropriate metric, and interpret that metric.


\

*3. Is it fair?*

For which cities does this model give poor predictions, hence a misleading conclusion about Big Mac affordability?











## Done!

- Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer -- check that your work translated correctly; and (3) Outside RStudio, navigate to your `inclass_activities` subfolder within your `stat155` folder and locate the HTML file -- you can open it again in your browser.
- Clean up your RStudio session: End the rendering process by clicking the 'Stop' button in the 'Background Jobs' pane.
- Check the solutions in the course website, at the bottom of the corresponding chapter.
- Work on homework and/or any extra practice exercises!


