Linear Regression

Lecture 14

Published

June 5, 2025

While you wait…

Go to your ae project in RStudio.
Make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
Click Pull to get today’s application exercise file: ae-12-modeling-penguins.qmd.
Wait till the you’re prompted to work on the application exercise during class before editing the file.

Correlation vs. causation

Spurious correlations

Spurious correlations

Source: tylervigen.com/spurious-correlations

Linear regression with a single predictor

Data overview

movie_scores |>
  select(critics, audience)

# A tibble: 146 × 2
   critics audience
     <int>    <int>
 1      74       86
 2      85       80
 3      80       90
 4      18       84
 5      14       28
 6      63       62
 7      42       53
 8      86       64
 9      99       82
10      89       87
# ℹ 136 more rows

Data visualization

Data visualization: linear model

Data visualization: linear model

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   32.3      2.34        13.8 4.03e-28
2 critics        0.519    0.0345      15.0 2.70e-31

Data visualization: linear model

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   32.3      2.34        13.8 4.03e-28
2 critics        0.519    0.0345      15.0 2.70e-31

# A tibble: 1 × 1
      r
  <dbl>
1 0.781

Prediction: linear model

# A tibble: 146 × 4
   critics audience .pred .resid
     <int>    <int> <dbl>  <dbl>
 1      74       86  70.7  15.3 
 2      85       80  76.4   3.60
 3      80       90  73.8  16.2 
 4      18       84  41.7  42.3 
 5      14       28  39.6 -11.6 
 6      63       62  65.0  -2.99
 7      42       53  54.1  -1.10
 8      86       64  76.9 -12.9 
 9      99       82  83.7  -1.66
10      89       87  78.5   8.52
# ℹ 136 more rows

Linear Regression: How R Did It

Regression model

\[ Y \;=\; \mathbf{Model} \;+\; \text{Error} \]

\[ =\; \mathbf{f(X)} \;+\; \epsilon \]

\[ =\; \mu_{Y\mid X} \;+\; \epsilon \]

Regression model

\[ \begin{aligned} Y &= \color{#325b74}{\textbf{Model}} + \text{Error} \\[8pt] &= \color{#325b74}{\mathbf{f(X)}} + \epsilon \\[8pt] &= \color{#325b74}{\boldsymbol{\mu_{Y|X}}} + \epsilon \end{aligned} \]

Simple linear regression

Use simple linear regression to model the relationship between a quantitative outcome (\(Y\)) and a single quantitative predictor (\(X\)).

. . .

\[\Large{Y = \beta_0 + \beta_1 X + \epsilon}\]

\(Y\) : True \(Y\) values
\(\beta_1\): True slope of the relationship between \(X\) and \(Y\)
\(\beta_0\): True intercept of the relationship between \(X\) and \(Y\)
\(\epsilon\): Error (random noise)

Simple linear regression

Use simple linear regression to model the relationship between a quantitative outcome (\(Y\)) and a single quantitative predictor (\(X\)).

. . .

\[\Large{\hat{Y} = b_0 + b_1 X}\]

\(\hat{Y}\) : Fitted \(Y\) values
\(b_1\): Estimated slope of the relationship between \(X\) and \(Y\)
\(b_0\): Estimated intercept of the relationship between \(X\) and \(Y\)
No error term!

Choosing values for \(b_1\) and \(b_0\)

Residuals

\[\small{\text{residual} = \text{observed} - \text{predicted} = y - \hat{y}}\]

Notation

We have \(n\) observations (generally, the number of rows in a df)
\(i^{th}\) observation (\(i\) from \(1\) to \(N\)):
- \(y_i\) : \(i^{th}\) outcome
- \(x_i\) : \(i^{th}\) explanatory variable
- \(\hat{y}\) : \(i^{th}\) predicted outcome
- \(e\) : \(i^{th}\) residual

Notation: Example

Back to the movies: audience scores predicted by critic scores.

# A tibble: 146 × 4
   critics audience .pred .resid
     <int>    <int> <dbl>  <dbl>
 1      74       86  70.7  15.3 
 2      85       80  76.4   3.60
 3      80       90  73.8  16.2 
 4      18       84  41.7  42.3 
 5      14       28  39.6 -11.6 
 6      63       62  65.0  -2.99
 7      42       53  54.1  -1.10
 8      86       64  76.9 -12.9 
 9      99       82  83.7  -1.66
10      89       87  78.5   8.52
# ℹ 136 more rows

\(x_1 = 74\)
\(y_1 = 86\)
\(\hat{{y}}_1 \approx 71\)
\(e_1 \approx 15\)

Least squares line

The residual for the \(i^{th}\) observation is

\[e_i = \text{observed} - \text{predicted} = y_i - \hat{y}_i\]

. . .

The sum of squared residuals is

\[e^2_1 + e^2_2 + \dots + e^2_n\]

. . .

The least squares line is the one that minimizes the sum of squared residuals

Least squares line

Least squares line

movies_fit <- linear_reg() |>
  fit(audience ~ critics, data = movie_scores)

tidy(movies_fit)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   32.3      2.34        13.8 4.03e-28
2 critics        0.519    0.0345      15.0 2.70e-31

. . .

Let’s interpret this!

Slope and Intercept

Interpreting slope & intercept

\[\widehat{y} = b_0 + b_1 \times x\]

Slope: For every one increase in the value of \(x\), we expect \(y\) to be higher by \(b_1\), on average.
Intercept: If \(x = 0\), we expect \(y = b_0\)

Interpreting slope & intercept

\[\widehat{\text{audience}} = 32.3 + 0.519 \times \text{critics}\]

Slope: For every one point increase in the critics score, we expect the audience score to be higher by 0.519 points, on average.
Intercept: If the critics score is 0 points, we expect the audience score to be 32.3 points.

Is the intercept meaningful?

✅ The intercept is meaningful in context of the data if

the predictor can feasibly take values equal to or near zero or
the predictor has values near zero in the observed data

. . .

🛑 Otherwise, it might not be meaningful!

Properties of least squares regression

Slope (\(b_1\)) has the same sign as the correlation coefficient (\(r\)): \(b_1 = r \frac{s_Y}{s_X}\)
The regression line goes through the center of mass point (the coordinates corresponding to average \(X\) and average \(Y\))
Sum of the residuals is zero

Application exercise

ae-12-modeling-penguins

Go to your ae project in RStudio.
If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.
If you haven’t yet done so, click Pull to get today’s application exercise file: ae-12-modeling-penguins.qmd.
Work through the application exercise in class, and render, commit, and push your edits.

Regression with Categorical Variables

What does \(b_0 +b_1*x\) even mean when \(x\) is categorical ???

. . .

Suppose variable island can take values A, B, or C. We want to model variable mass based on island.

. . .

We tell R to fit mass ~ island … now what?

. . .

We get dummy variables!!!

Dummy Variables

mass	island
10	A
30	C
20	B
15	A

Dummy Variables

mass	island
10	A
30	C
20	B
15	A

mass	island	A	B	C
10	A
30	C
20	B
15	A

. . .

Important

In a given row, only one of the dummy variables for a given categorical variable can equal 1.

What will a regression output look like?

\[ mass = b_0 + b_2 * B + b_3 *C \]

. . .

What is the estimated weight of a penguin from island A?

. . .

What is the estimated weight of a penguin from island B?

. . .

What is the estimated weight of a penguin from island C?

. . .

While you wait…

Correlation vs. causation

Spurious correlations

Spurious correlations

Linear regression with a single predictor

Data overview

Data visualization

Data visualization: linear model

Data visualization: linear model

Data visualization: linear model

Prediction: linear model

Linear Regression: How R Did It

Regression model

Regression model

Simple linear regression

Simple linear regression

Choosing values for \(b_1\) and \(b_0\)

Residuals

Notation

Notation: Example

Least squares line

Least squares line

Least squares line

Slope and Intercept

Interpreting slope & intercept

Interpreting slope & intercept

Is the intercept meaningful?

Properties of least squares regression

Application exercise

ae-12-modeling-penguins

Regression with Categorical Variables

Regression with Categorical Variables

Dummy Variables

Dummy Variables

What will a regression output look like?

Let’s take a look at the actual penguins data. Back to the AE!