Logistic regression

Lecture 18

Published

June 11, 2025

While you wait…

  • Go to your ae project in RStudio.

  • Make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.

  • Click Pull to get today’s application exercise file: ae-14-spam-filter.qmd.

  • Wait until you’re prompted to work on the application exercise during class before editing the file.

Thus far…

We have been studying regression:

  • What combinations of data types have we seen?

  • What did the picture look like?

Linear Models: Just because you can…

… doesn’t mean you should!!

. . .


Linear models have infinite range

Today: a binary response

Categorical with two levels (0 or 1).

. . .

  • Yes (1) vs. No (0)
  • Win (1) vs. Lose (0)
  • True (1) vs. False (0)
  • Heads (1) vs. Tails (0)
  • And so much more!

. . .

\[ y = \begin{cases} 1 & \text{e.g., Yes, Win, True, Heads, ...}\\ 0 & \text{e.g., No, Lose, False, Tails, ...} \end{cases} \]

Example Plot

df
       x y
1  -1.00 0
2   0.72 1
3  -0.62 0
4   2.03 1
5   1.07 1
6   0.99 1
7   0.03 1
8   0.67 1
9   0.57 1
10  0.90 1

Who cares?

If we can model the relationship between predictors (\(x\)) and a binary response (\(y\)), we can use the model to do a special kind of prediction called classification.

Example: is the e-mail spam or not?

\[ \mathbf{x}: \text{word and character counts in an e-mail.} \]

\[ y = \begin{cases} 1 & \text{it's spam}\\ 0 & \text{it's legit} \end{cases} \]

Example: is it cancer or not?

\[ \mathbf{x}: \text{features in a medical image.} \]

\[ y = \begin{cases} 1 & \text{it's cancer}\\ 0 & \text{it's healthy} \end{cases} \]

Example: will they default?

\[ \mathbf{x}: \text{financial and demographic info about a loan applicant.} \]

\[ y = \begin{cases} 1 & \text{applicant is at risk of defaulting on loan}\\ 0 & \text{applicant is safe} \end{cases} \]

How do we model this type of data?

Straight line of best fit is a little silly

Modeling probabilities

Instead of modeling \(y\) directly, let’s model the probability that \(y=1\):

  • “Given a new email, what’s the probability that it’s spam?”
  • “Given a new image, what’s the probability that it’s cancer?”
  • “Given a new loan application, what’s the probability that the applicant defaults?”

Modeling probabilities: lines are still silly

Instead: S-curve of best fit

Why don’t we model y directly?

  • Recall regression with a numerical response:

    Our models do not output guarantees for \(y\); they output predictions that describe behavior on average.

. . .


  • Similar when modeling a binary response:

    Our models cannot directly guarantee that \(y\) will be zero or one. The correct analog to “on average” for a 0/1 response is “what’s the probability?”

On average vs. What’s the probability?

Let’s suppose I’m classifying emails as spam (\(y = 1\)) vs. legit (\(y = 0\)). At some given length (suppose \(x = 500\) words), I see that:

  • 8 emails were spam

  • 2 emails were legit

. . .

What does it mean to average these together?

. . .

\[ \frac{1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 0}{10} = \frac{8}{10} = 0.8 \]
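In R, that average is just the sample proportion of spam emails at this length (a quick check):

mean(c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0))   # proportion of 1s: 0.8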

Again: S-curve of best fit

So, what is this S-curve, anyways?

It’s the logistic function:

\[ \text{Prob}(y = 1) = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}. \]

. . .

If you set \(p = \text{Prob}(y = 1)\) and do some algebra, you get the simple linear model for the log-odds:

. . .

\[ \log\left(\frac{p}{1-p}\right) = \beta_0+\beta_1x. \]

. . .

This is called the logistic regression model.
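To see the S-shape numerically: base R’s plogis() evaluates exactly this function, \(e^t/(1+e^t)\), and its output always lands strictly between 0 and 1, no matter how extreme the input:

plogis(c(-5, -1, 0, 1, 5))   # roughly 0.007, 0.269, 0.500, 0.731, 0.993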

Log-odds?

  • \(p = \text{Prob}(y = 1)\) is a probability. A number between 0 and 1;

  • \(\frac{p}{(1 - p)}\) is the odds. A number between 0 and \(\infty\);

    • 80% probability an email is spam; 20% probability it’s legit

    • The odds an email is spam are 4 to 1

  • The log odds \(\log(\frac{p}{1 - p})\) is a number between \(-\infty\) and \(\infty\), which is suitable for the linear model.

. . .

\[ \log\left(\frac{p}{1-p}\right) = \beta_0+\beta_1x. \]
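Walking one example up the chain in R:

p <- 0.8          # probability an email is spam
p / (1 - p)       # odds: 4 (i.e., 4 to 1)
log(p / (1 - p))  # log-odds: about 1.386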

Probability to odds

Odds to log odds

Logistic regression

\[ \log\left(\frac{p}{1-p}\right) = \beta_0+\beta_1x. \]

  • The logit function \(\log(\frac{p}{1-p})\) is an example of a link function: it maps probabilities in \((0, 1)\) to \((-\infty, \infty)\), the natural range of the linear model;

  • This is an example of a generalized linear model

Estimation

  • We estimate the parameters \(\beta_0,\,\beta_1\) using maximum likelihood (don’t worry about it) to get the “best fitting” S-curve;

  • The fitted model is

. . .

\[ \log\left(\frac{\widehat{p}}{1-\widehat{p}}\right) = b_0+b_1x. \]

Today’s data

# A tibble: 3,921 × 6
   spam  dollar viagra winner password exclaim_mess
   <fct>  <dbl>  <dbl> <fct>     <dbl>        <dbl>
 1 0          0      0 no            0            0
 2 0          0      0 no            0            1
 3 0          4      0 no            0            6
 4 0          0      0 no            0           48
 5 0          0      0 no            2            1
 6 0          0      0 no            2            1
 7 0          0      0 no            0            1
 8 0          0      0 no            0           18
 9 0          0      0 no            0            1
10 0          0      0 no            0            0
# ℹ 3,911 more rows

Fitting a logistic model

library(tidymodels)  # logistic_reg(), fit(), and tidy() come from tidymodels
logistic_fit <- logistic_reg() |>
  fit(spam ~ exclaim_mess, data = email)

tidy(logistic_fit)
# A tibble: 2 × 5
  term          estimate std.error statistic p.value
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  -2.27      0.0553     -41.1     0    
2 exclaim_mess  0.000272  0.000949     0.287   0.774

Fitting a logistic model

tidy(logistic_fit)
# A tibble: 2 × 5
  term          estimate std.error statistic p.value
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  -2.27      0.0553     -41.1     0    
2 exclaim_mess  0.000272  0.000949     0.287   0.774

. . .

Fitted equation for the log-odds:

\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = -2.27 + 0.000272\times \text{exclaim\_mess} \]

Be careful!!

💖✅ This is correct ✅💖

\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = -2.27 + 0.000272\times \text{exclaim\_mess} \]

❌🛑 These are wrong! Do not do this! ❌🛑

\[ \widehat{spam} = -2.27 + 0.000272\times \text{exclaim\_mess} \]

\[ \widehat{p} = -2.27 + 0.000272\times \text{exclaim\_mess} \]

Interpreting the intercept

Plug in \(x = 0\):

\[ \log\left(\frac{\widehat{p}}{1-\widehat{p}}\right) = b_0+b_1x. \]

. . .

When \(x = 0\), the estimated probability that \(y = 1\) is

\[ \hat{p} = \frac{e^{b_0}}{1+e^{b_0}} \]

Interpreting the intercept: emails

If exclaim_mess = 0, then

\[ \hat{p}=\widehat{P(y=1)}=\frac{e^{-2.27}}{1+e^{-2.27}}\approx 0.09. \]

So, we estimate that an email with no exclamation marks has a 9% chance of being spam.
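As a quick check in R:

b0 <- -2.27
exp(b0) / (1 + exp(b0))   # about 0.094; plogis(b0) gives the same thing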

Interpreting the slope is tricky

Recall:

\[ \log\left(\frac{\widehat{p}}{1-\widehat{p}}\right) = b_0+b_1x. \]

. . .

Alternatively:

\[ \frac{\widehat{p}}{1-\widehat{p}} = e^{b_0+b_1x} = \color{blue}{e^{b_0}e^{b_1x}} . \]

. . .

If we increase \(x\) by one unit, we have:

\[ \frac{\widehat{p}}{1-\widehat{p}} = e^{b_0}e^{b_1(x+1)} = e^{b_0}e^{b_1x+b_1} = {\color{blue}{e^{b_0}e^{b_1x}}}{\color{red}{e^{b_1}}} . \]

. . .

A one unit increase in \(x\) is associated with a change in odds by a factor of \(e^{b_1}\). Gross!

Sign of the slope is meaningful

A one unit increase in \(x\) is associated with a change in odds by a factor of \(e^{b_1}\).

. . .

  • A positive slope means increasing \(x\) increases the odds (and probability!) that \(y = 1\)
  • A negative slope means increasing \(x\) decreases the odds (and probability!) that \(y = 1\)

Back to the example…

\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = -2.27 + 0.000272\times \text{exclaim\_mess} \]

For each additional exclamation mark in an email, we predict the odds of it being spam to be higher by a factor of \(e^{0.000272}\approx 1.000272\) on average.
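In R:

exp(0.000272)   # about 1.000272: a very weak association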

Logistic regression → classification?

Step 0: fit the model

Step 1: pick a threshold

Select a number \(0 < p^* < 1\):

  • if \(\text{Prob}(y=1)\leq p^*\), then predict \(\widehat{y}=0\);
  • if \(\text{Prob}(y=1)> p^*\), then predict \(\widehat{y}=1\).

Step 2: find the “decision boundary”

Solve for the \(x\)-value where the predicted probability equals the threshold (the algebra follows the rules below):

  • if \(\text{Prob}(y=1)\leq p^*\), then predict \(\widehat{y}=0\);
  • if \(\text{Prob}(y=1)> p^*\), then predict \(\widehat{y}=1\).
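Setting the log-odds at the threshold and solving for \(x\) gives the boundary explicitly:

\[ b_0 + b_1x^\star = \log\left(\frac{p^*}{1-p^*}\right) \quad\Longrightarrow\quad x^\star = \frac{\log\left(\frac{p^*}{1-p^*}\right) - b_0}{b_1}. \]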

Step 3: classify a new arrival

A new person shows up with \(x_{\text{new}}\). Which side of the boundary are they on?

  • if \(x_{\text{new}} \leq x^\star\), then \(\text{Prob}(y=1)\leq p^*\), so predict \(\widehat{y}=0\) for the new person;
  • if \(x_{\text{new}} > x^\star\), then \(\text{Prob}(y=1)> p^*\), so predict \(\widehat{y}=1\) for the new person.
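A minimal sketch of this step with tidymodels, assuming the logistic_fit from earlier; the new email and its 10 exclamation marks are made up for illustration, and type = "class" uses parsnip’s default threshold \(p^* = 0.5\):

new_email <- tibble(exclaim_mess = 10)   # hypothetical new arrival

predict(logistic_fit, new_data = new_email, type = "prob")    # estimated probabilities
predict(logistic_fit, new_data = new_email, type = "class")   # 0/1 prediction at p* = 0.5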

Let’s change the threshold

A new person shows up with \(x_{\text{new}}\). Which side of the boundary are they on?

  • if \(x_{\text{new}} \leq x^\star\), then \(\text{Prob}(y=1)\leq p^*\), so predict \(\widehat{y}=0\) for the new person;
  • if \(x_{\text{new}} > x^\star\), then \(\text{Prob}(y=1)> p^*\), so predict \(\widehat{y}=1\) for the new person.


Nothing special about one predictor…

Two numerical predictors and one binary response:

“Multiple” logistic regression

On the probability scale:

\[ \text{Prob}(y = 1) = \frac{e^{\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_mx_m}}{1+e^{\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_mx_m}}. \]

. . .

For the log-odds, a multiple linear regression:

\[ \log\left(\frac{p}{1-p}\right) = \beta_0+\beta_1x_1+\beta_2x_2+...+\beta_mx_m. \]
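Fitting works exactly as before: extra predictors just join the model formula. One possible fit with today’s data (the choice of predictors here is illustrative):

logistic_fit2 <- logistic_reg() |>
  fit(spam ~ exclaim_mess + winner + dollar, data = email)

tidy(logistic_fit2)   # one estimated coefficient per predictor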

Decision boundary, again

Consider two numerical predictors:

  • if new \((x_1,\,x_2)\) below, \(\text{Prob}(y=1)\leq p^*\). Predict \(\widehat{y}=0\) for the new person;
  • if new \((x_1,\,x_2)\) above, \(\text{Prob}(y=1)> p^*\). Predict \(\widehat{y}=1\) for the new person.

Decision boundary, again

It’s linear! Consider two numerical predictors:

  • if new \((x_1,\,x_2)\) below, \(\text{Prob}(y=1)\leq p^*\). Predict \(\widehat{y}=0\) for the new person;
  • if new \((x_1,\,x_2)\) above, \(\text{Prob}(y=1)> p^*\). Predict \(\widehat{y}=1\) for the new person.
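Why linear? Setting \(\text{Prob}(y = 1) = p^*\) in the log-odds equation gives

\[ b_0 + b_1x_1 + b_2x_2 = \log\left(\frac{p^*}{1-p^*}\right), \]

the equation of a straight line in the \((x_1,\,x_2)\) plane.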


Note: the classifier isn’t perfect

  • Blue points in the orange region: spam (1) emails misclassified as legit (0);
  • Orange points in the blue region: legit (0) emails misclassified as spam (1).

How do you pick the threshold?

To balance out the two kinds of errors:

  • High threshold:
    • Harder to classify as 1, so false positives are less likely and false negatives more likely
  • Low threshold:
    • Easier to classify as 1, so false positives are more likely and false negatives less likely

Silly examples

  • Set \(p^* = 0\)

    • Classify every email as spam (1);
    • No false negatives, but a lot of false positives;
  • Set \(p^* = 1\)

    • Classify every email as legit (0);
    • No false positives, but a lot of false negatives.

You pick a threshold in between to strike a balance. The exact number depends on context.
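A minimal sketch of that balancing act, assuming the logistic_fit and email data from earlier; the probability column name follows parsnip’s .pred_<level> convention for a factor with levels 0 and 1:

p_star <- 0.5   # try different values and watch the error counts shift

email |>
  mutate(
    p_hat = predict(logistic_fit, new_data = email, type = "prob")$.pred_1,
    y_hat = if_else(p_hat > p_star, "1", "0")
  ) |>
  count(spam, y_hat)   # the off-diagonal combinations are the two error types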

ae-14-spam-filter

  • Go to your ae project in RStudio.

  • If you haven’t yet done so, make sure all of your changes up to this point are committed and pushed, i.e., there’s nothing left in your Git pane.

  • If you haven’t yet done so, click Pull to get today’s application exercise file: ae-14-spam-filter.qmd.

  • Work through the application exercise in class, and render, commit, and push your edits.