
Lecture 20
June 12, 2025
Project timeline and progress:
Quick Recap!


Generally, we don’t have information about an entire population; we only have data for a sample:
Election polling: we can’t ask everybody who they are voting for
Housing data: we don’t have a data set with every single house
Medical research: we don’t test a new drug on everyone, just on a sample of patients in a clinical trial.
Ecology studies: scientists might analyze a smaller sample of animals/plants
There is uncertainty about true population parameters.
Statistical inference provides methods and tools so we can use the single observed sample to make valid statements (inferences) about the population it comes from.
For our inferences to be valid, the sample should be random and representative of the population we’re interested in.
Suppose this histogram represents some value \(x\) in a population of 10,000:

[Figure sequence: histogram of the population, then histograms of samples drawn from it, each labeled with its mean.]
Population mean: 55.06
Sample means, in the order shown: 55.9, 47.8, 56.08, 54.38, 59.24, 53.29, 51.59, 55.24, 57.99, 56.54, 55.55, 54.84, 54.98, 54.58, 55.57, 55.12, 55.29, 55.07, 55.06, 55.05
How confident would you feel stating that the population mean is equal to any of these values?
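To see why a single sample mean deserves limited confidence, the following is a minimal base R sketch of the same idea. The population here is simulated (it is not the lecture's data), and the sample sizes are assumptions chosen only for illustration.

```r
# Hypothetical population of 10,000 values (simulated, not the lecture's data)
set.seed(20)
population <- rnorm(10000, mean = 55, sd = 15)
pop_mean <- mean(population)

# Draw one random sample (without replacement) at each assumed sample size
sample_sizes <- c(10, 100, 1000, 5000)
sample_means <- sapply(sample_sizes, function(n) {
  mean(sample(population, size = n))
})

# Compare each sample mean with the population mean
data.frame(
  n = sample_sizes,
  sample_mean = sample_means,
  error = sample_means - pop_mean
)
```

Rerunning with a different seed gives different sample means each time, which is the sampling variability the question above is pointing at.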
Suppose this scatter plot represents some values \(x, y\) in a population of 10,000:

[Figure sequence: scatter plot of the population, then scatter plots of samples drawn from it, each labeled with its slope.]
Population slope: 0.81
Sample slopes, in the order shown: 5.97, 1.31, 2.57, -5.51, 1.08, 1.01, 1.41, 0.45, 1.06, 0.68, 1.05, 0.84, 0.64, 0.77, 0.72, 0.82
How confident would you feel stating that the population slope is equal to any of these values?
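The same experiment can be sketched for a regression slope. This is an illustration under assumptions: the population of (x, y) pairs is simulated with a true slope of 0.8, the sample sizes are arbitrary, and `slope_of()` is a hypothetical helper defined here.

```r
# Hypothetical population of 10,000 (x, y) pairs (simulated, not the lecture's data)
set.seed(20)
N <- 10000
x <- runif(N, 0, 100)
y <- 10 + 0.8 * x + rnorm(N, sd = 20)
population <- data.frame(x = x, y = y)

# Helper (hypothetical name): least-squares slope of y on x
slope_of <- function(d) unname(coef(lm(y ~ x, data = d))["x"])

pop_slope <- slope_of(population)

# Fit the same model to one random sample at each assumed sample size
sample_sizes <- c(5, 50, 500, 5000)
sample_slopes <- sapply(sample_sizes, function(n) {
  rows <- sample(N, size = n)
  slope_of(population[rows, ])
})

data.frame(n = sample_sizes, sample_slope = sample_slopes)
```

Small samples can produce slopes far from the population value, and even slopes with the wrong sign, much like the first few sample slopes listed above.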
How can we give a range of plausible values for a population parameter using only the sample data we have?
Idea: use the sample to take more samples?
Method: bootstrapping!
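A minimal base R sketch of the bootstrap idea, assuming a simulated sample of 50 values and 1,000 resamples (both numbers are illustrative choices, not from the lecture): resample the observed data with replacement, at the original sample size, and recompute the statistic each time.

```r
# Hypothetical observed sample (simulated for illustration)
set.seed(20)
observed_sample <- rnorm(50, mean = 55, sd = 15)

# Bootstrap: resample WITH replacement, same size as the original sample,
# and recompute the statistic of interest (here, the mean) each time
n_boot <- 1000
boot_means <- replicate(n_boot, {
  resample <- sample(observed_sample,
                     size = length(observed_sample),
                     replace = TRUE)
  mean(resample)
})

# The spread of the bootstrap means estimates the sampling variability
sd(boot_means)
```

Sampling with replacement is what makes each resample differ from the original sample even though it has the same size.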







Find range of plausible values for a slope using bootstrap confidence intervals.
openintro::duke_forest

Goal: Use the area (in square feet) to understand variability in the price of houses in Duke Forest.
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  116652.   53302.       2.19 3.11e- 2
2 area            159.      18.2      8.78 6.29e-14
Intercept: Duke Forest houses that are 0 square feet are expected to sell for $116,652, on average.
Slope: For each additional square foot, we expect the sale price of Duke Forest houses to be higher by $159, on average.
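As a worked example of these interpretations, the sketch below plugs the rounded estimates from the output above into the fitted line. The function name `predict_price()` and the 2,000 square foot house are hypothetical, chosen only for illustration.

```r
# Rounded estimates taken from the model output above
intercept <- 116652
slope <- 159

# Hypothetical helper: predicted price (in dollars) for a given area
predict_price <- function(area_sqft) intercept + slope * area_sqft

predict_price(2000)
# 116652 + 159 * 2000 = 434652

# One extra square foot changes the prediction by exactly the slope
predict_price(2001) - predict_price(2000)
# 159
```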
Calculate a confidence interval for the slope, \(\beta_1\) (today)
Conduct a hypothesis test for the slope, \(\beta_1\) (next week)
A confidence interval will allow us to make a statement like “For each additional square foot, the model predicts the sale price of Duke Forest houses to be higher, on average, by $159, plus or minus X dollars.”
Should X be $10? $100? $1000?
If we were to take another sample of 98 houses, would we expect the slope calculated from that sample to be exactly $159? Off by $10? $100? $1000?
Bootstrapping to quantify the variability of the slope for the purpose of estimation:










so on and so forth…




Fill in the blank: For each additional square foot, the model predicts the sale price of Duke Forest houses to be higher, on average, by $159, plus or minus ___ dollars.

Fill in the blank: For each additional square foot, we expect the sale price of Duke Forest houses to be higher, on average, by $159, plus or minus ___ dollars.

How confident are you that the true slope is between $0 and $250? How about $150 and $170? How about $90 and $210?


Calculate the observed slope:
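One way this step might look, assuming the infer workflow used for the bootstrap fits. The object name `observed_fit` is taken from the `get_confidence_interval()` calls later in the lecture; the exact code behind it is an assumption.

```r
library(infer)      # specify(), fit()
library(openintro)  # duke_forest data

# Fit price ~ area once on the original (observed) sample
observed_fit <- duke_forest |>
  specify(price ~ area) |>
  fit()

observed_fit
```

The result should be a two-row tibble with `term` and `estimate` columns, one row for the intercept and one for `area`.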
Take 100 bootstrap samples and fit models to each one:
library(tidymodels)  # attaches infer: specify(), generate(), fit()
library(openintro)   # provides duke_forest

set.seed(1234)
n <- 100  # number of bootstrap resamples
boot_fits <- duke_forest |>
  specify(price ~ area) |>
  generate(reps = n, type = "bootstrap") |>
  fit()
boot_fits
# A tibble: 200 × 3
# Groups:   replicate [100]
   replicate term      estimate
       <int> <chr>        <dbl>
 1         1 intercept  177388.
 2         1 area          125.
 3         2 intercept  161078.
 4         2 area          150.
 5         3 intercept  202354.
 6         3 area          118.
 7         4 intercept  120750.
 8         4 area          162.
 9         5 intercept   52127.
10         5 area          180.
# ℹ 190 more rows
Percentile method: Compute the 95% CI as the middle 95% of the bootstrap distribution:
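To make the percentile method concrete, here is a by-hand sketch in base R. It uses a simulated house-like sample rather than `duke_forest`, and the data frame `houses`, the coefficients used to simulate it, and the 1,000 resamples are all assumptions for illustration.

```r
# Hypothetical sample of 98 houses (simulated, not duke_forest)
set.seed(1234)
n <- 98
area <- runif(n, 1000, 5000)
price <- 100000 + 160 * area + rnorm(n, sd = 150000)
houses <- data.frame(area, price)

# Bootstrap the slope: resample rows with replacement and refit each time
boot_slopes <- replicate(1000, {
  rows <- sample(n, size = n, replace = TRUE)
  unname(coef(lm(price ~ area, data = houses[rows, ]))["area"])
})

# Percentile method: the middle 95% of the bootstrap distribution,
# i.e. its 2.5th and 97.5th percentiles
ci_95 <- quantile(boot_slopes, probs = c(0.025, 0.975))
ci_95
```

For a different confidence level, only the `probs` argument changes, for example `c(0.05, 0.95)` for 90%.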
If we want to be very certain that we capture the population parameter, should we use a wider or a narrower interval? What drawbacks are associated with using a wider interval?

How can we get the best of both worlds – high precision and high accuracy?
How would you modify the following code to calculate a 90% confidence interval? How would you modify it for a 99% confidence interval?
## confidence level: 90%
get_confidence_interval(
  boot_fits, point_estimate = observed_fit,
  level = 0.90, type = "percentile"
)
# A tibble: 2 × 3
  term      lower_ci upper_ci
  <chr>        <dbl>    <dbl>
1 area          96.1     197.
2 intercept  20626.   268049.

## confidence level: 99%
get_confidence_interval(
  boot_fits, point_estimate = observed_fit,
  level = 0.99, type = "percentile"
)
# A tibble: 2 × 3
  term      lower_ci upper_ci
  <chr>        <dbl>    <dbl>
1 area          69.7     214.
2 intercept -25879.   351614.
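A short arithmetic check of the trade-off, using the slope (`area`) bounds as printed in the two outputs above. The printed bounds are rounded, so the widths are approximate.

```r
# Slope (area) bounds copied from the 90% and 99% outputs above
ci <- data.frame(
  level = c(0.90, 0.99),
  lower = c(96.1, 69.7),
  upper = c(197, 214)
)

# Width of each interval, in dollars per square foot
ci$width <- ci$upper - ci$lower
ci
# widths: about 100.9 (90%) and 144.3 (99%)
```

The 99% interval is wider than the 90% interval: higher confidence of capturing the population slope comes at the cost of precision.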
Population: Complete set of observations of whatever we are studying, e.g., people, tweets, photographs, etc. (population size = \(N\))
Sample: Subset of the population, ideally random and representative (sample size = \(n\))
Sample statistic \(\ne\) population parameter, but if the sample is good, it can be a good estimate
Statistical inference: Discipline that concerns itself with extracting meaning and information from data that has been generated by a random process
We report the estimate with a confidence interval. The width of this interval depends on the variability of sample statistics across different samples from the population.
Since we can’t continue sampling from the population, we bootstrap from the one sample we have to estimate sampling variability.