AE 17:Hypothesis Testing
In this application exercise, we will do hypothesis testing for the slope in the linear model.
Packages
We will use tidyverse and tidymodels for data exploration and modeling, respectively, and the openintro package for the data.
Data
In this exercise we will be studying the relationship between mean household income and the percent of population without insurance in counties in the United States. The contains results for 3142 counties’ results from the 2019 American Community Survey (this is all counties at the time). The data set county_2019
is from the openintro
package. Go ahead and take a look at the data using glimpse
; you can get more information with ?county_2019
!
# add code here
Now, let’s imagine we only had a very small subset of these data to work with. That is, we only have data from a small subset of counties instead of from all counties. While that is not the case in this data set, we could easily images cases/studies where this might be true.
In the code below, we set a seed then take a random sample of 25 counties from our county_2019
data.
Question: With so little information, can we draw super strong conclusions?
Plot
Task: First, plot the relationship between mean household income mean_household_income
and uninsured
for the full data set.
# add code here
Task: Next, plot the relationship between mean household income mean_household_income
and uninsured
for the small sample data set.
# add code here
Inference with the small sample dataset
Point estimate
Task: First, compute the point estimate using the small sample dataset. Tip: This should be done in the same way we did during our confidence interval lesson (using specify
).
# add code here
Simulate the null distribution
Simulate the null distribution. We are testing \(H_0: \beta_1=0\) versus the alternative \(H_A: \beta_1\neq 0\).
set.seed(123)
<- ________ |>
null_dist specify(______ ~ _______) |>
hypothesize(null = ________) |>
generate(reps = 1000, type = ________) |>
fit()
Now, we are going to visualize the null distribution. Note that it’s centered at zero, because if the null were true and the true slope was in fact zero, we would expect noisy, imperfect estimates of the slope to wiggle around 0.
|>
null_dist filter(______ == ______) |>
ggplot(aes(x = _______)) +
geom_histogram()
Where does our actual point estimate fall under the null distribution?
Shade the \(p\)-value:
_______(null_dist) +
_______(obs_stat = observed_fit, direction = "two-sided")
Compute the \(p\)-value:
|>
null_dist _________(obs_stat = observed_fit, direction = "two-sided")
Interpretation: Explain how we can interpret the p-value here.