AE 17:Hypothesis Testing

Application exercise

In this application exercise, we will do hypothesis testing for the slope in the linear model.

Packages

We will use tidyverse and tidymodels for data exploration and modeling, respectively, and the openintro package for the data.

Data

In this exercise we will be studying the relationship between mean household income and the percent of population without insurance in counties in the United States. The contains results for 3142 counties’ results from the 2019 American Community Survey (this is all counties at the time). The data set county_2019 is from the openintro package. Go ahead and take a look at the data using glimpse; you can get more information with ?county_2019!

# add code here

Now, let’s imagine we only had a very small subset of these data to work with. That is, we only have data from a small subset of counties instead of from all counties. While that is not the case in this data set, we could easily images cases/studies where this might be true.

In the code below, we set a seed then take a random sample of 25 counties from our county_2019 data.

set.seed(123)
counties_sample <- county_2019 |>
  sample_n(25)

Question: With so little information, can we draw super strong conclusions?

Plot

Task: First, plot the relationship between mean household income mean_household_income and uninsured for the full data set.

# add code here

Task: Next, plot the relationship between mean household income mean_household_income and uninsured for the small sample data set.

# add code here

Inference with the small sample dataset

Point estimate

Task: First, compute the point estimate using the small sample dataset. Tip: This should be done in the same way we did during our confidence interval lesson (using specify).

# add code here

Simulate the null distribution

Simulate the null distribution. We are testing \(H_0: \beta_1=0\) versus the alternative \(H_A: \beta_1\neq 0\).

set.seed(123)
null_dist <- ________ |>
  specify(______ ~ _______) |>
  hypothesize(null =  ________) |>
  generate(reps = 1000, type = ________) |>
  fit()

Now, we are going to visualize the null distribution. Note that it’s centered at zero, because if the null were true and the true slope was in fact zero, we would expect noisy, imperfect estimates of the slope to wiggle around 0.

null_dist |> 
  filter(______ == ______) |>
  ggplot(aes(x = _______)) + 
  geom_histogram()

Where does our actual point estimate fall under the null distribution?

Shade the \(p\)-value:

_______(null_dist) +
  _______(obs_stat = observed_fit, direction = "two-sided")

Compute the \(p\)-value:

null_dist |>
  _________(obs_stat = observed_fit, direction = "two-sided")

Interpretation: Explain how we can interpret the p-value here.