AE 18: Stocks!!

Application exercise

Today we’ll explore the question “How do stock prices of Apple and Microsoft relate to each other?”

Packages

library(tidyverse)
library(tidymodels)

Data

The data for this application exercise was originally gathered using the tidyquant R package. It features Apple and Microsoft stock prices from January 1st 2020 to December 31st 2021.

First, let’s load the data:

stocks <- read_csv("data/stocks.csv")

To keep things simple, we’ll work with a subset of the data: stock prices from January 2020.

stocks_jan2020 <- stocks |>
  filter(month(date) == 1 & year(date) == 2020)

glimpse(stocks_jan2020)

These are daily data. For each stock, we have several variables, including:

Open: price at the start of the trading day;
High: highest price during the trading day;
Low: lowest price during the trading day;
Close: price at the end of the trading day.

We will focus on using the Microsoft open to predict the Apple open.

Visualize

Your turn: Plot the data and the line of best fit.

# add code here

Point Estimation

Your turn: Fit the linear model to these data. Save the result as observed_fit and view it in a tidy way.

# add code here

Note

This gives the exact same numbers that you get if you use linear_reg() |> fit(), but we need this new syntax because it plays nice with the tools we have for confidence intervals and hypothesis tests. I know, I hate it too, but it’s the way it is.
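As a quick sanity check on that claim, here is a minimal sketch on simulated data. The toy data frame and its columns x and y are made up for illustration and are not the stock data. It compares the estimates from infer's specify() |> fit() with those from base R's lm(), which is the default engine behind linear_reg(). The two sets of numbers should agree up to floating-point error.

```r
library(infer)

# Toy data, made up for illustration (not the stocks data)
set.seed(1)
toy <- data.frame(x = runif(30, 150, 180))
toy$y <- 5 + 0.4 * toy$x + rnorm(30, sd = 2)

# infer syntax: declare the model with specify(), then fit()
infer_fit <- toy |>
  specify(y ~ x) |>
  fit()

# Base R least squares for comparison
lm_coefs <- coef(lm(y ~ x, data = toy))

infer_fit
lm_coefs
```

The infer result is a tibble with one row per term, while lm() returns a named vector, so only the layout differs.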

Your turn: Typeset the equation for the model fit:

Add answer here.

Your turn: Interpret the slope and the intercept estimates:

Add answer here.

Interval Estimation

Fill in the blanks: Using seed 8675309, generate 500 bootstrap samples, and store them in a new data frame called bstrap_samples.

set.seed(8675309)
bstrap_samples <- stocks_jan2020 |>
  _______(_____ ~ _____) |>
  ________(reps = _____, type = ______)

Fill in the blank: Fit a linear model to each of these bootstrap samples and store the estimates in a new data frame called bstrap_fits.

bstrap_fits <- ______ |> 
  fit()

Demo: Examine the code below. Here, we are using linear_reg() |> fit(...) to fit a linear model to bootstrap sample number 347, and verifying that we get the same estimates as the ones contained in bstrap_fits.

replicate_347 <- bstrap_samples |>
  filter(
    replicate == 347
  )

linear_reg() |>
  fit(AAPL.Open ~ MSFT.Open, data = replicate_347) |>
  tidy()

bstrap_fits |>
  filter(replicate == 347)

Note

The only point I’m making here is that this new bootstrap code is not performing a fundamentally new task. It’s performing an old task (fitting the linear model), but it’s repeating it A LOT. So the numbers you get here are not mysterious. They’re numbers you already know how to compute.
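To make the "old task, repeated a lot" idea concrete, here is a rough sketch of a bootstrap written by hand in base R. It uses simulated data (the toy data frame, its columns x and y, and the helper fit_one_resample() are all made up for illustration), and it is not how infer implements things internally. Each replicate just resamples the rows with replacement and refits the same linear model.

```r
# Toy data, made up for illustration (not the stocks data)
set.seed(2)
toy <- data.frame(x = runif(30, 150, 180))
toy$y <- 5 + 0.4 * toy$x + rnorm(30, sd = 2)

# One bootstrap replicate: resample rows with replacement, then refit
# (fit_one_resample is a hypothetical helper, not part of any package)
fit_one_resample <- function(data) {
  rows <- sample(nrow(data), replace = TRUE)
  coef(lm(y ~ x, data = data[rows, ]))
}

# Repeat the same old task many times
boot_coefs <- t(replicate(500, fit_one_resample(toy)))

dim(boot_coefs)      # one row per replicate, one column per coefficient
head(boot_coefs, 3)  # first few intercept and slope estimates
```

Every row of boot_coefs is an ordinary lm() fit; the only new ingredient is the resampling step.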

Fill in the blank: Compute 90% confidence intervals for the slope and the intercept using the get_confidence_interval command.

ci_90 <- get_confidence_interval(
  _____,
  point_estimate = ______,
  level = ______,
  type = ______
)
ci_90

Demo: Examine the code below. Here, we are verifying that we get the same numbers when we manually calculate the quantiles of the bootstrap estimates for each term using summarize and quantile. Pay attention to the grouping.

bstrap_fits |>
  ungroup() |>
  group_by(term) |>
  summarize(
    lower_ci = quantile(estimate, 0.05),
    upper_ci = quantile(estimate, 0.95)
  )

Note

Same point as before. There’s no magic here. get_confidence_interval is just a convenient way of doing something that you already knew how to do.

BONUS: You can visualize the confidence interval:

visualize(bstrap_fits) + 
  shade_confidence_interval(ci_90)

Hypothesis Testing

Let’s consider the hypotheses:

\[ H_0:\beta_1=0\quad \text{vs.}\quad H_A: \beta_1\neq 0. \]

The null hypothesis corresponds to the claim that Microsoft and Apple stock prices are uncorrelated. Probably bogus.

Fill in the blank: Simulate and plot the null distribution for the slope:

set.seed(20241118)
null_dist <- stocks_jan2020 |>
  specify(______ ~ ______) |>
  hypothesize(null = _______) |>
  generate(reps = 500, type = _______) |>
  fit()

null_dist |>
  filter(_______ == _______) |>
  ggplot(aes(x = ______)) + 
  geom_histogram()

Fill in the blank: Add a vertical line to your plot indicating the point estimate of the slope from your original data analysis:

visualize(null_dist) +
  shade_p_value(obs_stat = _______, direction = _______)

Compute the \(p\)-value for this test and interpret it:

null_dist |>
  get_p_value(obs_stat = observed_fit, direction = "two-sided")

Add interpretation here!
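If the p-value feels like a black box, the sketch below computes one by hand in base R on simulated data. The toy data frame, its columns x and y, and the helper slope() are made up for illustration. It builds a null distribution by shuffling the response and refitting, then takes the share of shuffled slopes at least as far from zero as the observed slope. This is one common definition of a two-sided p-value and is only an approximation of the idea; get_p_value may use a slightly different rule for the two-sided case, so the numbers need not match exactly.

```r
# Toy data, made up for illustration (not the stocks data)
set.seed(3)
toy <- data.frame(x = runif(30, 150, 180))
toy$y <- 5 + 0.4 * toy$x + rnorm(30, sd = 2)

# Hypothetical helper: slope of the least squares line for y ~ x
slope <- function(data) unname(coef(lm(y ~ x, data = data))["x"])

obs_slope <- slope(toy)

# Null distribution: shuffle y to break any link with x, then refit
null_slopes <- replicate(500, {
  shuffled <- toy
  shuffled$y <- sample(toy$y)
  slope(shuffled)
})

# Two-sided p-value: share of null slopes at least as far from 0
# as the observed slope
p_value <- mean(abs(null_slopes) >= abs(obs_slope))
p_value
```

A small p-value here means that slopes as extreme as the observed one rarely show up when x and y are unrelated.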

New Question:

Are Apple opening prices different, on average, in the early week (Monday and Tuesday) than in the late week (Wednesday through Friday)?

Run this code to create a part of week variable!

stocks <- stocks |>
  mutate(
    part_of_week = if_else(
      weekdays(date) %in% c("Monday", "Tuesday"), "Early", "Late"
    )
  )
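One caveat worth knowing: weekdays() returns day names in the current locale, so comparing against "Monday" and "Tuesday" assumes an English locale. The sketch below shows a locale-independent way to build the same kind of label, using a handful of hypothetical dates from one week in January 2020 rather than the stocks data. It relies on format() with "%u", which gives the weekday as a number with Monday as 1.

```r
# A few hypothetical dates from one trading week
# (Monday 2020-01-06 through Friday 2020-01-10)
dates <- as.Date("2020-01-06") + 0:4

# "%u" gives the weekday number as text: "1" = Monday, ..., "7" = Sunday
weekday_num <- as.integer(format(dates, "%u"))

# Monday and Tuesday are "Early"; everything else is "Late"
part_of_week <- ifelse(weekday_num <= 2, "Early", "Late")

data.frame(dates, weekday_num, part_of_week)
```

Printing a small table like this is also a handy way to confirm that the labels land on the days you expect before using them in an analysis.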

Visualize:

Task: Create a visualization to represent Apple open stock prices in the early vs. late week.

# add code here

Point Estimate

Task: Use specify to get a point estimate of the difference in means between the two groups.

# add code here

Hypothesis Test

We want to test whether Apple opening prices early in the week are, on average, the same as opening prices later in the week.

Task: Write the hypothesis we want to test using proper notation.

Task: Write code to implement this test. What are your conclusions?

# add code here