Bootstrapping

STAT 20: Introduction to Probability and Statistics

Agenda

Announcements
Reading Questions
Break
Worksheet: Bootstrapping
Appendix

Announcements

Portfolio 5 due Friday at 5pm

Lab 3 due Wednesday at 12pm

Quiz 3 next Thursday

Reading Questions

Please put your laptops under your desk and your phones away.
Write your name and ID on your answer sheet, and bubble in Version “A”.
You may work only with those at your table!

Read this first.

From a population of size 1,000, we take a sample of size 100. We then create a bootstrapped percentile confidence interval for the population median using 10,000 statistics. Each of these statistics was generated from one bootstrap sample.

How large is each bootstrap sample?

A: 10000
B: 1000
C: 100
D: The correct answer is not listed here.

00:40

Suppose we change the sample size from 100 to something else. Which of the following sizes makes the bootstrap method for creating a confidence interval most prone to failure?

A: 200
B: 75
C: 50
D: 20

00:40

True or false: a bootstrap sample is taken with replacement.

A: True
B: False

00:30

We have obtained a confidence interval for the regression coefficient \(b_1\) in the model \(\hat{y} = b_0 + b_1x\) using the bootstrap. The interval for \(b_1\) does not contain 0. What does this suggest?

A: In the population, there is a linear association between \(x\) and \(y\).
B: Within our table, there is a linear association between \(x\) and \(y\).
C: In the population, there is no linear association between \(x\) and \(y\).
D: Within our table, there is no linear association between \(x\) and \(y\).
E: In the population, there is a causal link between \(x\) and \(y\).

00:30

Break

05:00

Worksheet: Bootstrapping

40:00

Appendix

Concept Questions

Which of these is a valid bootstrap sample?

01:00

Original Sample
name	species	length
Gus	Chinstrap	50.7
Luz	Gentoo	48.5
Ida	Chinstrap	52.8
Ola	Gentoo	44.5
Abe	Adelie	42.0

BS A
name	species	length
Ida	Chinstrap	52.8
Luz	Gentoo	48.5
Abe	Adelie	42.0
Ola	Gentoo	44.5
Ida	Chinstrap	52.8

BS B
name	species	length
Ola	Gentoo	44.5
Gus	Chinstrap	50.7
Ida	Chinstrap	52.8
Luz	Gentoo	48.5
Gus	Chinstrap	50.7
Gus	Chinstrap	50.7

BS C
name	species	length
Gus	Chinstrap	50.7
Ola	Gentoo	48.5
Ola	Chinstrap	52.8
Ida	Gentoo	44.5
Ida	Adelie	42.0

BS D
name	species	length
Gus	Chinstrap	50.7
Abe	Adelie	42.0
Gus	Chinstrap	50.7
Gus	Chinstrap	50.7
Gus	Chinstrap	50.7

The Bootstrap

Parameters and Statistics

Our Goal: Assess the sampling variability in our estimate of the median year at Cal and the proportion of students in an econ-related field.

Our Tool: The Bootstrap

Collecting a sample of data

If you’ve been given an index card, please write on it:

Your first name
Your year at Cal (1 is first year, 2 is second year, etc)
Whether or not you are interested in majoring in a business- or econ-related field. 1 = yes, 0 = no

boardwork

Collect index cards from students and record the data into a data frame on the board labelled “Observed sample”. Calculate the sample median and sample proportion of econ-related majors.

Ask for a volunteer to generate the first bootstrap sample. Hand them the stack of cards and have them randomly choose a single card and read off the data to you. As they do so, write out the first row of a “Bootstrap Sample 1” data frame on the board. Be sure to label the row with the student's name; that helps emphasize when there are repeats. Have them return the card to the deck, shuffle, and randomly choose another card and read off the data. Repeat until you have filled out the same number of rows as in the original data set. Calculate the median and proportion (you may want to write dplyr code to do this using summarize()).
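The card-drawing procedure above amounts to sampling rows with replacement until the new data frame has as many rows as the original. A minimal dplyr sketch, assuming a made-up observed sample (the names and values below are hypothetical, not class data, and the seed is arbitrary):

```r
library(dplyr)

# Hypothetical "Observed sample": names and values are invented for illustration.
observed_sample <- tibble(
  name = c("Ana", "Ben", "Cy", "Dee", "Eli", "Fay"),
  year = c(1, 2, 2, 3, 1, 4),
  econ = c(1, 0, 0, 1, 0, 1)
)

set.seed(20)  # arbitrary seed, only so the draw can be reproduced

# Draw a card, record it, return it, shuffle, repeat:
# n draws with replacement, where n is the original number of rows.
bootstrap_sample_1 <- observed_sample |>
  slice_sample(n = nrow(observed_sample), replace = TRUE)

bootstrap_sample_1

# Statistics for this bootstrap sample
bootstrap_sample_1 |>
  summarize(median_year = median(year),
            p_econ = mean(econ))
```

Because the draws are made with replacement, some names will typically appear more than once and others not at all.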

Ask for a second volunteer to generate the second bootstrap sample. Repeat the procedure as before, drawing a third data frame on the board and computing a second set of statistics (median and proportion).

Collect the bootstrapped medians and proportions and sketch them as the first few points in a broader density plot that will fill in as we take more and more bootstrap samples. Label this as the “bootstrap distribution” and speak of it as an approximation to the true sampling distribution. You can explain the \(1 - \alpha\) bootstrap interval, here with \(\alpha = 0.05\), as the interval that captures the middle 95% of bootstrapped statistics.
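The step from a few bootstrapped statistics to a full bootstrap distribution and a percentile interval can be sketched in base R. The econ indicators below are invented for illustration, and the choice of 1,000 replicates and the seed are arbitrary assumptions:

```r
# Hypothetical econ indicators (1 = yes, 0 = no) from 10 index cards; values are invented.
econ <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0)

set.seed(20)  # arbitrary seed for reproducibility

# Each replicate: resample the cards with replacement, then compute the proportion.
boot_props <- replicate(
  1000,
  mean(sample(econ, size = length(econ), replace = TRUE))
)

# The 1 - alpha percentile interval keeps the middle 1 - alpha of the bootstrapped statistics.
alpha <- 0.05
boot_ci <- quantile(boot_props, probs = c(alpha / 2, 1 - alpha / 2))
boot_ci
```

With \(\alpha = 0.05\), the two cutoffs are the 0.025 and 0.975 quantiles, so 2.5% of the bootstrapped proportions fall below the interval and 2.5% above it.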

Bootstrapping with Infer

Example: Penguins

Let’s consider our 344 penguins to be an SRS from the broader population of Antarctic penguins. What are a point estimate and an interval estimate for the population proportion of penguins that are Adelie?

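library(tidyverse)  # provides mutate(), summarize(), and ggplot()

# Flag each penguin as Adelie (TRUE) or not (FALSE)
penguins <- penguins |>
  mutate(is_adelie = species == "Adelie")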

penguins |>
  ggplot(aes(x = is_adelie)) +
  geom_bar()

Point estimate

obs_stat <- penguins |>
  summarize(p_adelie = mean(is_adelie))
obs_stat

# A tibble: 1 × 1
  p_adelie
     <dbl>
1    0.442

Generating one bootstrap sample

library(infer)
penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")

Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 FALSE    
 3         1 TRUE     
 4         1 FALSE    
 5         1 TRUE     
 6         1 TRUE     
 7         1 FALSE    
 8         1 TRUE     
 9         1 TRUE     
10         1 TRUE     
# ℹ 334 more rows
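To see what sampling with replacement does at this scale, the sketch below resamples row numbers rather than penguins. It assumes only the sample size of 344 from above; the seed and variable names are illustrative, and this is not a description of how infer is implemented internally.

```r
n <- 344      # size of the original sample
set.seed(20)  # arbitrary seed for reproducibility

# One bootstrap sample, expressed as the row numbers that were drawn
rows_drawn <- sample(seq_len(n), size = n, replace = TRUE)

length(rows_drawn)            # 344: same size as the original sample
length(unique(rows_drawn))    # number of distinct original rows that were drawn
sum(duplicated(rows_drawn))   # number of draws that repeat an earlier row
```

The bootstrap sample always has 344 rows, but some original rows show up several times while others are left out entirely.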

Two more bootstrap samples

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")

Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 FALSE    
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")

Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 TRUE     
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows

Visualizing 9 bootstrap samples

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 9, 
           type = "bootstrap") |>
  ggplot(aes(x = is_adelie)) +
  geom_bar() +
  facet_wrap(vars(replicate),
             nrow = 3)

Calculating 9 \(\hat{p}\)

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 9, 
           type = "bootstrap") |>
  calculate(stat = "prop")

Response: is_adelie (factor)
# A tibble: 9 × 2
  replicate  stat
      <int> <dbl>
1         1 0.404
2         2 0.430
3         3 0.404
4         4 0.433
5         5 0.468
6         6 0.448
7         7 0.427
8         8 0.413
9         9 0.474

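Note the change in data frame size: calculate() collapses each 344-row replicate into a single statistic, so the result has 9 rows instead of 9 × 344.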
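The size change can be reproduced with plain dplyr. This sketch uses a stand-in column with 152 TRUE values out of 344 (chosen to match the observed proportion of 0.442) in place of the real penguins data, and a hypothetical helper, one_bootstrap(). It imitates the generate() and calculate() steps rather than calling infer.

```r
library(dplyr)

# Stand-in data: 152 of 344 values are TRUE, matching the observed proportion 0.442
stand_in <- tibble(is_adelie = rep(c(TRUE, FALSE), times = c(152, 192)))

# Hypothetical helper: one bootstrap sample, tagged with its replicate number
one_bootstrap <- function(r) {
  stand_in |>
    slice_sample(prop = 1, replace = TRUE) |>
    mutate(replicate = r)
}

set.seed(20)  # arbitrary seed for reproducibility
boot_samples <- bind_rows(lapply(1:9, one_bootstrap))
dim(boot_samples)   # 3096 rows (9 x 344), 2 columns

# One proportion per replicate
boot_stats <- boot_samples |>
  group_by(replicate) |>
  summarize(stat = mean(is_adelie))
dim(boot_stats)     # 9 rows (one per replicate), 2 columns
```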

The bootstrap distribution (reps = 500)

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 500, 
           type = "bootstrap") |>
  calculate(stat = "prop") |>
  ggplot(aes(x = stat)) +
  geom_histogram()

Interval Estimate

We can extract the middle 95% by identifying the .025 quantile and the .975 quantile of the bootstrap distribution with get_ci().

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 500, 
           type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = .95)

# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.392    0.494
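As a check on what "the middle 95%" means, the same cutoffs can be computed directly with quantile(). This sketch assumes a stand-in column (152 TRUE values out of 344, matching the observed proportion of 0.442) instead of the real penguins data, and it relies on get_ci() using the percentile method by default; under that assumption the two results should agree.

```r
library(dplyr)
library(infer)

# Stand-in data: 152 of 344 values are TRUE, matching the observed proportion 0.442
stand_in <- tibble(is_adelie = rep(c(TRUE, FALSE), times = c(152, 192)))

set.seed(20)  # arbitrary seed for reproducibility

# Save the bootstrap distribution so both intervals use the same statistics
boot_dist <- stand_in |>
  specify(response = is_adelie, success = "TRUE") |>
  generate(reps = 500, type = "bootstrap") |>
  calculate(stat = "prop")

# Percentile interval from infer
ci_infer <- boot_dist |>
  get_ci(level = .95)

# The same cutoffs computed by hand from the bootstrapped statistics
ci_manual <- boot_dist |>
  summarize(lower_ci = quantile(stat, 0.025, names = FALSE),
            upper_ci = quantile(stat, 0.975, names = FALSE))

ci_infer
ci_manual
```

Saving the bootstrap distribution in boot_dist matters here: calling generate() again would draw new bootstrap samples and give slightly different endpoints.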

Documentation: `infer.tidymodels.org`

Your Turn

Create a 95% confidence interval for the median bill length of penguins.