Bootstrapping

STAT 20: Introduction to Probability and Statistics

Agenda

  • Announcements
  • Reading Questions
  • Break
  • Worksheet: Bootstrapping
  • Appendix

Announcements

  • Portfolio 5 due Friday at 5pm
  • Lab 3 due Wednesday at 12pm
  • Quiz 3 next Thursday

Reading Questions

  • Please put your laptops under your desk and your phones away.
  • Write your name and ID on your answer sheet, and bubble in Version “A”.
  • You may work only with those at your table!

Read this first.

From a population of size 1,000, we take a sample of size 100. We then create a bootstrapped percentile confidence interval for the population median using 10,000 statistics. Each of these statistics was generated from one bootstrap sample.

How large is each bootstrap sample?

  • A: 10,000
  • B: 1,000
  • C: 100
  • D: The correct answer is not listed here.
00:40

Suppose we change the sample size from 100 to something else. Which of the following sizes makes the bootstrap method for creating a confidence interval most prone to failure?

  • A: 200
  • B: 75
  • C: 50
  • D: 20
00:40

True or false: a bootstrap sample is taken with replacement.

  • A: True
  • B: False
00:30

We have obtained a confidence interval for the regression coefficient \(b_1\) in the model \(\hat{y} = b_0 + b_1x\) using the bootstrap. The interval for \(b_1\) does not contain 0. What does this suggest?

  • A: In the population, there is a linear association between \(x\) and \(y\).
  • B: Within our table, there is a linear association between \(x\) and \(y\).
  • C: In the population, there is no linear association between \(x\) and \(y\).
  • D: Within our table, there is no linear association between \(x\) and \(y\).
  • E: In the population, there is a causal link between \(x\) and \(y\).
00:30

Break

05:00

Worksheet: Bootstrapping

40:00

Appendix

Concept Questions

Which of these is a valid bootstrap sample?

01:00




Original Sample

name  species    length
Gus   Chinstrap  50.7
Luz   Gentoo     48.5
Ida   Chinstrap  52.8
Ola   Gentoo     44.5
Abe   Adelie     42.0

BS A

name  species    length
Ida   Chinstrap  52.8
Luz   Gentoo     48.5
Abe   Adelie     42.0
Ola   Gentoo     44.5
Ida   Chinstrap  52.8

BS B

name  species    length
Ola   Gentoo     44.5
Gus   Chinstrap  50.7
Ida   Chinstrap  52.8
Luz   Gentoo     48.5
Gus   Chinstrap  50.7
Gus   Chinstrap  50.7

BS C

name  species    length
Gus   Chinstrap  50.7
Ola   Gentoo     48.5
Ola   Chinstrap  52.8
Ida   Gentoo     44.5
Ida   Adelie     42.0

BS D

name  species    length
Gus   Chinstrap  50.7
Abe   Adelie     42.0
Gus   Chinstrap  50.7
Gus   Chinstrap  50.7
Gus   Chinstrap  50.7
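
For reference, the resampling step itself can be sketched in a few lines of base R. This is only an illustration: the data frame name original, the object names idx and bs_sample, and the seed are introduced here and are not part of the slides.

```r
# Hypothetical data frame holding the five penguins from the original sample
original <- data.frame(
  name    = c("Gus", "Luz", "Ida", "Ola", "Abe"),
  species = c("Chinstrap", "Gentoo", "Chinstrap", "Gentoo", "Adelie"),
  length  = c(50.7, 48.5, 52.8, 44.5, 42.0)
)

set.seed(20)  # arbitrary seed so the draw is reproducible

# Draw n row indices with replacement, where n is the original sample size
n <- nrow(original)
idx <- sample(n, size = n, replace = TRUE)

# Select whole rows, so each penguin's name, species and length stay together
bs_sample <- original[idx, ]
bs_sample
```

Because whole rows are drawn with replacement, some penguins may appear more than once and others not at all, while the number of rows stays equal to the original sample size.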

The Bootstrap

Parameters and Statistics


Our Goal: Assess the sampling variability in our estimate of the median year at Cal and the proportion of students in an econ-related field.


Our Tool: The Bootstrap

Collecting a sample of data

If you’ve been given an index card, please write on it:

  1. Your first name
  2. Your year at Cal (1 is first year, 2 is second year, etc.)
  3. Whether or not you are interested in majoring in a business- or econ-related field. 1 = yes, 0 = no

boardwork

Bootstrapping with Infer

Example: Penguins

Let’s consider our 344 penguins to be a simple random sample (SRS) from the broader population of Antarctic penguins. What are a point estimate and an interval estimate for the population proportion of penguins that are Adelie?


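library(tidyverse)
library(palmerpenguins)  # assumed source of the 344-row penguins data

# Flag each penguin as Adelie (TRUE) or not (FALSE)
penguins <- penguins |>
  mutate(is_adelie = species == "Adelie")

# Bar chart of Adelie vs. non-Adelie counts
penguins |>
  ggplot(aes(x = is_adelie)) +
  geom_bar()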




Point estimate

obs_stat <- penguins |>
  summarize(p_adelie = mean(is_adelie))
obs_stat
# A tibble: 1 × 1
  p_adelie
     <dbl>
1    0.442

Generating one bootstrap sample

library(infer)
penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 FALSE    
 3         1 TRUE     
 4         1 FALSE    
 5         1 TRUE     
 6         1 TRUE     
 7         1 FALSE    
 8         1 TRUE     
 9         1 TRUE     
10         1 TRUE     
# ℹ 334 more rows

Two more bootstrap samples

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 FALSE    
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 TRUE     
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows

Visualizing 9 bootstrap samples

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 9, 
           type = "bootstrap") |>
  ggplot(aes(x = is_adelie)) +
  geom_bar() +
  facet_wrap(vars(replicate),
             nrow = 3)

Calculating 9 \(\hat{p}\)

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 9, 
           type = "bootstrap") |>
  calculate(stat = "prop")
Response: is_adelie (factor)
# A tibble: 9 × 2
  replicate  stat
      <int> <dbl>
1         1 0.404
2         2 0.430
3         3 0.404
4         4 0.433
5         5 0.468
6         6 0.448
7         7 0.427
8         8 0.413
9         9 0.474

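Note the change in data frame size: calculate() collapses the 9 × 344 = 3,096 resampled rows into 9 rows, one \(\hat{p}\) per replicate.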
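
To see what this step amounts to, here is a rough base R sketch of the same idea, not a description of how infer is implemented. It assumes a stand-in logical vector with 152 TRUE values out of 344, which reproduces the 0.442 point estimate; the helper one_p_hat and the seed are hypothetical.

```r
set.seed(20)  # arbitrary seed for reproducibility

# Stand-in for the is_adelie column (assumption: 152 TRUE, 192 FALSE)
is_adelie <- rep(c(TRUE, FALSE), times = c(152, 192))

# One replicate: resample all 344 values with replacement,
# then take the proportion of TRUE values
one_p_hat <- function(x) {
  mean(sample(x, size = length(x), replace = TRUE))
}

# Nine replicates give nine values of p-hat, one per bootstrap sample
p_hats <- replicate(9, one_p_hat(is_adelie))
p_hats
```

Each bootstrap sample has 344 values, but only one number per sample is kept, which is why the result has 9 entries.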

The bootstrap distribution (reps = 500)

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 500, 
           type = "bootstrap") |>
  calculate(stat = "prop") |>
  ggplot(aes(x = stat)) +
  geom_histogram()

Interval Estimate

We can extract the middle 95% by identifying the .025 quantile and the .975 quantile of the bootstrap distribution with get_ci().

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 500, 
           type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = .95)
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.392    0.494
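
The percentile idea can also be sketched by hand in base R. This is a minimal illustration under the same assumption as before (a stand-in vector with 152 TRUE values out of 344); the object names and seed are hypothetical, and the endpoints will vary from run to run and need not match the output above exactly.

```r
set.seed(20)  # arbitrary seed for reproducibility

# Stand-in for the is_adelie column (assumption: 152 TRUE, 192 FALSE)
is_adelie <- rep(c(TRUE, FALSE), times = c(152, 192))

# 500 bootstrap proportions: resample with replacement, take the mean each time
boot_stats <- replicate(500, mean(sample(is_adelie, replace = TRUE)))

# Middle 95% of the bootstrap distribution: the 0.025 and 0.975 quantiles
ci <- quantile(boot_stats, probs = c(0.025, 0.975))
ci
```

The two quantiles cut off 2.5% of the bootstrap statistics in each tail, leaving the middle 95% as the interval.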

Documentation: infer.tidymodels.org

Your Turn

Create a 95% confidence interval for the median bill length of penguins.

05:00