
STAT 20: Introduction to Probability and Statistics

While you’re waiting

If you’ve been given an index card, please write on it:

  1. Your first name
  2. Your year at Cal (1 is first year, 2 is second year, etc.)
  3. Whether or not you are interested in majoring in a business- or econ-related field. 1 = yes, 0 = no


  • Concept Question
  • Activity: The Bootstrap
  • PS: The Bootstrap
  • Bootstrapping with infer

Concept Question

Which of these is a valid bootstrap sample?


Original Sample

name  species    length
Gus   Chinstrap  50.7
Luz   Gentoo     48.5
Ida   Chinstrap  52.8
Ola   Gentoo     44.5
Abe   Adelie     42.0

Candidate 1

name  species    length
Ida   Chinstrap  52.8
Luz   Gentoo     48.5
Abe   Adelie     42.0
Ola   Gentoo     44.5
Ida   Chinstrap  52.8

Candidate 2

name  species    length
Ola   Gentoo     44.5
Gus   Chinstrap  50.7
Ida   Chinstrap  52.8
Luz   Gentoo     48.5
Gus   Chinstrap  50.7
Gus   Chinstrap  50.7

Candidate 3

name  species    length
Gus   Chinstrap  50.7
Ola   Gentoo     48.5
Ola   Chinstrap  52.8
Ida   Gentoo     44.5
Ida   Adelie     42.0

Candidate 4

name  species    length
Gus   Chinstrap  50.7
Abe   Adelie     42.0
Gus   Chinstrap  50.7
Gus   Chinstrap  50.7
Gus   Chinstrap  50.7
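
For reference, here is a minimal base R sketch of how one bootstrap sample could be drawn from the original five-penguin sample. The object names (original, idx, boot_sample) and the seed are illustrative choices, not part of the course materials. The sketch draws as many rows as the original sample has, draws them with replacement, and keeps each row intact.

```r
# The five-penguin original sample from the table above.
original <- data.frame(
  name    = c("Gus", "Luz", "Ida", "Ola", "Abe"),
  species = c("Chinstrap", "Gentoo", "Chinstrap", "Gentoo", "Adelie"),
  length  = c(50.7, 48.5, 52.8, 44.5, 42.0)
)

set.seed(20)  # arbitrary seed, only so the draw is reproducible

# Draw n row indices with replacement, where n is the original sample size.
idx <- sample(nrow(original), size = nrow(original), replace = TRUE)

# Index whole rows so each name stays paired with its own species and length.
boot_sample <- original[idx, ]
rownames(boot_sample) <- NULL
boot_sample
```

Because rows are drawn with replacement, some penguins may appear more than once and others not at all.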

The Bootstrap

Parameters and Statistics

Our Goal: Assess the sampling error / variability in our estimate of the median year at Cal and the proportion of students in an econ-related field.

Our Tool: The Bootstrap
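
As a rough sketch of where the activity is headed, the code below bootstraps both statistics by hand in base R. The cards data frame is made up purely for illustration (it is not real class data), and the number of replicates and the seed are arbitrary assumptions.

```r
# Hypothetical index-card responses (made up for illustration, not real class data).
cards <- data.frame(
  year = c(1, 1, 2, 1, 3, 2, 1, 4, 2, 1),
  econ = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
)

set.seed(20)  # arbitrary seed for reproducibility
reps <- 1000  # number of bootstrap samples; an arbitrary choice

# For each replicate: resample whole cards with replacement,
# then recompute both statistics on the resampled cards.
boot_stats <- t(replicate(reps, {
  resampled <- cards[sample(nrow(cards), replace = TRUE), ]
  c(median_year = median(resampled$year), p_econ = mean(resampled$econ))
}))

# The spread of each column estimates the sampling variability of that statistic.
apply(boot_stats, 2, sd)
```

Each row of boot_stats holds the two statistics from one bootstrap sample, so each column is a bootstrap distribution.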

Collecting a sample of data

If you’ve been given an index card, please write on it:

  1. Your first name
  2. Your year at Cal (1 is first year, 2 is second year, etc.)
  3. Whether or not you are interested in majoring in a business- or econ-related field. 1 = yes, 0 = no


Problem Set


Bootstrapping with infer

Example: Penguins

Let’s consider our 344 penguins to be an SRS from the broader population of Antarctic penguins. What are a point estimate and an interval estimate for the population proportion of penguins that are Adelie?

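library(tidyverse)
library(infer)
library(palmerpenguins)  # assumed source of the 344-row penguins data

penguins <- penguins |>
  mutate(is_adelie = species == "Adelie")

penguins |>
  ggplot(aes(x = is_adelie)) +
  geom_bar()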

Point estimate

obs_stat <- penguins |>
  summarize(p_adelie = mean(is_adelie))
obs_stat
# A tibble: 1 × 1
  p_adelie
     <dbl>
1    0.442

Generating one bootstrap sample

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 FALSE    
 3         1 TRUE     
 4         1 FALSE    
 5         1 TRUE     
 6         1 TRUE     
 7         1 FALSE    
 8         1 TRUE     
 9         1 TRUE     
10         1 TRUE     
# ℹ 334 more rows

Two more bootstrap samples

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 FALSE    
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 1, 
           type = "bootstrap")
Response: is_adelie (factor)
# A tibble: 344 × 2
# Groups:   replicate [1]
   replicate is_adelie
       <int> <fct>    
 1         1 FALSE    
 2         1 TRUE     
 3         1 TRUE     
 4         1 FALSE    
 5         1 FALSE    
 6         1 TRUE     
 7         1 TRUE     
 8         1 FALSE    
 9         1 FALSE    
10         1 FALSE    
# ℹ 334 more rows

Visualizing 9 bootstrap samples

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 9, 
           type = "bootstrap") |>
  ggplot(aes(x = is_adelie)) +
  geom_bar() +
  facet_wrap(vars(replicate),
             nrow = 3)

Calculating 9 \(\hat{p}\)

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 9, 
           type = "bootstrap") |>
  calculate(stat = "prop")
Response: is_adelie (factor)
# A tibble: 9 × 2
  replicate  stat
      <int> <dbl>
1         1 0.404
2         2 0.430
3         3 0.404
4         4 0.433
5         5 0.468
6         6 0.448
7         7 0.427
8         8 0.413
9         9 0.474

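Note the change in data frame size: generate() returns one row per resampled penguin (9 × 344 = 3,096 rows), while calculate() collapses each replicate to a single proportion, leaving 9 rows.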
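
To make the size change concrete, here is a hedged base R sketch that mimics the generate-then-calculate steps by hand. It does not use infer itself. The count of 152 Adelie penguins is an assumption inferred from 0.442 × 344, and the object names and seed are illustrative.

```r
set.seed(20)  # arbitrary seed for reproducibility

# Stand-in for penguins$is_adelie: 152 TRUE out of 344 (assumed from 0.442 * 344).
is_adelie <- rep(c(TRUE, FALSE), times = c(152, 192))
n <- length(is_adelie)
reps <- 9

# "Generate" step: stack 9 resamples of size n into one long data frame.
boot_long <- data.frame(
  replicate = rep(seq_len(reps), each = n),
  is_adelie = unlist(lapply(seq_len(reps), function(i) {
    sample(is_adelie, size = n, replace = TRUE)
  }))
)
nrow(boot_long)   # 9 * 344 = 3096 rows

# "Calculate" step: one proportion per replicate.
boot_stats <- aggregate(is_adelie ~ replicate, data = boot_long, FUN = mean)
names(boot_stats)[2] <- "stat"
nrow(boot_stats)  # 9 rows
```

The long data frame has one row per resampled penguin, while the summary has one row per bootstrap sample.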

The bootstrap distribution (reps = 500)

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 500, 
           type = "bootstrap") |>
  calculate(stat = "prop") |>
  ggplot(aes(x = stat)) +
  geom_histogram()

Interval Estimate

We can extract the middle 95% by identifying the .025 quantile and the .975 quantile of the bootstrap distribution with get_ci().

penguins |>
  specify(response = is_adelie,
          success = "TRUE") |>
  generate(reps = 500, 
           type = "bootstrap") |>
  calculate(stat = "prop") |>
  get_ci(level = .95)
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1    0.392    0.494
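
The percentile idea can also be sketched by hand in base R, which shows what "the middle 95%" means without relying on infer. The count of 152 Adelie penguins is an assumption inferred from 0.442 × 344. The seed and object names are illustrative, and the endpoints will vary slightly from run to run and from the output above.

```r
set.seed(20)  # arbitrary seed for reproducibility

# Stand-in for penguins$is_adelie: 152 TRUE out of 344 (assumed from 0.442 * 344).
is_adelie <- rep(c(TRUE, FALSE), times = c(152, 192))

# 500 bootstrap proportions: resample with replacement, then take the mean.
boot_props <- replicate(500, mean(sample(is_adelie, replace = TRUE)))

# Percentile interval: the 2.5th and 97.5th percentiles of the bootstrap distribution.
ci <- quantile(boot_props, probs = c(0.025, 0.975))
ci
```

Changing probs to c(0.05, 0.95) would give a 90% interval by the same logic.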


Your Turn

Create a 95% confidence interval for the median bill length of penguins.
