Conditioning

STAT 20: Introduction to Probability and Statistics

Agenda

  • Announcements
  • Reading Questions: Conditioning
  • Break
  • Worksheet: Conditioning
  • Break
  • Lab 2: Flights

Announcements

  • You are allowed one single-sided, handwritten cheat sheet for the Quiz on Thursday.
  • Portfolio 2 due Thursday, not Friday, at 8pm.

Reading Questions

  • Please put your laptops under your desk and your phones away.
  • Write your name, ID, and bubble in Version “A” on your answer sheet.
  • You may work only with those at your table!

Which of the following pieces of code will not cause an error?

  • A
penguins 
  |> mutate(bill_size = bill_len * bill_dep) 
  |> select(bill_size)
  • B
penguins |> 
  mutate(bill_size = bill_len * bill_dep) |>
  select(bill_size)
  • C
penguins |> 
  mutate(penguins, bill_size = bill_len * bill_dep) |>
  select(penguins, bill_size)
00:45

Which option describes a filter operation?

  • A: Subsetting the rows of a data frame according to their position.

  • B: Subsetting the columns of a data frame based on their names.

  • C: Subsetting the rows of a data frame based on their values of particular variables.

00:30

What will the following command return?

mean(c(TRUE, TRUE, TRUE, FALSE))

  • A: TRUE

  • B: FALSE

  • C: An error will be produced

  • D: 0.75

  • E: 0.25

00:40

Which of the following lines of code correctly extracts the rows of the penguins data frame whose species is Adelie or Chinstrap?

  • A: filter(penguins, species %in% c("Adelie", "Chinstrap"))

  • B: filter(penguins, species == c("Adelie", "Chinstrap"))

  • C: filter(penguins, species = c("Adelie", "Chinstrap"))

  • D: slice(penguins, species %in% c("Adelie", "Chinstrap"))

  • E: select(penguins, species %in% c("Adelie", "Chinstrap"))

00:30

Break

05:00

Worksheet: Conditioning

30:00

Break

05:00

Lab 2: Flights

30:00

Appendix - more practice!

c("fruit", "fruit", "vegetable") == "fruit"

What will this line of code return?

01:00

Evaluating equivalence, cont.

In R, this evaluation happens element-wise when operating on vectors.

c("fruit", "fruit", "vegetable") == "fruit"
[1]  TRUE  TRUE FALSE
c("fruit", "fruit", "vegetable") != "fruit"
[1] FALSE FALSE  TRUE
c("fruit", "vegetable", "boba") %in% c("fruit", "vegetable")
[1]  TRUE  TRUE FALSE
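
A minimal sketch of why == and %in% differ when the right-hand side has more than one element. The species vector below is hypothetical, made up for illustration. With ==, R recycles the shorter vector and compares position by position; with %in%, each element is checked for membership in the set.

```r
# Hypothetical species vector, made up for illustration.
species <- c("Chinstrap", "Adelie", "Gentoo", "Adelie")

# %in%: is each element a member of the set?
in_set <- species %in% c("Adelie", "Chinstrap")
in_set

# ==: the right-hand side is recycled to length 4 and compared pairwise:
# "Chinstrap" vs "Adelie", "Adelie" vs "Chinstrap",
# "Gentoo" vs "Adelie", "Adelie" vs "Chinstrap"
pairwise <- species == c("Adelie", "Chinstrap")
pairwise
```

Here in_set flags three of the four elements, while pairwise flags none of them, even though three elements belong to one of the two species.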

Question 2

class_survey |>
  filter(coding_exp_scale < 3,
         olympics %in% c("Ice skating", "Speed skating"),
         is_entrepreneur == TRUE)

Which observations will be included in the data frame returned by this code?

01:00

Question 3

Which data frame will have fewer rows?

# A
filter(class_survey, time_at_cal == "This is my first semester!")

# B
class_survey |>
  mutate(first_sem = (time_at_cal == "This is my first semester!")) |>
  filter(first_sem)
01:00

Building data pipelines

Consider the subset of students here:

class_survey |>
  filter(coding_exp_scale < 3,
         olympics %in% c("Ice skating", "Speed skating"),
         is_entrepreneur == TRUE)

How do we compute the average, among these students, of their estimated chance that class will be disrupted by a new COVID variant?

Let’s look at three different ways to answer this question.

Nesting

filter(class_survey, 
       coding_exp_scale < 3,
       olympics %in% c("Ice skating", "Speed skating"),
       is_entrepreneur == TRUE)

Nesting

select(filter(class_survey, 
       coding_exp_scale < 3,
       olympics %in% c("Ice skating", "Speed skating"),
       is_entrepreneur == TRUE),
       coding_exp_scale,
       olympics,
       is_entrepreneur,
       covid)

Nesting

summarize(select(filter(class_survey, 
       coding_exp_scale < 3,
       olympics %in% c("Ice skating", "Speed skating"),
       is_entrepreneur == TRUE),
       coding_exp_scale,
       olympics,
       is_entrepreneur,
       covid),
       covid_avg = mean(covid))
# A tibble: 1 × 1
  covid_avg
      <dbl>
1     0.428

Cons

  • Must be read from inside out
  • Hard to keep track of arguments

Pros

  • All in one line of code
  • Only refer to one data frame

Step-by-step

df1 <- filter(class_survey, 
              coding_exp_scale < 3,
              olympics %in% c("Ice skating", "Speed skating"),
              is_entrepreneur == TRUE)
df2 <- select(df1, 
              coding_exp_scale,
              olympics,
              is_entrepreneur,
              covid)
summarize(df2,
          covid_avg = mean(covid))

Cons

  • Have to repeat data frame names
  • Creates unnecessary objects

Pros

  • Stores intermediate objects
  • Can be read top to bottom

Using the pipe operator

class_survey |>
  filter(coding_exp_scale < 3,
         olympics %in% c("Ice skating", "Speed skating"),
         is_entrepreneur == TRUE) |>
  select(coding_exp_scale,
         olympics,
         is_entrepreneur,
         covid) |>
  summarize(covid_avg = mean(covid))

Cons

Pros

  • Can be read like an English paragraph
  • Only type the data once
  • No leftover objects
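
A small base R sketch of what the pipe does: the value on its left becomes the first argument of the function on its right, so a piped chain and the equivalent nested call give the same result. The vector x is made up for illustration, and the native pipe |> assumes R 4.1 or later.

```r
# Made-up numeric vector for illustration.
x <- c(1, 4, 9)

# Nested: read from the inside out.
nested <- sum(sqrt(x))

# Piped: read from left to right; x |> sqrt() means sqrt(x).
piped <- x |> sqrt() |> sum()

identical(nested, piped)
```

Because the pipe already supplies the first argument, the data frame should not be named again inside the piped function.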

Understanding your pipeline

It’s good practice to understand the output of each line of code by breaking the pipe.

class_survey |>
  select(covid) |>
  filter(time_at_cal == "It's my first time_at_cal.")
Error in `filter()`:
ℹ In argument: `time_at_cal == "It's my first time_at_cal."`.
Caused by error:
! object 'time_at_cal' not found
class_survey |>
  select(covid)
# A tibble: 816 × 1
   covid
   <dbl>
 1  0   
 2  0.5 
 3  0.6 
 4  0.7 
 5 NA   
 6  0.15
 7  0.7 
 8  0   
 9  0.8 
10 NA   
# ℹ 806 more rows
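
One possible fix, sketched on a hypothetical stand-in for class_survey (the toy_survey tibble and its values are made up): filter while the column still exists, then select. This assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical stand-in for class_survey; values are made up.
toy_survey <- tibble(
  time_at_cal = c("first", "second", "first"),
  covid = c(0.2, 0.5, 0.6)
)

# filter() runs while time_at_cal is still a column,
# then select() keeps only covid.
result <- toy_survey |>
  filter(time_at_cal == "first") |>
  select(covid)

result
```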

Concept Question

class_survey |> # A
  filter(coding_exp_scale < 3,
         olympics %in% c("Ice skating", 
                         "Speed skating"),
         is_entrepreneur == TRUE) |> # B
  select(coding_exp_scale,
         olympics,
         is_entrepreneur,
         covid) |> # C
  summarize(covid_avg = mean(covid)) # D
# note
dim(class_survey)
[1] 816  30

What are the dimensions (rows x columns) of the data frames output at each stage of this pipe?

01:00

summarize(class_survey,
          mean(time_at_cal == "I'm in my first year.", na.rm = TRUE))

What will this line of code return?

01:00

Boolean Algebra

Logical vectors have a dual representation as TRUE/FALSE and 1/0, so you can do arithmetic on logicals.

TRUE + TRUE
[1] 2
TRUE * TRUE
[1] 1

Taking the mean of a logical vector is equivalent to finding the proportion of its elements that are TRUE (i.e., the proportion of rows that meet the condition).
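
A short base R sketch of counting and taking proportions with a logical vector. The responses below are hypothetical, made up for illustration, and include one missing value to show what na.rm = TRUE changes.

```r
# Hypothetical responses, made up for illustration (not from class_survey).
time_at_cal <- c("first year", "second year", "first year", NA, "first year")

# Comparing against NA yields NA: TRUE FALSE TRUE NA TRUE
is_first_year <- time_at_cal == "first year"

n_true <- sum(is_first_year, na.rm = TRUE)      # number of TRUEs
prop_true <- mean(is_first_year, na.rm = TRUE)  # TRUEs / non-missing values
prop_na_kept <- mean(is_first_year)             # NA, since one value is missing
```

With na.rm = TRUE the proportion is taken over the non-missing responses only; without it, a single NA makes the whole mean NA.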

Worksheet: Conditioning

20:00

Break

05:00

Lab Part I: Flights

25:00