Conditioning

STAT 20: Introduction to Probability and Statistics

Agenda

Announcements
Reading Questions: Conditioning
Break
Worksheet: Conditioning
Break
Lab 2: Flights

Announcements

You are allowed one, one-sided, handwritten cheatsheet for the Quiz on Thursday.

Portfolio 2 due Thursday, not Friday, at 8pm.

Reading Questions

Please put your laptops under your desk and your phones away.
Write your name, ID, and bubble in Version “A” on your answer sheet.
You may work only with those at your table!

Which of the following pieces of code will not cause an error?

penguins 
  |> mutate(bill_size = bill_len * bill_dep) 
  |> select(bill_size)

penguins |> 
  mutate(bill_size = bill_len * bill_dep) |>
  select(bill_size)

penguins |> 
  mutate(penguins, bill_size = bill_len * bill_dep) |>
  select(penguins, bill_size)

00:45

Which option describes a filter operation?

A: Subsetting the rows of a data frame according to their position.
B: Subsetting the columns of a data frame based on their names.
C: Subsetting the rows of a data frame based on their values of particular variables.

00:30

What will the following command return?

mean(c(TRUE, TRUE, TRUE, FALSE))

A: TRUE
B: FALSE
C: An error will be produced
D: 0.75
E: 0.25

00:40

Which of the following lines of code correctly extracts rows from the `penguins` data frame that are of the desired species?

A: filter(penguins, species %in% c("Adelie", "Chinstrap"))
B: filter(penguins, species == c("Adelie", "Chinstrap"))
C: filter(penguins, species = c("Adelie", "Chinstrap"))
D: slice(penguins, species %in% c("Adelie", "Chinstrap"))
E: select(penguins, species %in% c("Adelie", "Chinstrap"))

00:30

Break

05:00

Worksheet: Conditioning

30:00

Break

05:00

Lab 2: Flights

30:00

Appendix - more practice!

c("fruit", "fruit", "vegetable") == "fruit"

What will this line of code return?

01:00

Evaluating equivalence, cont.

In R, this evaluation happens element-wise when operating on vectors.

c("fruit", "fruit", "vegetable") == "fruit"

[1]  TRUE  TRUE FALSE

c("fruit", "fruit", "vegetable") != "fruit"

[1] FALSE FALSE  TRUE

c("fruit", "vegetable", "boba") %in% c("fruit", "vegetable")

[1]  TRUE  TRUE FALSE
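A pitfall follows from the element-wise rule: when the right-hand side of `==` has more than one element, R recycles it and compares position by position instead of testing membership. Here is a small sketch of the difference, using a made-up `species` vector rather than the real `penguins` data:

```r
# Hypothetical vector for illustration only
species <- c("Adelie", "Chinstrap", "Chinstrap", "Adelie")

# `==` recycles the right-hand side, so the comparison is made pairwise
# against "Adelie", "Chinstrap", "Adelie", "Chinstrap"
eq_result <- species == c("Adelie", "Chinstrap")
eq_result
# [1]  TRUE  TRUE FALSE FALSE

# `%in%` asks, for each element, whether it appears anywhere in the set
in_result <- species %in% c("Adelie", "Chinstrap")
in_result
# [1] TRUE TRUE TRUE TRUE
```

Every element of `species` is one of the two values, yet `==` marks only half of them TRUE because the order of the values happens to matter.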

Question 2

class_survey |>
  filter(coding_exp_scale < 3,
         olympics %in% c("Ice skating", "Speed skating"),
         is_entrepreneur == TRUE)

Which observations will be included in the data frame returned by the code above?

01:00

Question 3

Which data frame will have fewer rows?

# A
filter(class_survey, time_at_cal == "This is my first semester!")

# B
class_survey |>
  mutate(first_sem = (time_at_cal == "This is my first semester!")) |>
  filter(first_sem)

01:00

Building data pipelines

Consider the subset of students here:

class_survey |>
  filter(coding_exp_scale < 3,
         olympics %in% c("Ice skating", "Speed skating"),
         is_entrepreneur == TRUE)

How do we compute the average of these students’ estimated chance that class will be disrupted by a new COVID variant?

Let’s look at three different ways to answer this question.

Nesting

filter(class_survey, 
       coding_exp_scale < 3,
       olympics %in% c("Ice skating", "Speed skating"),
       is_entrepreneur == TRUE)

Nesting

select(filter(class_survey, 
       coding_exp_scale < 3,
       olympics %in% c("Ice skating", "Speed skating"),
       is_entrepreneur == TRUE),
       coding_exp_scale,
       olympics,
       is_entrepreneur,
       covid)

Nesting

summarize(select(filter(class_survey, 
       coding_exp_scale < 3,
       olympics %in% c("Ice skating", "Speed skating"),
       is_entrepreneur == TRUE),
       coding_exp_scale,
       olympics,
       is_entrepreneur,
       covid),
       covid_avg = mean(covid))

# A tibble: 1 × 1
  covid_avg
      <dbl>
1     0.428

Nesting

summarize(select(filter(class_survey, 
       coding_exp_scale < 3,
       olympics %in% c("Ice skating", "Speed skating"),
       is_entrepreneur == TRUE),
       coding_exp_scale,
       olympics,
       is_entrepreneur,
       covid),
       covid_avg = mean(covid))

Cons

Must be read from inside out
Hard to keep track of arguments

Pros

All in one line of code
Only refer to one data frame

Step-by-step

df1 <- filter(class_survey, 
              coding_exp_scale < 3,
              olympics %in% c("Ice skating", "Speed skating"),
              is_entrepreneur == TRUE)
df2 <- select(df1, 
              coding_exp_scale,
              olympics,
              is_entrepreneur,
              covid)
summarize(df2,
          covid_avg = mean(covid))

Cons

Have to repeat data frame names
Creates unnecessary objects

Pros

Stores intermediate objects
Can be read top to bottom

Using the pipe operator

class_survey |>
  filter(coding_exp_scale < 3,
         olympics %in% c("Ice skating", "Speed skating"),
         is_entrepreneur == TRUE) |>
  select(coding_exp_scale,
         olympics,
         is_entrepreneur,
         covid) |>
  summarize(covid_avg = mean(covid))

Cons

Pros

Can be read like an English paragraph
Only type the data frame name once
No leftover objects
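The three styles differ only in how the code is written, not in what it computes. As a minimal sketch, assuming the dplyr package is installed and using a made-up `toy_survey` in place of `class_survey`, the same filter-then-summarize task can be written all three ways and the results compared:

```r
library(dplyr)

# Hypothetical stand-in for class_survey; the values are made up
toy_survey <- tibble(
  coding_exp_scale = c(1, 2, 5, 2),
  covid            = c(0.2, 0.6, 0.9, 0.4)
)

# Nested: read from the inside out
nested <- summarize(filter(toy_survey, coding_exp_scale < 3),
                    covid_avg = mean(covid))

# Step-by-step: one intermediate object per step
step1   <- filter(toy_survey, coding_exp_scale < 3)
stepped <- summarize(step1, covid_avg = mean(covid))

# Piped: x |> f(y) is evaluated as f(x, y)
piped <- toy_survey |>
  filter(coding_exp_scale < 3) |>
  summarize(covid_avg = mean(covid))

# All three approaches give the same one-row data frame
same_result <- isTRUE(all.equal(nested, stepped)) &&
  isTRUE(all.equal(stepped, piped))
same_result
```

The pipe simply passes the left-hand side in as the first argument of the next function, which is why the data frame name appears only once.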

Understanding your pipeline

It’s good practice to understand the output of each line of code by breaking the pipe and running it one step at a time.

class_survey |>
  select(covid) |>
  filter(time_at_cal == "It's my first time_at_cal.")

Error in `filter()`:
ℹ In argument: `time_at_cal == "It's my first time_at_cal."`.
Caused by error:
! object 'time_at_cal' not found

class_survey |>
  select(covid)

# A tibble: 816 × 1
   covid
   <dbl>
 1  0   
 2  0.5 
 3  0.6 
 4  0.7 
 5 NA   
 6  0.15
 7  0.7 
 8  0   
 9  0.8 
10 NA   
# ℹ 806 more rows
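Breaking the pipe shows why the error occurs: after `select(covid)` only the `covid` column remains, so `filter()` can no longer find `time_at_cal`. One possible repair is to filter first and select afterwards. This is sketched below on a made-up `toy_survey` (a hypothetical stand-in for `class_survey`), assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical stand-in for class_survey; the values are made up
toy_survey <- tibble(
  time_at_cal = c("first", "second", "first"),
  covid       = c(0.1, 0.5, 0.3)
)

# Breaking the pipe after select() shows that only `covid` is left
cols_after_select <- names(toy_survey |> select(covid))
cols_after_select

# Filtering while time_at_cal still exists, then selecting, works
fixed <- toy_survey |>
  filter(time_at_cal == "first") |>
  select(covid)
fixed
```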

Concept Question

class_survey |> # A
  filter(coding_exp_scale < 3,
         olympics %in% c("Ice skating", 
                         "Speed skating"),
         is_entrepreneur == TRUE) |> # B
  select(coding_exp_scale,
         olympics,
         is_entrepreneur,
         covid) |> # C
  summarize(covid_avg = mean(covid)) # D

# note
dim(class_survey)

[1] 816  30

What are the dimensions (rows x columns) of the data frames output at each stage of this pipe?

01:00

summarize(class_survey,
          mean(time_at_cal == "I'm in my first year.", na.rm = TRUE))

What will this line of code return?

01:00

Boolean Algebra

Logical vectors have a dual representation as TRUE/FALSE and 1/0, so you can do arithmetic on them accordingly.

TRUE + TRUE

[1] 2

TRUE * TRUE

[1] 1

Taking the mean of a logical vector is equivalent to finding the proportion of elements that are TRUE (i.e., the proportion of rows that meet the condition).
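The counting and proportion ideas can be checked directly in base R. This is a small sketch on a made-up logical vector; the name `met_condition` is only for illustration:

```r
# Made-up logical vector: did each observation meet the condition?
met_condition <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# sum() counts the TRUEs, because each TRUE is treated as 1
n_true <- sum(met_condition)

# mean() is that count divided by the length: the proportion of TRUEs
prop_true <- mean(met_condition)

n_true     # 3
prop_true  # 0.6
```

If the vector contains NA, `mean()` returns NA unless `na.rm = TRUE` is supplied, in which case the proportion is taken over the non-missing elements only.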

Worksheet: Conditioning

20:00

Break

05:00

Lab Part I: Flights

25:00
