Summarizing Numerical Data

STAT 20: Introduction to Probability and Statistics

Agenda

  • Announcements
  • Reading Questions
  • Break
  • Worksheet: Summarizing Numerical Data
  • R Workshop: Summarizing Numerical Data
  • Appendix: More practice!

Announcements

  • Group tutoring is today in Evans 340 from 5-7pm.
  • Lab 1 and Portfolio 1 are due tomorrow at 8pm.

Reading Questions

  • Please put your laptops under your desk and your phones away.
  • Write your name and ID on your answer sheet, and bubble in Version “A”.
  • You may work only with those at your table!

Which of the following plot types for numerical variables maintain all of the information found in the original data set?

  • A. dot plot
  • B. histogram
  • C. violin plot
  • D. box plot
00:30

If you wish to see less detail in your histogram and perform more aggregation, which of the following is the best course of action?

  • A. switch to a dot plot
  • B. switch to a bar chart
  • C. instead of presenting the histogram, display the original data frame with the raw data
  • D. increase the bin width of the histogram
  • E. decrease the bin width of the histogram
00:30
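
The effect of bin width on aggregation can be checked numerically. The sketch below is only an illustration: it uses base R's `hist()` with `plot = FALSE` rather than ggplot2 (where `geom_histogram()` takes a `binwidth` argument), and the values in `x` are made up for the example.

```r
# Hypothetical values, made up for this sketch (not course data).
x <- c(3, 7, 12, 18, 21, 25, 33, 38, 41, 47, 52, 58, 64, 69, 75, 88)

# plot = FALSE returns the bin counts without drawing anything.
narrow <- hist(x, breaks = seq(0, 100, by = 10), plot = FALSE)
wide   <- hist(x, breaks = seq(0, 100, by = 25), plot = FALSE)

length(narrow$counts)  # number of bins when each bin is 10 units wide
length(wide$counts)    # number of bins when each bin is 25 units wide

# Every observation is still counted; only the grouping changes.
sum(narrow$counts)
sum(wide$counts)
```

Comparing the two bin counts shows how the same 16 observations are summarized at different levels of detail.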

Which word best describes a distribution with a long tail stretching out to the left?

  • A. bimodal

  • B. unimodal

  • C. left skewed

  • D. right skewed

00:30

How many more columns will the output of the second summarise() call have than the first?

summarise(penguins, 
          body_mass_median = median(body_mass_g),
          body_mass_IQR = IQR(body_mass_g))

grouped_penguins <- group_by(penguins, species)

summarise(grouped_penguins, 
          body_mass_median = median(body_mass_g),
          body_mass_IQR = IQR(body_mass_g))

  • A. None
  • B. 1
  • C. 2
  • D. 3
01:00
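
One way to check a question like this is to run both summaries and compare their dimensions. The sketch below assumes dplyr is installed and uses a small made-up data frame named `toy` (hypothetical, standing in for `penguins`) so that it runs on its own.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins (values are made up).
toy <- data.frame(
  species = c("A", "A", "B", "B", "C", "C"),
  mass    = c(10, 12, 20, 26, 31, 35)
)

# Summary of the whole data frame.
overall <- summarise(toy,
                     mass_median = median(mass),
                     mass_IQR    = IQR(mass))

# The same summary computed separately for each group.
by_species <- summarise(group_by(toy, species),
                        mass_median = median(mass),
                        mass_IQR    = IQR(mass))

dim(overall)     # rows and columns of the ungrouped summary
dim(by_species)  # rows and columns of the grouped summary
```

The grouped output has one row per group and carries the grouping variable along as its own column.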

Before making a violin plot using ggplot2, how can we determine the order of the violins?

  • A. By using select().

  • B. By using mutate() with factor().

  • C. By using group_by() and summarize().

  • D. By using data.frame().

00:30
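
ggplot2 draws the categories of a discrete variable in the order of its factor levels, so the order can be fixed before plotting. The sketch below assumes dplyr is installed; the data frame `toy` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical data, made up for this sketch.
toy <- data.frame(
  size = c("small", "large", "medium", "small", "large"),
  mass = c(10, 30, 20, 12, 33)
)

# A character column has no stored order; sorting it gives alphabetical order.
sort(unique(toy$size))

# mutate() with factor() sets an explicit order through the levels argument.
toy_ordered <- mutate(toy,
                      size = factor(size, levels = c("small", "medium", "large")))

levels(toy_ordered$size)
```

Passing `toy_ordered` to a plot that maps `size` to an axis would then lay the groups out in the order given by `levels`.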

Break

05:00

Worksheet: Summarizing Numerical Data

Mean, median, mode: which is best?

It depends on the nature of your data and what you seek to capture in your summary.

Get out your worksheet. You’ll be watching a 3-minute video that discusses characteristics of a typical human. Note which numerical summaries are used and what each is used for.

Worksheet: Summarizing Numerical Data

25:00

R Workshop: Summarizing Numerical Data

25:00

End of Lecture

Appendix - More practice!

Describing Shape

Which of these variables do you expect to be uniformly distributed?

  1. bill length of Gentoo penguins
  2. salaries of a random sample of people from California
  3. house sale prices in San Francisco
  4. birthdays of classmates (day of the month)

Please vote at pollev.com.

01:00

General Advice - Measures of Center

  1. Means are often a good default for symmetric data.
  2. Means are sensitive to very large and very small values, so they can be deceptive on skewed data. In that case, use a median.
  3. Modes are often the only option for categorical data.
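
The advice above can be tried on a small example. The numbers below are invented for illustration, and `most_common()` is a hypothetical helper written for this sketch, since base R has no built-in function for the statistical mode.

```r
# Hypothetical salaries in thousands of dollars, made up for illustration.
salaries <- c(40, 45, 50, 55, 60, 65, 500)

mean(salaries)    # pulled upward by the single very large value
median(salaries)  # the middle value; unchanged however large the maximum is

# Hypothetical helper: the most frequent category of a categorical variable.
most_common <- function(x) names(which.max(table(x)))

most_common(c("Adelie", "Gentoo", "Adelie", "Chinstrap"))
```

With one extreme salary in the data, the mean sits well above most of the observations while the median stays in the middle of them.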

But there are other notions of typical… what about a maximum?

Concept Question 3 - Measures of Spread

  • Why are measures of spread so important? Consider the following question.

Two new food delivery services open in Berkeley: Oski Eats and Cal Cravings. A friend of yours who took Stat 20 collected data on each and found that Oski Eats has a mean delivery time of 29 minutes and Cal Cravings has a mean delivery time of 27 minutes. Which would you rather order from?

One possible reality

Which would you rather order from?

01:00
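
To see why the means alone may not settle the question, here is a minimal sketch of one scenario. Only the two means (29 and 27 minutes) come from the question; the individual delivery times are invented so that the means match and are not real data.

```r
# Hypothetical delivery times in minutes, invented to match the stated means.
oski_eats    <- c(27, 28, 29, 29, 30, 31)
cal_cravings <- c(5, 10, 20, 30, 45, 52)

mean(oski_eats)     # 29, as stated in the question
mean(cal_cravings)  # 27, as stated in the question

# Measures of spread tell the two services apart.
sd(oski_eats)
sd(cal_cravings)

IQR(oski_eats)
IQR(cal_cravings)

range(oski_eats)
range(cal_cravings)
```

In this invented scenario the service with the lower mean is also far less predictable, which is the kind of information a measure of center cannot convey.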