Numerical and Visual Summaries

STAT 20: Introduction to Probability and Statistics

Agenda

  • Announcements

  • Conceptual Review: bringing three reading notes together (mini-lecture/chart)

  • Coding Review

  • Break

  • Concept Questions

  • Work time on assignments

Announcements

  • Quiz 1 is on Monday. It covers Understanding the World with Data through Summarizing Numerical Data.
  • Problem Set 1 can now be turned in through Friday night for full credit.
  • Lab: Getting Started is due Monday, June 24 at 12pm. Make sure you read the Lab Submission Guidelines posted to Ed.

Conceptual Review

Break

Concept Questions

Concept Question 1 - Taxonomy of Data

Images as data

  • Images are composed of pixels (this image is 1020 by 1520)

  • The color of each pixel is stored in three bands: red, green, and blue (RGB)

  • Each band takes an integer value from 0 to 255

  • This image is data with 1020 x 1520 x 3 values.

A shoebill with a duck in its mouth.

Grayscale

  • Grayscale images have only one band
  • 0 is black, 255 is white
  • This image is data with 1020 x 1520 x 1 values.
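As a rough sketch, the value counts quoted on these slides can be checked with a little arithmetic. The helper name `n_values` is hypothetical, and the 1020 x 1520 dimensions are taken from the bullets above.

```python
# Dimensions as stated on the slides (assumed: 1020 by 1520 pixels).
HEIGHT, WIDTH = 1020, 1520

def n_values(height, width, bands):
    """Count the numbers needed to store an image: one per pixel per band."""
    return height * width * bands

rgb_values = n_values(HEIGHT, WIDTH, bands=3)   # red, green, and blue bands
gray_values = n_values(HEIGHT, WIDTH, bands=1)  # a single grayscale band

print(rgb_values)   # 4651200
print(gray_values)  # 1550400
```

The grayscale image stores one third as many values as the RGB image of the same size.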

To simplify, assume our photos are 8 x 8 grayscale images.

A shoebill with a duck in its mouth in grayscale.

Images in a Data Frame

If you were to put the data from these (8 x 8 grayscale) images into a data frame, what would the dimensions of that data frame be in rows x columns? Answer at pollev.com/jeremysanchez.

Concept Questions 2 and 3 - Summarizing Categorical Data

Concept Question 2a

The table below displays data from a survey on a class of students.

What proportion of the entire class was in the marching band?

Concept Question 2b

What were the dimensions of the raw data from which this table was constructed? (rows x cols)

Concept Question 3

Below is a two-variable bar chart describing the political party affiliation and college degree status of 500 survey participants.

Concept Question 3 (cont.)

Based on the graphic on the previous slide, which group is the largest?

  • Democrats without a college degree
  • Democrats with a college degree
  • Republicans with a college degree
  • Republicans without a college degree

Concept Question 3 (cont.)

  • The regular stacked bar chart of counts preserves the original counts and is therefore good for comparing joint proportions.
  • The stacked, normalized bar chart shows conditional proportions and is therefore good for showing associations between variables.
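To make the distinction between joint and conditional proportions concrete, here is a minimal sketch in Python. The counts are hypothetical, chosen only so that they sum to 500. They are not the survey data behind the chart.

```python
# Hypothetical counts for 500 respondents (NOT the data shown in the chart).
counts = {
    ("Democrat", "degree"): 120,
    ("Democrat", "no degree"): 130,
    ("Republican", "degree"): 90,
    ("Republican", "no degree"): 160,
}
total = sum(counts.values())

# Joint proportions: each cell divided by the grand total.
# This is what the heights in a stacked bar chart of counts let you compare.
joint = {cell: n / total for cell, n in counts.items()}

# Conditional proportions: each cell divided by the total for its party.
# This is what a stacked, normalized bar chart displays.
party_totals = {}
for (party, _), n in counts.items():
    party_totals[party] = party_totals.get(party, 0) + n
conditional = {
    (party, degree): n / party_totals[party]
    for (party, degree), n in counts.items()
}

print(joint[("Democrat", "degree")])          # share of all 500 respondents
print(conditional[("Democrat", "degree")])    # share of Democrats with a degree
print(conditional[("Republican", "degree")])  # share of Republicans with a degree
```

If the conditional proportions differ noticeably between the two parties, that suggests an association between party and degree status. The joint proportions alone make that harder to see when the groups differ in size.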

Concept Activity 4 - Summarizing Numerical Data (Measures of Center)

Mean, median, mode: which is best?

It depends on your desiderata: the nature of your data and what you seek to capture in your summary.

Get out a piece of paper. You’ll be watching a 3-minute video that discusses characteristics of a typical human. Note which numerical summaries are used and what each one is used for.

General Advice

  1. Means are often a good default for symmetric data.
  2. Means are sensitive to very large and very small values, so they can be deceptive on skewed data. Use a median instead.
  3. Modes are often the only option for categorical data.

But there are other notions of typical, depending on the context.
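The advice above can be illustrated with a small sketch using Python's standard `statistics` module. Both data sets are hypothetical and made up for illustration.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data: five moderate values and one very large one.
skewed = [30, 35, 40, 45, 50, 400]

skewed_mean = mean(skewed)      # equals 100, pulled upward by the value 400
skewed_median = median(skewed)  # equals 42.5, close to most of the observations

# Hypothetical categorical data: a mean or median is not defined here,
# but the mode (the most common category) is.
colors = ["red", "blue", "red", "green"]
most_common = mode(colors)      # "red"

print(skewed_mean, skewed_median, most_common)
```

Only one of the six values is as large as the mean, while the median sits in the middle of the bulk of the data. That is why the median is the safer summary of a typical value when data are skewed.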

Wrapup - Summarizing Distributions of Data

  • You can construct a statistical graphic to show the shape, which you can describe in terms of modality and skew.
  • You can calculate a measure of center to convey a sense of a typical observation.
  • You can calculate a measure of spread to capture how much variability there is in the data.
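As a minimal sketch of the last two points, the following Python snippet computes a few measures of center and spread for a small hypothetical data set. It uses the standard `statistics` module. Quartile conventions differ between software packages, so the IQR from another tool may not match the one computed here.

```python
from statistics import mean, median, quantiles, stdev

# Hypothetical numerical data, made up for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of center
center_mean = mean(data)      # equals 5
center_median = median(data)  # equals 4.5

# Measures of spread
data_range = max(data) - min(data)  # largest minus smallest value
q1, _, q3 = quantiles(data, n=4)    # quartiles, using Python's default method
iqr = q3 - q1                       # interquartile range
sd = stdev(data)                    # sample standard deviation

print(center_mean, center_median)
print(data_range, iqr, round(sd, 3))
```

The range depends only on the two most extreme values, while the IQR describes the middle half of the data, so the IQR is less affected by outliers.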

Free time
