Taxonomy of Data

STAT 20: Introduction to Probability and Statistics

Agenda

  • Reading Questions: Taxonomy of Data
  • Break
  • Worksheet: Taxonomy of Data
  • Break
  • R Workshop
  • Appendix

Reading Questions

  • Please put your laptops under your desk and your phones away.
  • Write your name and ID on your answer sheet, and bubble in Version “A”.
  • You may work only with those at your table!

Not counting the course_num, how many variables are being recorded here?

  • A. 3
  • B. 4
  • C. 5
  • D. 6
  • E. 7
00:30

What type of variable is the number of students?

  • A. categorical, ordinal
  • B. categorical, nominal
  • C. numerical, discrete
  • D. numerical, continuous
00:30

What is the most specific term that can be given to this data structure?

  • A. spreadsheet
  • B. table
  • C. contingency table
  • D. data frame
00:30

What is the statistics term used for a property of an object that can be measured and recorded?

  • A. data

  • B. variable

  • C. constant

  • D. observational unit

00:20

Which of the following best describes the distinction between R and RStudio?

  • A. They are synonyms for the same thing.

  • B. R is the language; RStudio is the tool we’ll use to run and write code.

  • C. RStudio is the newer edition of the R language.

00:30

How do you save the output of an R command into your Environment?

  • A. Wrap the command in parentheses ().

  • B. Wrap the command in curly braces {}.

  • C. Use the assignment arrow and provide a name for your output to the environment.

  • D. Drag and drop the command into your environment.

00:30

The atomic building block of the R language is the …

  • A. vector

  • B. variable

  • C. numeric

  • D. factor

  • E. data frame

00:20

Break

05:00

Worksheet: Taxonomy of Data

  • Keep your laptops under your desk for now, or close them on your desk.
  • Work in pairs or in groups of three!
25:00

A note on variables

  • Depending on how a variable is recorded, it can take different types!

What type of variable is age? (Part 1)

  • Age groups of television audiences/demographics

Answer at pollev.com.

00:15

What type of variable is age? (Part 2)

  • Ages of patients at a doctor’s office, as they might fill out on an intake form

Answer at pollev.com.

00:15
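
The same underlying quantity can be stored in more than one way in R. The sketch below is illustrative only: the ages, the cut points, and the group labels are made-up assumptions, not course data. It records age once as a number and once as an ordered set of age groups, using the base R function cut().

```r
# Hypothetical ages in years, as someone might write them on an intake form.
ages <- c(8, 23, 35, 52, 67)

# The same ages binned into hypothetical audience groups.
# cut() uses right-closed intervals by default: (0,17], (17,34], (34,54], (54,Inf].
age_group <- cut(
  ages,
  breaks = c(0, 17, 34, 54, Inf),
  labels = c("under 18", "18-34", "35-54", "55+"),
  ordered_result = TRUE
)

class(ages)       # a numeric vector
class(age_group)  # an ordered factor: categories with a natural order
age_group
```

Recording the groups throws away information: the numeric ages can always be binned, but the bins cannot be turned back into exact ages.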

Break

05:00

R Workshop

Working in a qmd file

Working in a new .qmd file allows you to save your code for later.

Demo

  1. Create a new qmd file from the RStudio menu, name it, and save it.
  2. Insert a new code cell.
  3. Write your code into the cell.
  4. Render the document.

Coding tutorial: Taxonomy of Data

  • Download the taxonomy-tutorial.qmd file from the Taxonomy of Data Ed Thread.

  • Click the “Upload” button in the bottom right corner of RStudio and upload the file.

25:00

R Workshop

  • Time to make a series of educated guesses. Close your laptops!

Educated Guess 1

What will happen here?


Answer at pollev.com/<name>


1 + "one"
00:30

Educated Guess 2

What will happen here?


Answer at pollev.com/<name>


a <- c(1, 2, 3, 4)
sqrt(log(a))
01:00

Educated Guess 3

What will happen here?


Answer at pollev.com/<name>


a <- 1 + 2
a + 1
01:00

Educated Guess 4

What will happen here?


Answer at pollev.com/<name>


a <- c(1, 3.14, "seven")
class(a)
01:00

End of Lecture

Appendix - more practice!

There’s no escape from the shoebill...

Images as data

  • Images are composed of pixels (this image is 1520 by 1012)

  • The color in each pixel is in RGB

  • Each band takes a value from 0-255

  • This image is data with 1520 x 1012 x 3 values.

A shoebill with a duck in its mouth.

Grayscale

  • Grayscale images have only one band
  • 0 is black, 255 is white
  • This image is data with 1520 x 1012 x 1 values.

A shoebill with a duck in its mouth in grayscale.
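
The value counts on the two slides above can be checked in R. This is a rough sketch that uses placeholder arrays filled with zeros rather than the actual photo; the object names are assumptions chosen for illustration.

```r
# Image size taken from the slide: 1520 by 1012 pixels.
width  <- 1520
height <- 1012

# A color image has three bands (red, green, blue), each valued 0-255.
# Zeros stand in for the real pixel values here.
rgb_image <- array(0L, dim = c(height, width, 3))
dim(rgb_image)     # rows, columns, bands
length(rgb_image)  # 1520 * 1012 * 3 = 4614720 values

# A grayscale image has a single band.
gray_image <- array(0L, dim = c(height, width, 1))
length(gray_image) # 1520 * 1012 * 1 = 1538240 values
```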

Grayscale

  • To simplify, assume our photos are 8 x 8 grayscale images.

An 8 x 8 grayscale image
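
One way to picture an 8 x 8 grayscale image in R is as a matrix of intensities. The sketch below uses made-up pixel values (an evenly spaced ramp), not values from any of the photos.

```r
# 64 hypothetical intensities between 0 (black) and 255 (white).
pixel_values <- seq(0, 252, by = 4)

# Arrange them as an 8 x 8 grid of pixels.
tiny_image <- matrix(pixel_values, nrow = 8, ncol = 8)

dim(tiny_image)     # 8 rows and 8 columns
length(tiny_image)  # 64 values in total
range(tiny_image)   # smallest and largest intensity in this image
```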

Images in a Data Frame

Consider the following images which are our data:

  • Let’s simplify them to 8 x 8 grayscale images

Images in a Data Frame

If you were to put the data from these (8 x 8 grayscale) images into a data frame, what would the dimensions of that data frame be in rows x columns?

Functions on vectors

A vector is the simplest structure used in R to store data. It can be created using the function c().

my_vector <- c(1, 3, 4)
my_vector
[1] 1 3 4

A function operates on an R object and produces output. R has many of the mathematical functions that you would expect.

sum(my_vector)
[1] 8
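
A few more base R functions applied to the same vector, as a short sketch. Some functions collapse the vector to a single value, while others work element by element and return a vector of the same length.

```r
my_vector <- c(1, 3, 4)

# Functions that summarize the whole vector in one value.
length(my_vector)  # number of elements: 3
prod(my_vector)    # product of the elements: 12

# Functions that act on each element and return a vector.
sqrt(my_vector)    # square root of each element
my_vector * 2      # arithmetic is also element-wise: 2 6 8
```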

Your Turn

  1. Create a vector named vec with the even integers between 1 and 10 as well as the number 99 (six elements total).

  2. Find the sum of that vector.

  3. Find the max of that vector.

  4. Take the mean of that vector and round it to the nearest integer.

These should all be solved with R code. If you don’t know the name of a function, guess at a likely name and look for its help file (e.g. ?sum), or search online.
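
A small sketch of how to hunt for a function from inside R. The search pattern used here is only an example; apropos() and args() are base R tools.

```r
# In an interactive session, either of these opens the help page for sum():
#   ?sum
#   help("sum")

# apropos() lists objects whose names match a pattern, which helps
# when you are guessing at a function name.
guesses <- apropos("^max")
print(guesses)

# args() shows the arguments a function accepts.
args(round)
```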

Building a data frame

You can combine vectors into a data frame using data.frame().

bill_depth_mm <- c(15.0, 17.1, 18.7, 18.9)
bill_length_mm <- c(47.5, 40.2, 39.0, 35.3)
species <- c("Gentoo", "Adelie", "Adelie", "Adelie")


penguins_df <- data.frame(bill_depth_mm, bill_length_mm, species)
penguins_df
  bill_depth_mm bill_length_mm species
1          15.0           47.5  Gentoo
2          17.1           40.2  Adelie
3          18.7           39.0  Adelie
4          18.9           35.3  Adelie
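
Once a data frame exists, a few base R functions report its shape and contents. A minimal sketch, rebuilding penguins_df from the same vectors so that the block runs on its own:

```r
bill_depth_mm  <- c(15.0, 17.1, 18.7, 18.9)
bill_length_mm <- c(47.5, 40.2, 39.0, 35.3)
species        <- c("Gentoo", "Adelie", "Adelie", "Adelie")
penguins_df    <- data.frame(bill_depth_mm, bill_length_mm, species)

dim(penguins_df)    # rows, then columns: 4 3
names(penguins_df)  # the variable (column) names
str(penguins_df)    # the type of each column and its first few values

# The $ operator pulls one column out as a vector,
# which can then be passed to a function.
mean(penguins_df$bill_length_mm)  # 40.5
```

Each row is one penguin (an observational unit) and each column is one variable.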

Your Turn

  1. Create a new .qmd file, name it, and save it.

  2. Insert a new code cell.

  3. Create three vectors (name, hometown, and sibs_and_pets) that contain observations on those variables for 6 people in this class.

  4. Combine them into a data frame called my_classmates.