# Learning Objectives

In the Asynchronous Lecture

• What is data wrangling and tidy data?
• Data wrangling with dplyr
• select()ing variables
• filter()ing and arrange()ing tidy data
• mutate()ing data to generate new variables
• group_by() and summarize() data to alter the unit of analysis
• Using the pipe operator to chain together data manipulations

In the Synchronous Lecture

• join()ing data
• Reshaping data with tidyr

If you have any questions while watching the pre-recorded material, be sure to write them down and to bring them up during the synchronous portion of the lecture.

# Asynchronous Materials

The following tabs contain pre-recorded lecture materials for class this week. Please review these materials prior to the synchronous lecture.

Total time: Approx. 1 hour 20 minutes

## Selecting

### Code from the video

require(tidyverse)

# Presidential Data Frame
dat <- presidential
dat

# Selecting Variables -----------------------------------------------------

# Allows us to select variables
select(dat,name,start,party)

# Selecting + Renaming + Reordering Variables ----------------------------------------------------

# Reorder Variables in Data Frame
select(dat,party,name,start,end)

# Rename variables on the fly
select(dat,political_party = party, president = name)

# NOTE there is also a rename() function
rename(dat,political_party = party)

# Reordering via slice operator
select(dat,name:end)

# Dropping Variables -----------------------------------------------------

# Drop 1 variable
select(dat,-start)

# Dropping multiple variables
select(dat,-start,-end)
select(dat,-c(start,end))

# Selection Helper Functions ----------------------------------------------

# contains() - extract columns with a specific naming convention.

select(dat,contains("a"))

# starts_with() - extract columns that start with a specific naming convention.

select(dat,starts_with("n"))

# ends_with() - extract columns that end with a specific naming convention.

select(dat,ends_with("t"))

# everything() - extract every remaining column not yet stated in the selection.

select(dat,party,end,everything())

# matches() - extract columns using a regular expression.

select(dat,matches("^s"))

## Filtering/Arranging

### Code from the video

require(tidyverse)

# Presidential Data Frame
dat <- presidential
dat

# Filtering Rows by Variable Values ----------------------------------------------------------

filter(dat, party == "Republican" )

filter(dat, party == "Democratic" )
filter(dat, party != "Republican" )

# Compound logical statements
filter(dat, party == "Democratic" & start <= "2000-01-01")

# Arranging Rows by Variable Values  --------------------------------------

# Arrange by character
arrange(dat,name)
arrange(dat,party)

# Arrange by date
arrange(dat,start)

# Descending order using desc()
arrange(dat,desc(start))

# Arranging by multiple variables
arrange(dat,party,desc(start))

## Mutating

### Code from the video

require(tidyverse)

# Presidential Data Frame
dat <- presidential
dat

# Mutate to generate new variables ----------------------------------------

mutate(dat,
cold_war = start <= "1989-01-01",
cold_war = as.numeric(cold_war),
republican = as.numeric(party == "Republican"),
interaction = cold_war * republican)

# Transmute ---------------------------------------------------------------

transmute(dat,
cold_war = start <= "1989-01-01",
cold_war = as.numeric(cold_war),
republican = as.numeric(party == "Republican"),
interaction = cold_war * republican)

## Grouping and Summarizing

### Code from the video

require(tidyverse)

# Presidential Data Frame
dat <- presidential
dat

# Summarizing -------------------------------------------------------------

# Aggregate variables using summarize
summarize(dat,
days_in_office = mean(end-start),
max = max(end-start),
min = min(end-start))

# Helpful Summary Functions: n() & n_distinct()
summarize(dat,
N = n(),
N_party = n_distinct(party))

# Group By + Summarizing --------------------------------------------------

# Group by a variable
x <- group_by(dat,party)

# Then summarize by group
summarize(x,min_in_office = min(end-start))

# Group by + Mutate -------------------------------------------------------

x <- group_by(dat,party)

mutate(x,min_in_office = min(end-start))

# Feedback

The following survey asks you quick questions regarding the usefulness of the asynchronous lecture materials. Feedback will be used to modify aspect of the asynchronous materials moving forward.

# Practice

These exercises are designed to help you reinforce your grasp of the concepts covered in the asynchronous lecture material.

For the following practice, let’s use the flights dataset from the nycflights13 package.

# Load the packages
require(tidyverse)
require(nycflights13)

For information on the dataset, please consult the documentation.

?flights

Let’s store the data in an object called dat.

dat <- flights

## Question 1

Tell me about the data:

• How many observations (i.e. flights) are in the data?
• How many carriers are in the data?
• How many different destination airports are there in the data?

### _

# How many observations (i.e. flights) are in the data?
dat %>% count()
## # A tibble: 1 x 1
##        n
##    <int>
## 1 336776
# How many carriers are in the data?
dat %>% summarize(n_distinct(carrier))
## # A tibble: 1 x 1
##   n_distinct(carrier)
##                   <int>
## 1                    16
# How many different origin airports are there in the data?
dat %>% summarize(n_distinct(dest))
## # A tibble: 1 x 1
##   n_distinct(dest)
##                <int>
## 1                105

## Question 2

How many flights are there by month? Which month is the slowest (i.e. which month was there the least number of flights)?

### _

dat %>%
group_by(month) %>%
count() %>%
arrange(n)
## # A tibble: 12 x 2
## # Groups:   month [12]
##    month     n
##    <int> <int>
##  1     2 24951
##  2     1 27004
##  3    11 27268
##  4     9 27574
##  5    12 28135
##  6     6 28243
##  7     4 28330
##  8     5 28796
##  9     3 28834
## 10    10 28889
## 11     8 29327
## 12     7 29425
# Looks like February has the least number of total flights!

## Question 3

Which is the slowest flying day by month?

### _

dat %>%
group_by(month,day) %>%
count() %>%
group_by(month) %>%
filter(n == min(n)) %>%
arrange(month)
## # A tibble: 13 x 3
## # Groups:   month [12]
##    month   day     n
##    <int> <int> <int>
##  1     1    19   674
##  2     2     2   682
##  3     3     2   765
##  4     3     9   765
##  5     4    27   757
##  6     5    25   728
##  7     6     1   754
##  8     7     4   737
##  9     8    31   680
## 10     9    28   682
## 11    10    12   676
## 12    11    28   634
## 13    12     7   691

The following materials were generated for students enrolled in PPOL670. Please do not distribute without permission.

ed769@georgetown.edu | www.ericdunford.com