Learning Objectives

In the Asynchronous Lecture

What is data wrangling and tidy data?
Data wrangling with dplyr
- select()ing variables
- filter()ing and arrange()ing tidy data
- mutate()ing data to generate new variables
- group_by() and summarize() data to alter the unit of analysis
Using the pipe operator to chain together data manipulations

In the Synchronous Lecture

join()ing data
Reshaping data with tidyr

If you have any questions while watching the pre-recorded material, be sure to write them down and to bring them up during the synchronous portion of the lecture.

Asynchronous Materials

The following tabs contain pre-recorded lecture materials for class this week. Please review these materials prior to the synchronous lecture.

Total time: Approx. 1 hour 20 minutes

_

Data Wrangling

Relevant Slides

Selecting

Relevant Slides

Code from the video

require(tidyverse)

# Presidential Data Frame
dat <- presidential
dat

# Selecting Variables -----------------------------------------------------

    # Allows us to select variables 
    select(dat,name,start,party)

# Selecting + Renaming + Reordering Variables ----------------------------------------------------

    # Reorder Variables in Data Frame
    select(dat,party,name,start,end)

    # Rename variables on the fly
    select(dat,political_party = party, president = name)
    
    # NOTE there is also a rename() function
    rename(dat,political_party = party)
    
    # Reordering via slice operator 
    select(dat,name:end)


# Dropping Variables -----------------------------------------------------

    # Drop 1 variable
    select(dat,-start)
    
    # Dropping multiple variables
    select(dat,-start,-end)
    select(dat,-c(start,end))


# Selection Helper Functions ----------------------------------------------

    # contains() - extract columns with a specific naming convention.
    
    select(dat,contains("a"))
    
    # starts_with() - extract columns that start with a specific naming convention.
    
    select(dat,starts_with("n"))
    
    # ends_with() - extract columns that end with a specific naming convention.
    
    select(dat,ends_with("t"))
    
    # everything() - extract every remaining column not yet stated in the selection.
    
    select(dat,party,end,everything())
    
    # matches() - extract columns using a regular expression.
    
    select(dat,matches("^s"))

Filtering/Arranging

Relevant Slides

Code from the video

require(tidyverse)

# Presidential Data Frame
dat <- presidential
dat


# Filtering Rows by Variable Values ----------------------------------------------------------

    filter(dat, party == "Republican" )

    filter(dat, party == "Democratic" )
    filter(dat, party != "Republican" )
    
    # Compound logical statements
    filter(dat, party == "Democratic" & start <= "2000-01-01")


# Arranging Rows by Variable Values  --------------------------------------

    # Arrange by character
    arrange(dat,name)
    arrange(dat,party)

    # Arrange by date
    arrange(dat,start)
    
    # Descending order using desc()
    arrange(dat,desc(start))
    
    # Arranging by multiple variables 
    arrange(dat,party,desc(start))

Mutating

Relevant Slides

Code from the video

require(tidyverse)

# Presidential Data Frame
dat <- presidential
dat

# Mutate to generate new variables ----------------------------------------
  
    mutate(dat, 
           cold_war = start <= "1989-01-01",
           cold_war = as.numeric(cold_war),
           republican = as.numeric(party == "Republican"),
           interaction = cold_war * republican)



# Transmute ---------------------------------------------------------------

    transmute(dat, 
              cold_war = start <= "1989-01-01",
              cold_war = as.numeric(cold_war),
              republican = as.numeric(party == "Republican"),
              interaction = cold_war * republican)

Grouping and Summarizing

Relevant Slides

Code from the video

require(tidyverse)

# Presidential Data Frame
dat <- presidential
dat

# Summarizing -------------------------------------------------------------

    # Aggregate variables using summarize
    summarize(dat,
              days_in_office = mean(end-start),
              max = max(end-start),
              min = min(end-start))

    # Helpful Summary Functions: n() & n_distinct()
    summarize(dat,
              N = n(),
              N_party = n_distinct(party))
    

# Group By + Summarizing --------------------------------------------------

    # Group by a variable
    x <- group_by(dat,party)
    
    # Then summarize by group
    summarize(x,min_in_office = min(end-start))
    
    

# Group by + Mutate -------------------------------------------------------

    x <- group_by(dat,party)
    
    mutate(x,min_in_office = min(end-start))

Piping

Relevant Slides

[Bonus/Optional] Other Useful `dplyr` Functions

Relevant Slides

Feedback

The following survey asks you quick questions regarding the usefulness of the asynchronous lecture materials. Feedback will be used to modify aspect of the asynchronous materials moving forward.

Practice

These exercises are designed to help you reinforce your grasp of the concepts covered in the asynchronous lecture material.

For the following practice, let’s use the flights dataset from the nycflights13 package.

# Load the packages
require(tidyverse)
require(nycflights13)

For information on the dataset, please consult the documentation.

?flights

Let’s store the data in an object called dat.

dat <- flights

_

Question 1

Tell me about the data:

How many observations (i.e. flights) are in the data?
How many carriers are in the data?
How many different destination airports are there in the data?

_

Answer

# How many observations (i.e. flights) are in the data?
dat %>% count()

## # A tibble: 1 x 1
##        n
##    <int>
## 1 336776

# How many carriers are in the data?
dat %>% summarize(n_distinct(carrier))

## # A tibble: 1 x 1
##   `n_distinct(carrier)`
##                   <int>
## 1                    16

# How many different `origin` airports are there in the data?
dat %>% summarize(n_distinct(dest))

## # A tibble: 1 x 1
##   `n_distinct(dest)`
##                <int>
## 1                105

Question 2

How many flights are there by month? Which month is the slowest (i.e. which month was there the least number of flights)?

_

Answer

dat %>% 
  group_by(month) %>% 
  count() %>% 
  arrange(n)

## # A tibble: 12 x 2
## # Groups:   month [12]
##    month     n
##    <int> <int>
##  1     2 24951
##  2     1 27004
##  3    11 27268
##  4     9 27574
##  5    12 28135
##  6     6 28243
##  7     4 28330
##  8     5 28796
##  9     3 28834
## 10    10 28889
## 11     8 29327
## 12     7 29425

# Looks like February has the least number of total flights!

Question 3

Which is the slowest flying day by month?

_

Answer

dat %>% 
  group_by(month,day) %>% 
  count() %>% 
  group_by(month) %>% 
  filter(n == min(n)) %>% 
  arrange(month)

## # A tibble: 13 x 3
## # Groups:   month [12]
##    month   day     n
##    <int> <int> <int>
##  1     1    19   674
##  2     2     2   682
##  3     3     2   765
##  4     3     9   765
##  5     4    27   757
##  6     5    25   728
##  7     6     1   754
##  8     7     4   737
##  9     8    31   680
## 10     9    28   682
## 11    10    12   676
## 12    11    28   634
## 13    12     7   691

The following materials were generated for students enrolled in PPOL670. Please do not distribute without permission.

ed769@georgetown.edu | www.ericdunford.com

Back to Course Website

Data Wrangling

PPOL 670 | Introduction to Data Science

Lecture Materials for Week 4

Professor Eric Dunford (ed769@georgetown.edu)
McCourt School of Public Policy, Georgetown University

Learning Objectives

Asynchronous Materials

_

Data Wrangling

Relevant Slides

Selecting

Relevant Slides

Code from the video

Filtering/Arranging

Relevant Slides

Code from the video

Mutating

Relevant Slides

Code from the video

Grouping and Summarizing

Relevant Slides

Code from the video

Piping

Relevant Slides

[Bonus/Optional] Other Useful `dplyr` Functions

Relevant Slides

Feedback

Practice

_

Question 1

_

Answer

Question 2

_

Answer

Question 3

_

Answer

Back to Course Website Data Wrangling

PPOL 670 | Introduction to Data Science Lecture Materials for Week 4

Professor Eric Dunford (ed769@georgetown.edu) McCourt School of Public Policy, Georgetown University

Learning Objectives

Asynchronous Materials

_

Data Wrangling

Selecting

Code from the video

Filtering/Arranging

Code from the video

Mutating

Code from the video

Grouping and Summarizing

Code from the video

Piping

[Bonus/Optional] Other Useful dplyr Functions

Practice

_

Question 1

_

Answer

Question 2

_

Answer

Question 3

_

Answer

Back to Course Website

Data Wrangling

PPOL 670 | Introduction to Data Science

Lecture Materials for Week 4

Professor Eric Dunford (ed769@georgetown.edu)
McCourt School of Public Policy, Georgetown University

[Bonus/Optional] Other Useful `dplyr` Functions