Learning Objectives


In the Asynchronous Lecture


In the Synchronous Lecture


If you have any questions while watching the pre-recorded material, be sure to write them down and to bring them up during the synchronous portion of the lecture.




Asynchronous Materials


The following tabs contain pre-recorded lecture materials for class this week. Please review these materials prior to the synchronous lecture.

Total time: Approx. 1 hour 20 minutes


_




Data Wrangling

Selecting

Code from the video

require(tidyverse)

# Presidential Data Frame
dat <- presidential
dat

# Selecting Variables -----------------------------------------------------

    # Allows us to select variables 
    select(dat,name,start,party)

# Selecting + Renaming + Reordering Variables ----------------------------------------------------

    # Reorder Variables in Data Frame
    select(dat,party,name,start,end)

    # Rename variables on the fly
    select(dat,political_party = party, president = name)
    
    # NOTE there is also a rename() function
    rename(dat,political_party = party)
    
    # Reordering via slice operator 
    select(dat,name:end)


# Dropping Variables -----------------------------------------------------

    # Drop 1 variable
    select(dat,-start)
    
    # Dropping multiple variables
    select(dat,-start,-end)
    select(dat,-c(start,end))


# Selection Helper Functions ----------------------------------------------

    # contains() - extract columns with a specific naming convention.
    
    select(dat,contains("a"))
    
    # starts_with() - extract columns that start with a specific naming convention.
    
    select(dat,starts_with("n"))
    
    # ends_with() - extract columns that end with a specific naming convention.
    
    select(dat,ends_with("t"))
    
    # everything() - extract every remaining column not yet stated in the selection.
    
    select(dat,party,end,everything())
    
    # matches() - extract columns using a regular expression.
    
    select(dat,matches("^s"))



Filtering/Arranging

Code from the video

require(tidyverse)

# Presidential Data Frame
dat <- presidential
dat


# Filtering Rows by Variable Values ----------------------------------------------------------

    filter(dat, party == "Republican" )

    filter(dat, party == "Democratic" )
    filter(dat, party != "Republican" )
    
    # Compound logical statements
    filter(dat, party == "Democratic" & start <= "2000-01-01")


# Arranging Rows by Variable Values  --------------------------------------

    # Arrange by character
    arrange(dat,name)
    arrange(dat,party)

    # Arrange by date
    arrange(dat,start)
    
    # Descending order using desc()
    arrange(dat,desc(start))
    
    # Arranging by multiple variables 
    arrange(dat,party,desc(start))



Mutating

Code from the video

require(tidyverse)

# Presidential Data Frame
dat <- presidential
dat

# Mutate to generate new variables ----------------------------------------
  
    mutate(dat, 
           cold_war = start <= "1989-01-01",
           cold_war = as.numeric(cold_war),
           republican = as.numeric(party == "Republican"),
           interaction = cold_war * republican)



# Transmute ---------------------------------------------------------------

    transmute(dat, 
              cold_war = start <= "1989-01-01",
              cold_war = as.numeric(cold_war),
              republican = as.numeric(party == "Republican"),
              interaction = cold_war * republican)



Grouping and Summarizing

Code from the video

require(tidyverse)

# Presidential Data Frame
dat <- presidential
dat

# Summarizing -------------------------------------------------------------

    # Aggregate variables using summarize
    summarize(dat,
              days_in_office = mean(end-start),
              max = max(end-start),
              min = min(end-start))

    # Helpful Summary Functions: n() & n_distinct()
    summarize(dat,
              N = n(),
              N_party = n_distinct(party))
    

# Group By + Summarizing --------------------------------------------------

    # Group by a variable
    x <- group_by(dat,party)
    
    # Then summarize by group
    summarize(x,min_in_office = min(end-start))
    
    

# Group by + Mutate -------------------------------------------------------

    x <- group_by(dat,party)
    
    mutate(x,min_in_office = min(end-start))



[Bonus/Optional] Other Useful dplyr Functions

Feedback

The following survey asks you quick questions regarding the usefulness of the asynchronous lecture materials. Feedback will be used to modify aspect of the asynchronous materials moving forward.




Practice


These exercises are designed to help you reinforce your grasp of the concepts covered in the asynchronous lecture material.

For the following practice, let’s use the flights dataset from the nycflights13 package.

# Load the packages
require(tidyverse)
require(nycflights13)

For information on the dataset, please consult the documentation.

?flights

Let’s store the data in an object called dat.

dat <- flights


_

Question 1


Tell me about the data:

  • How many observations (i.e. flights) are in the data?
  • How many carriers are in the data?
  • How many different destination airports are there in the data?


_

Answer

# How many observations (i.e. flights) are in the data?
dat %>% count()
## # A tibble: 1 x 1
##        n
##    <int>
## 1 336776
# How many carriers are in the data?
dat %>% summarize(n_distinct(carrier))
## # A tibble: 1 x 1
##   `n_distinct(carrier)`
##                   <int>
## 1                    16
# How many different `origin` airports are there in the data?
dat %>% summarize(n_distinct(dest))
## # A tibble: 1 x 1
##   `n_distinct(dest)`
##                <int>
## 1                105

Question 2


How many flights are there by month? Which month is the slowest (i.e. which month was there the least number of flights)?


_

Answer

dat %>% 
  group_by(month) %>% 
  count() %>% 
  arrange(n)
## # A tibble: 12 x 2
## # Groups:   month [12]
##    month     n
##    <int> <int>
##  1     2 24951
##  2     1 27004
##  3    11 27268
##  4     9 27574
##  5    12 28135
##  6     6 28243
##  7     4 28330
##  8     5 28796
##  9     3 28834
## 10    10 28889
## 11     8 29327
## 12     7 29425
# Looks like February has the least number of total flights!

Question 3


Which is the slowest flying day by month?


_

Answer

dat %>% 
  group_by(month,day) %>% 
  count() %>% 
  group_by(month) %>% 
  filter(n == min(n)) %>% 
  arrange(month)
## # A tibble: 13 x 3
## # Groups:   month [12]
##    month   day     n
##    <int> <int> <int>
##  1     1    19   674
##  2     2     2   682
##  3     3     2   765
##  4     3     9   765
##  5     4    27   757
##  6     5    25   728
##  7     6     1   754
##  8     7     4   737
##  9     8    31   680
## 10     9    28   682
## 11    10    12   676
## 12    11    28   634
## 13    12     7   691
 

The following materials were generated for students enrolled in PPOL670. Please do not distribute without permission.

ed769@georgetown.edu | www.ericdunford.com