In the Asynchronous Lecture
dplyr
select()
ing variablesfilter()
ing and arrange()
ing tidy datamutate()
ing data to generate new variablesgroup_by()
and summarize()
data to alter the unit of analysispipe
operator to chain together data manipulationsIn the Synchronous Lecture
join()
ing datatidyr
If you have any questions while watching the pre-recorded material, be sure to write them down and to bring them up during the synchronous portion of the lecture.
The following tabs contain pre-recorded lecture materials for class this week. Please review these materials prior to the synchronous lecture.
Total time: Approx. 1 hour 20 minutes
require(tidyverse)
# Presidential Data Frame
dat <- presidential
dat
# Selecting Variables -----------------------------------------------------
# Allows us to select variables
select(dat,name,start,party)
# Selecting + Renaming + Reordering Variables ----------------------------------------------------
# Reorder Variables in Data Frame
select(dat,party,name,start,end)
# Rename variables on the fly
select(dat,political_party = party, president = name)
# NOTE there is also a rename() function
rename(dat,political_party = party)
# Reordering via slice operator
select(dat,name:end)
# Dropping Variables -----------------------------------------------------
# Drop 1 variable
select(dat,-start)
# Dropping multiple variables
select(dat,-start,-end)
select(dat,-c(start,end))
# Selection Helper Functions ----------------------------------------------
# contains() - extract columns with a specific naming convention.
select(dat,contains("a"))
# starts_with() - extract columns that start with a specific naming convention.
select(dat,starts_with("n"))
# ends_with() - extract columns that end with a specific naming convention.
select(dat,ends_with("t"))
# everything() - extract every remaining column not yet stated in the selection.
select(dat,party,end,everything())
# matches() - extract columns using a regular expression.
select(dat,matches("^s"))
require(tidyverse)
# Presidential Data Frame
dat <- presidential
dat
# Filtering Rows by Variable Values ----------------------------------------------------------
filter(dat, party == "Republican" )
filter(dat, party == "Democratic" )
filter(dat, party != "Republican" )
# Compound logical statements
filter(dat, party == "Democratic" & start <= "2000-01-01")
# Arranging Rows by Variable Values --------------------------------------
# Arrange by character
arrange(dat,name)
arrange(dat,party)
# Arrange by date
arrange(dat,start)
# Descending order using desc()
arrange(dat,desc(start))
# Arranging by multiple variables
arrange(dat,party,desc(start))
require(tidyverse)
# Presidential Data Frame
dat <- presidential
dat
# Mutate to generate new variables ----------------------------------------
mutate(dat,
cold_war = start <= "1989-01-01",
cold_war = as.numeric(cold_war),
republican = as.numeric(party == "Republican"),
interaction = cold_war * republican)
# Transmute ---------------------------------------------------------------
transmute(dat,
cold_war = start <= "1989-01-01",
cold_war = as.numeric(cold_war),
republican = as.numeric(party == "Republican"),
interaction = cold_war * republican)
require(tidyverse)
# Presidential Data Frame
dat <- presidential
dat
# Summarizing -------------------------------------------------------------
# Aggregate variables using summarize
summarize(dat,
days_in_office = mean(end-start),
max = max(end-start),
min = min(end-start))
# Helpful Summary Functions: n() & n_distinct()
summarize(dat,
N = n(),
N_party = n_distinct(party))
# Group By + Summarizing --------------------------------------------------
# Group by a variable
x <- group_by(dat,party)
# Then summarize by group
summarize(x,min_in_office = min(end-start))
# Group by + Mutate -------------------------------------------------------
x <- group_by(dat,party)
mutate(x,min_in_office = min(end-start))
dplyr
FunctionsThe following survey asks you quick questions regarding the usefulness of the asynchronous lecture materials. Feedback will be used to modify aspect of the asynchronous materials moving forward.
These exercises are designed to help you reinforce your grasp of the concepts covered in the asynchronous lecture material.
For the following practice, let’s use the flights
dataset from the nycflights13
package.
For information on the dataset, please consult the documentation.
Let’s store the data in an object called dat
.
Tell me about the data:
carrier
s are in the data?dest
ination airports are there in the data?## # A tibble: 1 x 1
## n
## <int>
## 1 336776
## # A tibble: 1 x 1
## `n_distinct(carrier)`
## <int>
## 1 16
## # A tibble: 1 x 1
## `n_distinct(dest)`
## <int>
## 1 105
How many flights are there by month? Which month is the slowest (i.e. which month was there the least number of flights)?
## # A tibble: 12 x 2
## # Groups: month [12]
## month n
## <int> <int>
## 1 2 24951
## 2 1 27004
## 3 11 27268
## 4 9 27574
## 5 12 28135
## 6 6 28243
## 7 4 28330
## 8 5 28796
## 9 3 28834
## 10 10 28889
## 11 8 29327
## 12 7 29425
Which is the slowest flying day by month?
dat %>%
group_by(month,day) %>%
count() %>%
group_by(month) %>%
filter(n == min(n)) %>%
arrange(month)
## # A tibble: 13 x 3
## # Groups: month [12]
## month day n
## <int> <int> <int>
## 1 1 19 674
## 2 2 2 682
## 3 3 2 765
## 4 3 9 765
## 5 4 27 757
## 6 5 25 728
## 7 6 1 754
## 8 7 4 737
## 9 8 31 680
## 10 9 28 682
## 11 10 12 676
## 12 11 28 634
## 13 12 7 691
The following materials were generated for students enrolled in PPOL670. Please do not distribute without permission.
ed769@georgetown.edu | www.ericdunford.com