class: center, middle, inverse, title-slide #
PPOL561 | Accelerated Statistics for Public Policy II
Week 2
Data Wrangling & Presentation
###
Prof. Eric Dunford ◆ Georgetown University ◆ McCourt School of Public Policy ◆
eric.dunford@georgetown.edu
--- layout: true <div class="slide-footer"><span> PPOL561 | Accelerated Statistics for Public Policy II           Week 2 <!-- Week of the Footer Here -->              Wrangling & Presentation <!-- Title of the lecture here --> </span></div> --- class: outline # Outline for Today Cover how to - **_reading in_** - **_manipulating_** - **_piping_** - **_joining_** - **_reshaping_** - **_visualizing_** ...data in `R`. Plus, a brief discussion on **_generating tables_** for model output. --- class: newsection # Data Wrangling --- # What is data wrangling? ![:space 5] - **raw → processed**: the process of transforming data from one format to another. - **converting the structure** to facilitate some analysis + **_altering the unit of analysis_**: going from individuals in a state in a given year to state-year by + changing from a **_"wide"_** (many columns, few rows) **_to a "long" structure_** (few colums, many rows) + **_summarizing data_** across specific subgroups --- ## `tidyverse` approach ![:space 5] We are going to cover the basics of data manipulation and visualization in `R`. By focusing on a suite of packages known as the "[tidyverse](https://www.tidyverse.org)". These packages were designed to ease the process of data manipulation and management so that it is more intuitive, efficient, and interpretable. Specifically, the `tidyverse` is a housing package that holds the following packages: - [readr](http://readr.tidyverse.org/) - for reading data in - [tibble](https://tibble.tidyverse.org/) - for "tidy" data structures - [dplyr](http://dplyr.tidyverse.org/) - for data manipulation - [ggplot2](http://ggplot2.tidyverse.org/) - for data visualization - [tidyr](http://tidyr.tidyverse.org/) - for cleaning - [purrr](http://purrr.tidyverse.org/) - functional programming toolkit --- ## `tidyverse` advantage ![:space 15] Most of everything we will do will require in data manipulation just **one** package ```r # Install tidyverse package install.packages('tidyverse') # Load the package require(tidyverse) ``` --- class: newsection # Reading & Writing Data --- # Data Packages ![:space 5] We can import a large variety of data file types into the `R` programming environment. We are going to focus on **three packages** to import different (but common) data types: - `readr` --- an expansive array of functions to read different data types - `readxl` --- for excel spreadsheets - `haven` --- for SPSS, SAS, and .dta ```r require(readr) # Imported with tidyverse require(readxl) require(haven) ``` --- # Importing/Exporting ![:space 15] .center[ | ![:text_size 6](File type) | ![:text_size 6](package) | ![:text_size 6](read) | ![:text_size 6](write) | |-----------|---------|--------------|---------------| | `.csv` | `readr` | `read_csv()` | `write_csv()` | | `.dta` | `haven` | `read_dta()` | `write_dta()` | | `.xlsx` | `readxl` | `read_excel()` | `write_excel()` | | `.Rdata` | Base `R` | `load()` | `save()` | | `.rds` | `readr` | `read_rds()` | `write_rds()` | | `.tab` | `readr` | `read_tsv()` | `write_tsv()` | | generic | `readr` | `read_table()` | `write_table()` | ] --- ## `tibble()` data frames ![:space 10] Differences between `tibbles` and `data.frames` 1. Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. 2. More explicit errors: tibbles are strict. They throw lots of errors, meaning we catch mistakes early. --- class: newsection # Manipulation --- # Manipulation with `dplyr` ![:space 10] The `dplyr` package (part of the `tidyverse`) offers an intuitive **verb based** approach to data management in `R`. ![:space 5] The goal of the `dplyr` logic is to provide an easy, intuitive naming convention for ubiquitous to data management tasks. --- ## 6 main `dplyr` verbs ![:space 3] - **`select()`**: Pick variables by their names. - **`filter()`**: Pick observations by their values - **`arrange()`**: Reorder the rows. - **`mutate()`**: Create new variables with functions of existing variables. - **`summarise()`**: Collapse many values down to a single summary. - **`group_by()`**: changes the scope of each function from operating on the entire dataset to operating on it group-by-group. --- class:biglist ## How they work... <br> <br> **All verbs work similarly:** 1. The first argument is a data frame. 2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes). 3. The result is a new data frame. --- To walk through the performance of the main `dplyr` verbs, we'll use an internal dataset called `presidential`. ```r ?presidential # for information on the data ``` ```r dat <- presidential head(dat) ``` ``` ## # A tibble: 6 x 4 ## name start end party ## <chr> <date> <date> <chr> ## 1 Eisenhower 1953-01-20 1961-01-20 Republican ## 2 Kennedy 1961-01-20 1963-11-22 Democratic ## 3 Johnson 1963-11-22 1969-01-20 Democratic ## 4 Nixon 1969-01-20 1974-08-09 Republican ## 5 Ford 1974-08-09 1977-01-20 Republican ## 6 Carter 1977-01-20 1981-01-20 Democratic ``` --- # `select()` .center[<img src="Figures/select.png">] --- ```r select(dat,name,party) ``` ``` ## # A tibble: 11 x 2 ## name party ## <chr> <chr> ## 1 Eisenhower Republican ## 2 Kennedy Democratic ## 3 Johnson Democratic ## 4 Nixon Republican ## 5 Ford Republican ## 6 Carter Democratic ## 7 Reagan Republican ## 8 Bush Republican ## 9 Clinton Democratic ## 10 Bush Republican ## 11 Obama Democratic ``` --- Or variable ranges using `:` The following will provide all variables in-between `name` and `end`. ```r select(dat,name:end) ``` ``` ## # A tibble: 11 x 3 ## name start end ## <chr> <date> <date> ## 1 Eisenhower 1953-01-20 1961-01-20 ## 2 Kennedy 1961-01-20 1963-11-22 ## 3 Johnson 1963-11-22 1969-01-20 ## 4 Nixon 1969-01-20 1974-08-09 ## 5 Ford 1974-08-09 1977-01-20 ## 6 Carter 1977-01-20 1981-01-20 ## 7 Reagan 1981-01-20 1989-01-20 ## 8 Bush 1989-01-20 1993-01-20 ## 9 Clinton 1993-01-20 2001-01-20 ## 10 Bush 2001-01-20 2009-01-20 ## 11 Obama 2009-01-20 2017-01-20 ``` --- The **order** in which variables are selected will translate to the output. Thus, one can easily **reorder columns** with `select()`. ```r select(dat,name,end,start) ``` ``` ## # A tibble: 11 x 3 ## name end start ## <chr> <date> <date> ## 1 Eisenhower 1961-01-20 1953-01-20 ## 2 Kennedy 1963-11-22 1961-01-20 ## 3 Johnson 1969-01-20 1963-11-22 ## 4 Nixon 1974-08-09 1969-01-20 ## 5 Ford 1977-01-20 1974-08-09 ## 6 Carter 1981-01-20 1977-01-20 ## 7 Reagan 1989-01-20 1981-01-20 ## 8 Bush 1993-01-20 1989-01-20 ## 9 Clinton 2001-01-20 1993-01-20 ## 10 Bush 2009-01-20 2001-01-20 ## 11 Obama 2017-01-20 2009-01-20 ``` --- We can also easily **rename** variables by simply providing a new name within the function. ```r select(dat,president=name, startdate=start, enddate=end) ``` ``` ## # A tibble: 11 x 3 ## president startdate enddate ## <chr> <date> <date> ## 1 Eisenhower 1953-01-20 1961-01-20 ## 2 Kennedy 1961-01-20 1963-11-22 ## 3 Johnson 1963-11-22 1969-01-20 ## 4 Nixon 1969-01-20 1974-08-09 ## 5 Ford 1974-08-09 1977-01-20 ## 6 Carter 1977-01-20 1981-01-20 ## 7 Reagan 1981-01-20 1989-01-20 ## 8 Bush 1989-01-20 1993-01-20 ## 9 Clinton 1993-01-20 2001-01-20 ## 10 Bush 2001-01-20 2009-01-20 ## 11 Obama 2009-01-20 2017-01-20 ``` --- Lastly, `select()` offers us a convenient way to drop variables by using the same logic that we employed with putting a **negative sign** in front of a dimension. The only difference here is that we can do the same but with a variable name. Here we **drop** the `start` date variable. ```r select(dat,-start) ``` ``` ## # A tibble: 11 x 3 ## name end party ## <chr> <date> <chr> ## 1 Eisenhower 1961-01-20 Republican ## 2 Kennedy 1963-11-22 Democratic ## 3 Johnson 1969-01-20 Democratic ## 4 Nixon 1974-08-09 Republican ## 5 Ford 1977-01-20 Republican ## 6 Carter 1981-01-20 Democratic ## 7 Reagan 1989-01-20 Republican ## 8 Bush 1993-01-20 Republican ## 9 Clinton 2001-01-20 Democratic ## 10 Bush 2009-01-20 Republican ## 11 Obama 2017-01-20 Democratic ``` --- ## `filter()` ![:space 3] .center[<img src="Figures/filter.png">] --- ```r filter(dat,party == "Republican") ``` ``` ## # A tibble: 6 x 4 ## name start end party ## <chr> <date> <date> <chr> ## 1 Eisenhower 1953-01-20 1961-01-20 Republican ## 2 Nixon 1969-01-20 1974-08-09 Republican ## 3 Ford 1974-08-09 1977-01-20 Republican ## 4 Reagan 1981-01-20 1989-01-20 Republican ## 5 Bush 1989-01-20 1993-01-20 Republican ## 6 Bush 2001-01-20 2009-01-20 Republican ``` --- ## `arrange()` ![:space 3] .center[<img src="Figures/arrange.png">] --- ```r arrange(dat,party) ``` ``` ## # A tibble: 11 x 4 ## name start end party ## <chr> <date> <date> <chr> ## 1 Kennedy 1961-01-20 1963-11-22 Democratic ## 2 Johnson 1963-11-22 1969-01-20 Democratic ## 3 Carter 1977-01-20 1981-01-20 Democratic ## 4 Clinton 1993-01-20 2001-01-20 Democratic ## 5 Obama 2009-01-20 2017-01-20 Democratic ## 6 Eisenhower 1953-01-20 1961-01-20 Republican ## 7 Nixon 1969-01-20 1974-08-09 Republican ## 8 Ford 1974-08-09 1977-01-20 Republican ## 9 Reagan 1981-01-20 1989-01-20 Republican ## 10 Bush 1989-01-20 1993-01-20 Republican ## 11 Bush 2001-01-20 2009-01-20 Republican ``` --- `arrange()` with the internal function `desc()` can change to a **descending** ordering. ```r arrange(dat,desc(start)) ``` ``` ## # A tibble: 11 x 4 ## name start end party ## <chr> <date> <date> <chr> ## 1 Obama 2009-01-20 2017-01-20 Democratic ## 2 Bush 2001-01-20 2009-01-20 Republican ## 3 Clinton 1993-01-20 2001-01-20 Democratic ## 4 Bush 1989-01-20 1993-01-20 Republican ## 5 Reagan 1981-01-20 1989-01-20 Republican ## 6 Carter 1977-01-20 1981-01-20 Democratic ## 7 Ford 1974-08-09 1977-01-20 Republican ## 8 Nixon 1969-01-20 1974-08-09 Republican ## 9 Johnson 1963-11-22 1969-01-20 Democratic ## 10 Kennedy 1961-01-20 1963-11-22 Democratic ## 11 Eisenhower 1953-01-20 1961-01-20 Republican ``` --- ## `mutate()` ![:space 3] .center[<img src="Figures/mutate.png">] --- ```r mutate(dat, # in office during cold war CW = start <= '1990-03-11') ``` ``` ## # A tibble: 11 x 5 ## name start end party CW ## <chr> <date> <date> <chr> <lgl> ## 1 Eisenhower 1953-01-20 1961-01-20 Republican TRUE ## 2 Kennedy 1961-01-20 1963-11-22 Democratic TRUE ## 3 Johnson 1963-11-22 1969-01-20 Democratic TRUE ## 4 Nixon 1969-01-20 1974-08-09 Republican TRUE ## 5 Ford 1974-08-09 1977-01-20 Republican TRUE ## 6 Carter 1977-01-20 1981-01-20 Democratic TRUE ## 7 Reagan 1981-01-20 1989-01-20 Republican TRUE ## 8 Bush 1989-01-20 1993-01-20 Republican TRUE ## 9 Clinton 1993-01-20 2001-01-20 Democratic FALSE ## 10 Bush 2001-01-20 2009-01-20 Republican FALSE ## 11 Obama 2009-01-20 2017-01-20 Democratic FALSE ``` --- `mutate()` also allows us to **_instantly_** use variables we just created. ```r mutate(dat, CW = start <= '1990-03-11', CW = as.numeric(CW)) ``` ``` ## # A tibble: 11 x 5 ## name start end party CW ## <chr> <date> <date> <chr> <dbl> ## 1 Eisenhower 1953-01-20 1961-01-20 Republican 1 ## 2 Kennedy 1961-01-20 1963-11-22 Democratic 1 ## 3 Johnson 1963-11-22 1969-01-20 Democratic 1 ## 4 Nixon 1969-01-20 1974-08-09 Republican 1 ## 5 Ford 1974-08-09 1977-01-20 Republican 1 ## 6 Carter 1977-01-20 1981-01-20 Democratic 1 ## 7 Reagan 1981-01-20 1989-01-20 Republican 1 ## 8 Bush 1989-01-20 1993-01-20 Republican 1 ## 9 Clinton 1993-01-20 2001-01-20 Democratic 0 ## 10 Bush 2001-01-20 2009-01-20 Republican 0 ## 11 Obama 2009-01-20 2017-01-20 Democratic 0 ``` --- Like `mutate()`, `transmute()` provides a method for generating a new variable, but unlike the former, it **returns only the newly created variable**. ```r transmute(dat,CW = start <= '1990-03-11') ``` ``` ## # A tibble: 11 x 1 ## CW ## <lgl> ## 1 TRUE ## 2 TRUE ## 3 TRUE ## 4 TRUE ## 5 TRUE ## 6 TRUE ## 7 TRUE ## 8 TRUE ## 9 FALSE ## 10 FALSE ## 11 FALSE ``` --- ## `summarize()` ```r summarize(dat, days_in_office = mean(end-start), max = max(end-start), min = min(end-start)) ``` ``` ## # A tibble: 1 x 3 ## days_in_office max min ## <drtn> <drtn> <drtn> ## 1 2125.091 days 2922 days 895 days ``` --- There are a number of internal functions that can be used with `mutate()`, `transmute()`, and `summarize()`. - `n()` ::: counts the number of observations - `n_distinct()` ::: counts the number of distinct entries ```r summarize(dat,N=n(),N_party=n_distinct(party)) ``` ``` ## # A tibble: 1 x 2 ## N N_party ## <int> <int> ## 1 11 2 ``` --- ## `group_by()` ![:space 3] .center[<img src="Figures/group_by.png">] --- When used in conjunction with some of the other functions, `group_by()` becomes a powerful **clustering function**. ```r # group by party x <- group_by(dat,party) summarize(x,min_in_office = min(end-start)) ``` ``` ## # A tibble: 2 x 2 ## party min_in_office ## <chr> <drtn> ## 1 Democratic 1036 days ## 2 Republican 895 days ``` --- ### Other functions in the tidyverse... `tally()` or `count()` offers quick counts of a variable which can be quite useful when used alongside some of the other functions. Here we are seeing how many observations we have _by group_. ```r x <- group_by(dat,party) count(x) ``` ``` ## # A tibble: 2 x 2 ## # Groups: party [2] ## party n ## <chr> <int> ## 1 Democratic 5 ## 2 Republican 6 ``` --- ### Other functions in the tidyverse... `recode()` allows you to quickly **recode a variable** (though conditional statements prove to be the most efficient way to do this) ```r mutate(dat,party = recode(party,'Republican'=1,'Democratic'=0)) ``` ``` ## # A tibble: 11 x 4 ## name start end party ## <chr> <date> <date> <dbl> ## 1 Eisenhower 1953-01-20 1961-01-20 1 ## 2 Kennedy 1961-01-20 1963-11-22 0 ## 3 Johnson 1963-11-22 1969-01-20 0 ## 4 Nixon 1969-01-20 1974-08-09 1 ## 5 Ford 1974-08-09 1977-01-20 1 ## 6 Carter 1977-01-20 1981-01-20 0 ## 7 Reagan 1981-01-20 1989-01-20 1 ## 8 Bush 1989-01-20 1993-01-20 1 ## 9 Clinton 1993-01-20 2001-01-20 0 ## 10 Bush 2001-01-20 2009-01-20 1 ## 11 Obama 2009-01-20 2017-01-20 0 ``` --- ### And much more... We've only covered a few functions here. Here are some more... - `sample_n()` - Grab an N random sample of your data - `sample_frac()` - Grab a random sample that is some fraction of your total data - `top_n()` - get the top N number of entries from a data frame - `slice()` - grab specific row ranges - `glimpse()` - quickly preview the data See [here](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) to download a cheatsheet containing an entire list of the tidyverse `dplyr` verbs. --- class:newsection # Piping --- ## Combining `dplyr` functions When we need to do a series of manipulations, we can **perform each manipulation individually and save each entry as a new object** that we write over. ```r x <- filter(presidential,party=='Republican') x <- group_by(x,name) x <- transmute(x,t_in_office = end-start) x <- arrange(x,t_in_office) x ``` ``` ## # A tibble: 6 x 2 ## # Groups: name [5] ## name t_in_office ## <chr> <drtn> ## 1 Ford 895 days ## 2 Bush 1461 days ## 3 Nixon 2027 days ## 4 Eisenhower 2922 days ## 5 Reagan 2922 days ## 6 Bush 2922 days ``` --- Or we can **nest** functions _within_ each other. ```r arrange( transmute( group_by( filter(presidential,party=='Republican'),name), t_in_office = end-start), t_in_office) ``` ``` ## # A tibble: 6 x 2 ## # Groups: name [5] ## name t_in_office ## <chr> <drtn> ## 1 Ford 895 days ## 2 Bush 1461 days ## 3 Nixon 2027 days ## 4 Eisenhower 2922 days ## 5 Reagan 2922 days ## 6 Bush 2922 days ``` -- > The issue with **nesting functions** is that it is <font color = 'red'> difficult to (a) read and (b) detect a mistake!</font> --- ### Piping Functions The **pipe** is a useful tool that allows us to **pass** output from one function to the next. To pipe we write **`%>%`** _in-between_ each function. .center[`data %>% function1() %>% function2()`] We pass the output to specific locations in the proceeding function using the pointer `.` .center[`data %>% function1(arg = .)`] Piping offers is a clean way of manipulating data that is **intuitive and easy to read**. --- Here we **pass** our `data` to `filter()` then to `group_by()` then to `transmute()` and then finally to `arrange()` which returns our output! ```r presidential %>% filter(party=='Republican') %>% group_by(name) %>% transmute(t_in_office = end-start) %>% arrange(t_in_office) ``` ``` ## # A tibble: 6 x 2 ## # Groups: name [5] ## name t_in_office ## <chr> <drtn> ## 1 Ford 895 days ## 2 Bush 1461 days ## 3 Nixon 2027 days ## 4 Eisenhower 2922 days ## 5 Reagan 2922 days ## 6 Bush 2922 days ``` --- ### Two things to keep in mind when piping... 1. Functions **_must_** be linked with `%>%` 2. When functions have **multiple arguments**, point to where the data should go with a period (`.`) ```r data %>% function(arg1= ., arg2=TRUE) ``` --- class:newsection # Joining --- ## `_join` functions (dplyr) ![:space 5] `dplyr` offers a range of joining/merging functions that are more intuitive to use. These functions provide a **SQL framework** that is easier to read and more efficient. - `left_join()` - `right_join()` - `inner_join()` - `full_join()` - `anti_join()` When joining data, you must have a **unique** identifier on the dimension you're matching on. --- Consider the following two example datasets... ```r data_A ``` ``` ## country Var1 ## 1 Nigeria 4 ## 2 England 3 ## 3 Botswana 6 ``` ```r data_B ``` ``` ## country Var2 ## 1 Nigeria Low ## 2 United States High ## 3 Botswana Medium ``` --- ## `left_join()` .center[<img src="Figures/left_join.png", height=100px>] .center[<img src="Figures/left-join.gif", height=400px>] --- ## `left_join()` .center[<img src="Figures/left_join.png", height=100px>] ```r left_join(data_A,data_B,by="country") ``` ``` ## country Var1 Var2 ## 1 Nigeria 4 Low ## 2 England 3 <NA> ## 3 Botswana 6 Medium ``` --- ## `right_join()` .center[<img src="Figures/right_join.png", height=100px>] .center[<img src="Figures/right-join.gif", height=400px>] --- ## `right_join()` .center[<img src="Figures/right_join.png", height=100px>] ```r right_join(data_A,data_B,by="country") ``` ``` ## country Var1 Var2 ## 1 Nigeria 4 Low ## 2 Botswana 6 Medium ## 3 United States NA High ``` --- ## `inner_join()` .center[<img src="Figures/inner_join.png", height=100px>] .center[<img src="Figures/inner-join.gif", height=400px>] --- ## `inner_join()` .center[<img src="Figures/inner_join.png", height=100px>] ```r inner_join(data_A,data_B,by="country") ``` ``` ## country Var1 Var2 ## 1 Nigeria 4 Low ## 2 Botswana 6 Medium ``` --- ### `full_join()` .center[<img src="Figures/full_join.png", height=100px>] .center[<img src="Figures/full-join.gif", height=400px>] --- ### `full_join()` <center> <img src="Figures/full_join.png", height=100px> </center> ```r full_join(data_A,data_B,by="country") ``` ``` ## country Var1 Var2 ## 1 Nigeria 4 Low ## 2 England 3 <NA> ## 3 Botswana 6 Medium ## 4 United States NA High ``` --- ### `anti_join()` .center[<img src="Figures/anti_join_left.png", height=100px>] .center[<img src="Figures/anti-join.gif", height=400px>] --- ### `anti_join()` .center[<img src="Figures/anti_join_left.png", height=100px>] ```r anti_join(data_A,data_B,by="country") ``` ``` ## country Var1 ## 1 England 3 ``` --- ## `bind_rows()` <center> <img src="Figures/rbind.png"> </center> ```r bind_rows(data_A,data_B) ``` ``` ## country Var1 Var2 ## 1 Nigeria 4 <NA> ## 2 England 3 <NA> ## 3 Botswana 6 <NA> ## 4 Nigeria NA Low ## 5 United States NA High ## 6 Botswana NA Medium ``` --- ## `bind_cols()` <center> <img src="Figures/cbind.png"> </center> ```r bind_cols(data_A,data_B) ``` ``` ## country...1 Var1 country...3 Var2 ## 1 Nigeria 4 Nigeria Low ## 2 England 3 United States High ## 3 Botswana 6 Botswana Medium ``` --- ## Disparate column names Sometimes the naming conventions of two datasets don't perfectly align. When this happens, we can specify how data merges onto one another more explicitly using the `by=` argument. Moreover, we can merge on **_more_ than one dimension** by specifying all relevant column names. --- Once again, consider the following example data.. ```r data_A ``` ``` ## country year Var1 ## 1 Nigeria 1999 4 ## 2 England 2001 3 ## 3 Botswana 2000 6 ``` ```r data_B ``` ``` ## country_name year Var2 ## 1 Nigeria 1999 Low ## 2 United States 2004 High ## 3 Botswana 2003 Medium ``` --- ```r full_join(data_A,data_B, by=c('country'='country_name', 'year')) ``` ``` ## country year Var1 Var2 ## 1 Nigeria 1999 4 Low ## 2 England 2001 3 <NA> ## 3 Botswana 2000 6 <NA> ## 4 United States 2004 NA High ## 5 Botswana 2003 NA Medium ``` --- ## Merging as a subsetting strategy Using set operations to subset and join data... .center[<img src="Figures/intersect.gif", height=400px>] --- ## Merging as a subsetting strategy Using set operations to subset and join data... .center[<img src="Figures/setdiff.gif", height=400px>] --- ## Merging as a subsetting strategy Using set operations to subset and join data... <br><br><br> .center[<img src="Figures/union.gif", height=400px>] --- ## Merging as a subsetting strategy Using set operations to subset and join data... <br><br><br><br> .center[<img src="Figures/union-all.gif", height=400px>] --- class:newsection # Reshaping Data --- ![:space 10] Often, we need to alter the structure of a `data.frame` from a **wide format**... ![:space 10] .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> 1992 </th> <th style="text-align:center;"> 1993 </th> <th style="text-align:center;"> 1994 </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Nigeria </td> <td style="text-align:center;"> 9.72 </td> <td style="text-align:center;"> 10.06 </td> <td style="text-align:center;"> 9.66 </td> </tr> <tr> <td style="text-align:center;"> Iran </td> <td style="text-align:center;"> 9.88 </td> <td style="text-align:center;"> 10.86 </td> <td style="text-align:center;"> 9.78 </td> </tr> <tr> <td style="text-align:center;"> Cambodia </td> <td style="text-align:center;"> 10.78 </td> <td style="text-align:center;"> 10.23 </td> <td style="text-align:center;"> 10.61 </td> </tr> <tr> <td style="text-align:center;"> Australia </td> <td style="text-align:center;"> 10.04 </td> <td style="text-align:center;"> 9.37 </td> <td style="text-align:center;"> 10.18 </td> </tr> </tbody> </table> ] --- ...into a **long format** .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> year </th> <th style="text-align:center;"> var </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Nigeria </td> <td style="text-align:center;"> 1992 </td> <td style="text-align:center;"> 9.72 </td> </tr> <tr> <td style="text-align:center;"> Nigeria </td> <td style="text-align:center;"> 1993 </td> <td style="text-align:center;"> 10.06 </td> </tr> <tr> <td style="text-align:center;"> Nigeria </td> <td style="text-align:center;"> 1994 </td> <td style="text-align:center;"> 9.66 </td> </tr> <tr> <td style="text-align:center;"> Iran </td> <td style="text-align:center;"> 1992 </td> <td style="text-align:center;"> 9.88 </td> </tr> <tr> <td style="text-align:center;"> Iran </td> <td style="text-align:center;"> 1993 </td> <td style="text-align:center;"> 10.86 </td> </tr> <tr> <td style="text-align:center;"> Iran </td> <td style="text-align:center;"> 1994 </td> <td style="text-align:center;"> 9.78 </td> </tr> <tr> <td style="text-align:center;"> Cambodia </td> <td style="text-align:center;"> 1992 </td> <td style="text-align:center;"> 10.78 </td> </tr> <tr> <td style="text-align:center;"> Cambodia </td> <td style="text-align:center;"> 1993 </td> <td style="text-align:center;"> 10.23 </td> </tr> <tr> <td style="text-align:center;"> Cambodia </td> <td style="text-align:center;"> 1994 </td> <td style="text-align:center;"> 10.61 </td> </tr> <tr> <td style="text-align:center;"> Australia </td> <td style="text-align:center;"> 1992 </td> <td style="text-align:center;"> 10.04 </td> </tr> <tr> <td style="text-align:center;"> Australia </td> <td style="text-align:center;"> 1993 </td> <td style="text-align:center;"> 9.37 </td> </tr> <tr> <td style="text-align:center;"> Australia </td> <td style="text-align:center;"> 1994 </td> <td style="text-align:center;"> 10.18 </td> </tr> </tbody> </table> ] --- ![:space 20] .pull-left[ ![:center_img 85](Figures/tidyr_logo.png) ] .pull-right[ `tidyr` is a tidyverse package built to help reshape data. The package contains an array of functions that are all useful cleaning a data construct. `tidyr` eases tasks such as: - dropping missing values - filling missing values - separating a column into two variables or uniting two columns into one ] --- ### `pivot_longer()`: from wide-to-long ![:space 15] ![:center_img 100](Figures/gather.png) --- ### `pivot_longer()`: from wide-to-long ![:space 10] ```r dat ``` ``` ## country 1992 1993 1994 ## 1 Nigeria 9.72 10.06 9.66 ## 2 Iran 9.88 10.86 9.78 ## 3 Cambodia 10.78 10.23 10.61 ## 4 Australia 10.04 9.37 10.18 ``` --- ### `pivot_longer()`: from wide-to-long Specify the columns you want to expand ```r dat %>% pivot_longer(cols="1992":"1994") ``` ``` ## # A tibble: 12 x 3 ## country name value ## <fct> <chr> <dbl> ## 1 Nigeria 1992 9.72 ## 2 Nigeria 1993 10.1 ## 3 Nigeria 1994 9.66 ## 4 Iran 1992 9.88 ## 5 Iran 1993 10.9 ## 6 Iran 1994 9.78 ## 7 Cambodia 1992 10.8 ## 8 Cambodia 1993 10.2 ## 9 Cambodia 1994 10.6 ## 10 Australia 1992 10.0 ## 11 Australia 1993 9.37 ## 12 Australia 1994 10.2 ``` --- ### `pivot_longer()`: from wide-to-long Rename the expanded columns ```r dat %>% pivot_longer(cols="1992":"1994", names_to = "year", values_to = "ln_gdppc") ``` ``` ## # A tibble: 12 x 3 ## country year ln_gdppc ## <fct> <chr> <dbl> ## 1 Nigeria 1992 9.72 ## 2 Nigeria 1993 10.1 ## 3 Nigeria 1994 9.66 ## 4 Iran 1992 9.88 ## 5 Iran 1993 10.9 ## 6 Iran 1994 9.78 ## 7 Cambodia 1992 10.8 ## 8 Cambodia 1993 10.2 ## 9 Cambodia 1994 10.6 ## 10 Australia 1992 10.0 ## 11 Australia 1993 9.37 ## 12 Australia 1994 10.2 ``` --- ### `pivot_longer()`: from wide-to-long ![:space 10] Be selective about which columns are expanded and which are not. ```r # variables can be excluded from the reshape dat %>% pivot_longer(cols="1992", names_to = "year", values_to = "ln_gdppc") ``` ``` ## # A tibble: 4 x 5 ## country `1993` `1994` year ln_gdppc ## <fct> <dbl> <dbl> <chr> <dbl> ## 1 Nigeria 10.1 9.66 1992 9.72 ## 2 Iran 10.9 9.78 1992 9.88 ## 3 Cambodia 10.2 10.6 1992 10.8 ## 4 Australia 9.37 10.2 1992 10.0 ``` --- ## `pivot_wider()`: from long-to-wide ![:space 10] .center[<img src ="Figures/spread.png">] --- ## `pivot_wider()`: from long-to-wide ![:space 10] ```r dat ``` ``` ## country year ln_gdppc ## 1 Nigeria 1992 9.72 ## 3 Cambodia 1992 10.78 ## 4 Australia 1992 10.04 ## 5 Nigeria 1993 10.06 ## 6 Iran 1993 10.86 ## 8 Australia 1993 9.37 ## 9 Nigeria 1994 9.66 ## 10 Iran 1994 9.78 ## 11 Cambodia 1994 10.61 ## 12 Australia 1994 10.18 ``` --- ## `pivot_wider()`: from long-to-wide Main arguments: - `names_from`: name of the variable that will be spread out into columns. - `values_from`: values that will populate the cells of each column ![:space 3] ```r dat %>% pivot_wider(names_from = year, values_from = ln_gdppc) ``` ``` ## # A tibble: 4 x 4 ## country `1992` `1993` `1994` ## <fct> <dbl> <dbl> <dbl> ## 1 Nigeria 9.72 10.1 9.66 ## 2 Cambodia 10.8 NA 10.6 ## 3 Australia 10.0 9.37 10.2 ## 4 Iran NA 10.9 9.78 ``` --- ## `pivot_wider()`: from long-to-wide Main arguments: - `names_from`: name of the variable that will be spread out into columns. - `values_from`: values that will populate the cells of each column. - `values_fill`: specify what value a missing value should take on. ```r dat %>% pivot_wider(names_from = year, values_from = ln_gdppc, values_fill = list(ln_gdppc = -99)) ``` ``` ## # A tibble: 4 x 4 ## country `1992` `1993` `1994` ## <fct> <dbl> <dbl> <dbl> ## 1 Nigeria 9.72 10.1 9.66 ## 2 Cambodia 10.8 -99 10.6 ## 3 Australia 10.0 9.37 10.2 ## 4 Iran -99 10.9 9.78 ``` --- ## Legacy code ![:space 5] Note that `pivot_wider()` and `pivot_longer()` are relatively new wrappers for the older `gather()` and `spread()` functions. ![:space 5] What does this mean for you? - Answers online and in the reading on how to reshape data may differ depending on the preference of the source. - Use whatever versions of the function that work best for you. - Many find that `pivot_wider()` and `pivot_longer()` are easier to use than the older `gather()` and `spread()` functions. --- class:newsection # Visualization --- .pull-left[<br><br><br><br><img src = "Figures/ggplot-hex.png">] .pull-right[ `ggplot2` (a part of the `tidyverse` package) is a power graphics package that offers a flexible and intuitive graphics language capable of building sophisticated graphics. <br><br> `ggplot` has a **special syntax** that we'll have to get used to, _but_ once we understand the basics, we'll be able to produce some advanced and sophisticated graphics with ease! ] --- .pull-left[<br><br><br><br><img src = "Figures/ggplot-hex.png">] .pull-right[ `ggplot2` is based on a **grammar of graphics**. In essence, you can build every graph from the same components that follow the same intuitive naming conventions. Every graph is composed of 1. a **dataset** 2. **coordinate system** 2. **mappings** → the variables we're aiming to visualize 3. **geom**etric expressions of how the data should be projected onto a space ] --- ### (1) data Let's use the `diamonds` data, which is an example dataset provided by `ggplot` that contains the prices and other attributes of almost 54,000 diamonds. ```r glimpse(diamonds) ``` ``` ## Rows: 53,940 ## Columns: 10 ## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0… ## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ve… ## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I… ## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1,… ## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 6… ## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 5… ## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 3… ## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4… ## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4… ## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2… ``` --- ### (2) coordinate system Use the `ggplot()` function to establish the coordinate system. ```r ggplot(data=diamonds) ``` <img src="week-02_ppol561_files/figure-html/unnamed-chunk-49-1.png" style="display: block; margin: auto;" /> --- ### (3) mappings What variables from the data do we want to map to the projected space? - What variable makes up the y-axis? - What variable makes up the x-axis? - Are there any variables to group by? (More on this later) -- <br><br> Need to use a special function `aes()` (short for "aesthetics") to map variables from the data to the geometric space. Whenever we want to plot a variable feature, we **_must_** wrap it in the `aes()` function. --- ### (3) mappings What variables from the data do we want to map to the projected space? ```r ggplot(data=diamonds,aes(x=price,y=carat)) ``` <img src="week-02_ppol561_files/figure-html/unnamed-chunk-50-1.png" style="display: block; margin: auto;" /> --- ### (4) geom → projection How should your mappings be projected onto the coordinate space? ```r ggplot(data=diamonds,aes(x=price,y=carat)) + * geom_point() ``` <img src="week-02_ppol561_files/figure-html/unnamed-chunk-51-1.png" style="display: block; margin: auto;" /> --- ### (4) geom → projection How should your mappings be projected onto the coordinate space? .pull-left[ - `geom_` are aesthetic **layers** that are mapped onto the plot. - We "add" layers and design preferences `+`. - We can add as many layers as we want. Layers placed on top of one another in accordance with the order that they are specified. - Plots can be assigned as objects and rendered later. ] .pull-right[ ```r ggplot(data=diamonds, aes(x=price,y=carat)) + geom_point() ``` <img src="week-02_ppol561_files/figure-html/unnamed-chunk-52-1.png" style="display: block; margin: auto;" /> ] --- .center[ <font color = "green">`ggplot`</font>(data = `<DATA>`) `+` <font color = "green">`<GEOM_FUNCTION>`</font>(mapping = <font color = "green">aes</font>(`<MAPPINGS>`)) ] -- .center[ `+` <font color = "green">`<GEOM_FUNCTION>`</font>(mapping = <font color = "green">aes</font>(`<MAPPINGS>`)) `+` <font color = "green">`<GEOM_FUNCTION>`</font>(mapping = <font color = "green">aes</font>(`<MAPPINGS>`)) `+` <font color = "green">`<GEOM_FUNCTION>`</font>(mapping = <font color = "green">aes</font>(`<MAPPINGS>`)) `$$\vdots$$` ] --- .center[ <font color = "green">`ggplot`</font>(data = `<DATA>`) `+` <font color = "green">`<GEOM_FUNCTION>`</font>(mapping = <font color = "green">aes</font>(`<MAPPINGS>`)) ] .center[ `+` <font color = "red">`<SCALE_FUNCTION>`</font>(mapping = <font color = "green">aes</font>(`<MAPPINGS>`)) `+` <font color = "blue">`<THEME_FUNCTION>`</font>(mapping = <font color = "green">aes</font>(`<MAPPINGS>`)) `+` <font color = "orange">`<FACET_FUNCTION>`</font>(mapping = <font color = "green">aes</font>(`<MAPPINGS>`)) `$$\vdots$$` ] --- ### One variable? .center[ | Expression | Function | | |----|----|-----| | Area | `geom_area()` | <img src = "Figures/geom_area.png"> | | Density | `geom_density()` | <img src = "Figures/geom_density.png"> | | Dots | `geom_dotplot()` | <img src = "Figures/geom_dotplot.png"> | | Frequencies | `geom_freqpoly()` | <img src = "Figures/geom_freqplot.png"> | | Histogram | `geom_histogram()` | <img src = "Figures/geom_histogram.png"> | ] --- ### Two variables? .center[ | Expression | Function | | |----|----|-----| | Continuous Points | `geom_point()` | <img src = "Figures/geom_point.png"> | | Continuous Lines | `geom_line()` | <img src = "Figures/geom_line.png"> | | Discrete Counts | `geom_count()` | <img src = "Figures/geom_count.png"> | | Continuous and Discrete Distributions | `geom_boxplot()` | <img src = "Figures/geom_boxplot.png"> | | Densities | `geom_hex()` | <img src = "Figures/geom_hex.png"> | ] --- ### Three variables? .center[ | Expression | Function | | |----|----|-----| | Densities | `geom_contour()` | <img src = "Figures/geom_contour.png"> | | Intensities | `geom_tile()` | <img src = "Figures/geom_tile.png"> | | Intensities | `geom_raster()` | <img src = "Figures/geom_raster.png"> | | Spatial | `geom_map()` | <img src = "Figures/geom_map.png"> | ] -- Just a taste. Wide array of ways to express data in a geometric space. See reading and [data visualization cheatsheet](https://github.com/tidyverse/ggplot2) for guidance. --- ### Function Types in `ggplot2` ![:space 5] | Type | Function Header | Description | |------|-----------------|-------------| | Generate layers from data | `geom_` | Use a geom function to represent data points, use the geom’s aesthetic properties to represent variables. Each function returns a layer. | | Construct statistics layers | `stat_` | A stat builds new variables to plot (e.g., count, prop) | | Change mapping characteristics | `scale_` | Scales map data values to the visual values of an aesthetic. To change a mapping, add a new scale. | | Generate subplots | `facet_` | Facets divide a plot into subplots based on the values of one or more discrete variables. | | Alter the plots theme | `theme_` | Change the aesthetics of the plot background and feature (e.g. axes, text, grid lines, etc.) | --- ### Exporting Plots Note that `ggplot` objects can assigned to an object. ```r my_plot <- ggplot(cars,aes(speed,dist)) + geom_point() my_plot ``` <img src="week-02_ppol561_files/figure-html/unnamed-chunk-53-1.png" style="display: block; margin: auto;" /> We can export (or build off of) these plot objects using `ggsave()` ```r ggsave(plot = my_plot,filename = "my_plot.pdf", device = "pdf",width=5,height = 5) ggsave(plot = my_plot,filename = "my_plot.png",device = "png",dpi = 300) ``` > Supports "eps", "ps", "tex" (pictex), "pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only).