class: center, middle, inverse, title-slide #
PPOL670 | Introduction to Data Science for Public Policy
Week 4
Data Wrangling
###
Prof. Eric Dunford ◆ Georgetown University ◆ McCourt School of Public Policy ◆
eric.dunford@georgetown.edu
--- layout: true <div class="slide-footer"><span> PPOL670 | Introduction to Data Science for Public Policy           Week 4 <!-- Week of the Footer Here -->              Data Wrangling <!-- Title of the lecture here --> </span></div> --- class: outline # Outline for Today ![:space 5] 90% of data science is **data wrangling**, which involves... - **manipulating** - **piping** - **joining** - **reshaping** ...data! --- class: newsection # Data Wrangling --- # What is data wrangling? ![:space 5] - **raw → processed**: the process of transforming data from one format to another. - **converting the structure** to facilitate some analysis + **_altering the unit of analysis_**: going from individuals in a state in a given year to state-year by + changing from a **_"wide"_** (many columns, few rows) **_to a "long" structure_** (few colums, many rows) + **_summarizing data_** across specific subgroups --- ## Thinking about Data Generation - **Data rarely works "out of the box"** + Constructed for a different purpose than what you're using it for + Requires adding additional information: - merging other data into it - creating new variables from old ones - changing the unit of analysis -- - **Codebooks offer clues, but not the story** + If a codebook exists, it can outline the data generation process, variable names, issues with the data, etc. + Codebooks can't won't reveal underlying issues or biases in the data. locating these requires exploration through summarization, visualization, and reshaping the data structure. --- ## Thinking about Data Generation ![:space 5] - **Data Provenance** + Data changes hands, edits are made, variables/observations are accidentally dropped. Without probing, one might overlook flaws. - **Automated Data** + Scraping data can ease collection, but difficult to know where/if the process broke down. + Manipulating data can reveal algorithmic failure/limitations. + Never assume "everything worked" --- ## `tidyverse` approach Throughout this course, we are going to cover the basics of data manipulation and visualization in `R`. By focusing on a suite of packages known as the "[tidyverse](https://www.tidyverse.org)". These packages were designed to ease the process of data manipulation and management so that it is more intuitive, efficient, and interpretable. Specifically, the `tidyverse` is a housing package that holds the following packages: - [readr](http://readr.tidyverse.org/) - for reading data in - [tibble](https://tibble.tidyverse.org/) - for "tidy" data structures - [dplyr](http://dplyr.tidyverse.org/) - for data manipulation - [ggplot2](http://ggplot2.tidyverse.org/) - for data visualization - [tidyr](http://tidyr.tidyverse.org/) - for cleaning - [purrr](http://purrr.tidyverse.org/) - functional programming toolkit --- ## `tidyverse` advantage ![:space 15] Most of everything we will do will require in data manipulation just **one** package ```r # Install tidyverse package install.packages('tidyverse') # Load the package require(tidyverse) ``` --- class:newsection # Tidy Data --- The same data can be represented in many different ways... ![:space 15] .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> year </th> <th style="text-align:center;"> cases </th> <th style="text-align:center;"> population </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 745 </td> <td style="text-align:center;"> 19987071 </td> </tr> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 2666 </td> <td style="text-align:center;"> 20595360 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 37737 </td> <td style="text-align:center;"> 172006362 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 80488 </td> <td style="text-align:center;"> 174504898 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 212258 </td> <td style="text-align:center;"> 1272915272 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 213766 </td> <td style="text-align:center;"> 1280428583 </td> </tr> </tbody> </table> ] --- <br><br> .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> year </th> <th style="text-align:center;"> type </th> <th style="text-align:center;"> count </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> cases </td> <td style="text-align:center;"> 745 </td> </tr> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> population </td> <td style="text-align:center;"> 19987071 </td> </tr> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> cases </td> <td style="text-align:center;"> 2666 </td> </tr> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> population </td> <td style="text-align:center;"> 20595360 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> cases </td> <td style="text-align:center;"> 37737 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> population </td> <td style="text-align:center;"> 172006362 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> cases </td> <td style="text-align:center;"> 80488 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> population </td> <td style="text-align:center;"> 174504898 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> cases </td> <td style="text-align:center;"> 212258 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> population </td> <td style="text-align:center;"> 1272915272 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> cases </td> <td style="text-align:center;"> 213766 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> population </td> <td style="text-align:center;"> 1280428583 </td> </tr> </tbody> </table> ] --- <br><br><br> .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> year </th> <th style="text-align:center;"> rate </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 745/19987071 </td> </tr> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 2666/20595360 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 37737/172006362 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 80488/174504898 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 212258/1272915272 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 213766/1280428583 </td> </tr> </tbody> </table> ] --- <br><br> .center[ <table> <caption>Cases</caption> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> 1999 </th> <th style="text-align:center;"> 2000 </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 745 </td> <td style="text-align:center;"> 2666 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 37737 </td> <td style="text-align:center;"> 80488 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 212258 </td> <td style="text-align:center;"> 213766 </td> </tr> </tbody> </table> ] <br> .center[ <table> <caption>Population</caption> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> 1999 </th> <th style="text-align:center;"> 2000 </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 19987071 </td> <td style="text-align:center;"> 20595360 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 172006362 </td> <td style="text-align:center;"> 174504898 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 1272915272 </td> <td style="text-align:center;"> 1280428583 </td> </tr> </tbody> </table> ] --- There are three interrelated rules which make a dataset **tidy**: <br><br> 1. **Each variable must have its own column.** 2. **Each observation must have its own row.** 3. **Each value must have its own cell.** .center[<img src="Figures/tidy-data.png">] --- ## Tidy! <br><br> .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> year </th> <th style="text-align:center;"> cases </th> <th style="text-align:center;"> population </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 745 </td> <td style="text-align:center;"> 19987071 </td> </tr> <tr> <td style="text-align:center;"> Afghanistan </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 2666 </td> <td style="text-align:center;"> 20595360 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 37737 </td> <td style="text-align:center;"> 172006362 </td> </tr> <tr> <td style="text-align:center;"> Brazil </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 80488 </td> <td style="text-align:center;"> 174504898 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 1999 </td> <td style="text-align:center;"> 212258 </td> <td style="text-align:center;"> 1272915272 </td> </tr> <tr> <td style="text-align:center;"> China </td> <td style="text-align:center;"> 2000 </td> <td style="text-align:center;"> 213766 </td> <td style="text-align:center;"> 1280428583 </td> </tr> </tbody> </table> ] --- ## `tibble()` data frames ![:space 5] > "Tibbles _are_ data frames, but they tweak some older behaviours to make life a little easier." - Hadley Wickham ![:space 5] Differences between `tibbles` and `data.frames` 1. Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. 2. More explicit errors: tibbles are strict. They throw lots of errors, meaning we catch mistakes early. --- class: newsection # Manipulating Data<br> --- ![:space 20] .pull-left[ ![:center_img 85](Figures/dplyr_logo.png) ] .pull-right[ - The `dplyr` package (part of the `tidyverse`) offers an intuitive **_verb based_** approach to data management in `R`. - The goal of the `dplyr` logic is to provide an easy, intuitive naming convention for ubiquitous to data management tasks. ] ![:space 60] <!-- > Note that the tidyverse moves to a `tibble` data.frame, which is a special kind of data.frame class that keeps the best parts of `data.frame` but drops the unnecessary bits, such as printing the entire data object. --> --- ## 6 main `dplyr` verbs ![:space 3] - **`select()`**: Pick variables by their names. - **`filter()`**: Pick observations by their values - **`arrange()`**: Reorder the rows. - **`mutate()`**: Create new variables with functions of existing variables. - **`summarise()`**: Collapse many values down to a single summary. - **`group_by()`**: changes the scope of each function from operating on the entire dataset to operating on it group-by-group. --- class:biglist ## How they work... <br> <br> **All verbs work similarly:** 1. The first argument is a data frame. 2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes). 3. The result is a new data frame. --- To walk through the performance of the main `dplyr` verbs, we'll use an internal dataset called `presidential`. ```r ?presidential # for information on the data ``` ```r dat <- presidential head(dat) ``` ``` ## # A tibble: 6 x 4 ## name start end party ## <chr> <date> <date> <chr> ## 1 Eisenhower 1953-01-20 1961-01-20 Republican ## 2 Kennedy 1961-01-20 1963-11-22 Democratic ## 3 Johnson 1963-11-22 1969-01-20 Democratic ## 4 Nixon 1969-01-20 1974-08-09 Republican ## 5 Ford 1974-08-09 1977-01-20 Republican ## 6 Carter 1977-01-20 1981-01-20 Democratic ``` --- # `select()` .center[<img src="Figures/select.png">] --- ![:space 10] ```r select(dat,name,party) ``` ``` ## # A tibble: 11 x 2 ## name party ## <chr> <chr> ## 1 Eisenhower Republican ## 2 Kennedy Democratic ## 3 Johnson Democratic ## 4 Nixon Republican ## 5 Ford Republican ## 6 Carter Democratic ## 7 Reagan Republican ## 8 Bush Republican ## 9 Clinton Democratic ## 10 Bush Republican ## 11 Obama Democratic ``` --- Or variable ranges using `:` The following will provide all variables in-between `name` and `end`. ```r select(dat,name:end) ``` ``` ## # A tibble: 11 x 3 ## name start end ## <chr> <date> <date> ## 1 Eisenhower 1953-01-20 1961-01-20 ## 2 Kennedy 1961-01-20 1963-11-22 ## 3 Johnson 1963-11-22 1969-01-20 ## 4 Nixon 1969-01-20 1974-08-09 ## 5 Ford 1974-08-09 1977-01-20 ## 6 Carter 1977-01-20 1981-01-20 ## 7 Reagan 1981-01-20 1989-01-20 ## 8 Bush 1989-01-20 1993-01-20 ## 9 Clinton 1993-01-20 2001-01-20 ## 10 Bush 2001-01-20 2009-01-20 ## 11 Obama 2009-01-20 2017-01-20 ``` --- The **order** in which variables are selected will translate to the output. Thus, one can easily **reorder columns** with `select()`. ```r select(dat,name,end,start) ``` ``` ## # A tibble: 11 x 3 ## name end start ## <chr> <date> <date> ## 1 Eisenhower 1961-01-20 1953-01-20 ## 2 Kennedy 1963-11-22 1961-01-20 ## 3 Johnson 1969-01-20 1963-11-22 ## 4 Nixon 1974-08-09 1969-01-20 ## 5 Ford 1977-01-20 1974-08-09 ## 6 Carter 1981-01-20 1977-01-20 ## 7 Reagan 1989-01-20 1981-01-20 ## 8 Bush 1993-01-20 1989-01-20 ## 9 Clinton 2001-01-20 1993-01-20 ## 10 Bush 2009-01-20 2001-01-20 ## 11 Obama 2017-01-20 2009-01-20 ``` --- We can also easily **rename** variables by simply providing a new name within the function. ```r select(dat,president=name, startdate=start, enddate=end) ``` ``` ## # A tibble: 11 x 3 ## president startdate enddate ## <chr> <date> <date> ## 1 Eisenhower 1953-01-20 1961-01-20 ## 2 Kennedy 1961-01-20 1963-11-22 ## 3 Johnson 1963-11-22 1969-01-20 ## 4 Nixon 1969-01-20 1974-08-09 ## 5 Ford 1974-08-09 1977-01-20 ## 6 Carter 1977-01-20 1981-01-20 ## 7 Reagan 1981-01-20 1989-01-20 ## 8 Bush 1989-01-20 1993-01-20 ## 9 Clinton 1993-01-20 2001-01-20 ## 10 Bush 2001-01-20 2009-01-20 ## 11 Obama 2009-01-20 2017-01-20 ``` --- Lastly, `select()` offers us a convenient way to drop variables by using the same logic that we employed with putting a **negative sign** in front of a dimension. The only difference here is that we can do the same but with a variable name. Here we **drop** the `start` date variable. ```r select(dat,-start) ``` ``` ## # A tibble: 11 x 3 ## name end party ## <chr> <date> <chr> ## 1 Eisenhower 1961-01-20 Republican ## 2 Kennedy 1963-11-22 Democratic ## 3 Johnson 1969-01-20 Democratic ## 4 Nixon 1974-08-09 Republican ## 5 Ford 1977-01-20 Republican ## 6 Carter 1981-01-20 Democratic ## 7 Reagan 1989-01-20 Republican ## 8 Bush 1993-01-20 Republican ## 9 Clinton 2001-01-20 Democratic ## 10 Bush 2009-01-20 Republican ## 11 Obama 2017-01-20 Democratic ``` --- ### Other useful `select()` behavior --- ### Other useful `select()` behavior - `contains()` - extract columns with a specific naming convention. ```r select(dat,contains("a")) ``` ``` ## # A tibble: 11 x 3 ## name start party ## <chr> <date> <chr> ## 1 Eisenhower 1953-01-20 Republican ## 2 Kennedy 1961-01-20 Democratic ## 3 Johnson 1963-11-22 Democratic ## 4 Nixon 1969-01-20 Republican ## 5 Ford 1974-08-09 Republican ## 6 Carter 1977-01-20 Democratic ## 7 Reagan 1981-01-20 Republican ## 8 Bush 1989-01-20 Republican ## 9 Clinton 1993-01-20 Democratic ## 10 Bush 2001-01-20 Republican ## 11 Obama 2009-01-20 Democratic ``` --- ### Other useful `select()` behavior - `contains()` - extract columns with a specific naming convention. - `starts_with()` - extract columns that start with a specific naming convention. ```r select(dat,starts_with("s")) ``` ``` ## # A tibble: 11 x 1 ## start ## <date> ## 1 1953-01-20 ## 2 1961-01-20 ## 3 1963-11-22 ## 4 1969-01-20 ## 5 1974-08-09 ## 6 1977-01-20 ## 7 1981-01-20 ## 8 1989-01-20 ## 9 1993-01-20 ## 10 2001-01-20 ## 11 2009-01-20 ``` --- ### Other useful `select()` behavior - `contains()` - extract columns with a specific naming convention. - `starts_with()` - extract columns that start with a specific naming convention. - `ends_with()` - extract columns that end with a specific naming convention. ```r select(dat,ends_with("d")) ``` ``` ## # A tibble: 11 x 1 ## end ## <date> ## 1 1961-01-20 ## 2 1963-11-22 ## 3 1969-01-20 ## 4 1974-08-09 ## 5 1977-01-20 ## 6 1981-01-20 ## 7 1989-01-20 ## 8 1993-01-20 ## 9 2001-01-20 ## 10 2009-01-20 ## 11 2017-01-20 ``` --- ### Other useful `select()` behavior - `contains()` - extract columns with a specific naming convention. - `starts_with()` - extract columns that start with a specific naming convention. - `ends_with()` - extract columns that end with a specific naming convention. - `everything()` - extract every remaining column not yet stated in the selection. ```r select(dat,start,end,everything()) ``` ``` ## # A tibble: 11 x 4 ## start end name party ## <date> <date> <chr> <chr> ## 1 1953-01-20 1961-01-20 Eisenhower Republican ## 2 1961-01-20 1963-11-22 Kennedy Democratic ## 3 1963-11-22 1969-01-20 Johnson Democratic ## 4 1969-01-20 1974-08-09 Nixon Republican ## 5 1974-08-09 1977-01-20 Ford Republican ## 6 1977-01-20 1981-01-20 Carter Democratic ## 7 1981-01-20 1989-01-20 Reagan Republican ## 8 1989-01-20 1993-01-20 Bush Republican ## 9 1993-01-20 2001-01-20 Clinton Democratic ## 10 2001-01-20 2009-01-20 Bush Republican ## 11 2009-01-20 2017-01-20 Obama Democratic ``` --- ### Other useful `select()` behavior - `contains()` - extract columns with a specific naming convention. - `starts_with()` - extract columns that start with a specific naming convention. - `ends_with()` - extract columns that end with a specific naming convention. - `everything()` - extract every remaining column not yet stated in the selection. - `matches()` - extract columns using a regular expression. ```r select(dat,matches("^s")) ``` ``` ## # A tibble: 11 x 1 ## start ## <date> ## 1 1953-01-20 ## 2 1961-01-20 ## 3 1963-11-22 ## 4 1969-01-20 ## 5 1974-08-09 ## 6 1977-01-20 ## 7 1981-01-20 ## 8 1989-01-20 ## 9 1993-01-20 ## 10 2001-01-20 ## 11 2009-01-20 ``` --- ## `filter()` ![:space 3] .center[<img src="Figures/filter.png">] --- ![:space 10] ```r filter(dat,party == "Republican") ``` ``` ## # A tibble: 6 x 4 ## name start end party ## <chr> <date> <date> <chr> ## 1 Eisenhower 1953-01-20 1961-01-20 Republican ## 2 Nixon 1969-01-20 1974-08-09 Republican ## 3 Ford 1974-08-09 1977-01-20 Republican ## 4 Reagan 1981-01-20 1989-01-20 Republican ## 5 Bush 1989-01-20 1993-01-20 Republican ## 6 Bush 2001-01-20 2009-01-20 Republican ``` --- ## `arrange()` ![:space 3] .center[<img src="Figures/arrange.png">] --- ![:space 10] ```r arrange(dat,party) ``` ``` ## # A tibble: 11 x 4 ## name start end party ## <chr> <date> <date> <chr> ## 1 Kennedy 1961-01-20 1963-11-22 Democratic ## 2 Johnson 1963-11-22 1969-01-20 Democratic ## 3 Carter 1977-01-20 1981-01-20 Democratic ## 4 Clinton 1993-01-20 2001-01-20 Democratic ## 5 Obama 2009-01-20 2017-01-20 Democratic ## 6 Eisenhower 1953-01-20 1961-01-20 Republican ## 7 Nixon 1969-01-20 1974-08-09 Republican ## 8 Ford 1974-08-09 1977-01-20 Republican ## 9 Reagan 1981-01-20 1989-01-20 Republican ## 10 Bush 1989-01-20 1993-01-20 Republican ## 11 Bush 2001-01-20 2009-01-20 Republican ``` --- `arrange()` with the internal function `desc()` can change to a **descending** ordering. ```r arrange(dat,desc(start)) ``` ``` ## # A tibble: 11 x 4 ## name start end party ## <chr> <date> <date> <chr> ## 1 Obama 2009-01-20 2017-01-20 Democratic ## 2 Bush 2001-01-20 2009-01-20 Republican ## 3 Clinton 1993-01-20 2001-01-20 Democratic ## 4 Bush 1989-01-20 1993-01-20 Republican ## 5 Reagan 1981-01-20 1989-01-20 Republican ## 6 Carter 1977-01-20 1981-01-20 Democratic ## 7 Ford 1974-08-09 1977-01-20 Republican ## 8 Nixon 1969-01-20 1974-08-09 Republican ## 9 Johnson 1963-11-22 1969-01-20 Democratic ## 10 Kennedy 1961-01-20 1963-11-22 Democratic ## 11 Eisenhower 1953-01-20 1961-01-20 Republican ``` --- ## `mutate()` ![:space 3] .center[<img src="Figures/mutate.png">] --- ![:space 10] ```r mutate(dat, # in office during cold war CW = start <= '1990-03-11') ``` ``` ## # A tibble: 11 x 5 ## name start end party CW ## <chr> <date> <date> <chr> <lgl> ## 1 Eisenhower 1953-01-20 1961-01-20 Republican TRUE ## 2 Kennedy 1961-01-20 1963-11-22 Democratic TRUE ## 3 Johnson 1963-11-22 1969-01-20 Democratic TRUE ## 4 Nixon 1969-01-20 1974-08-09 Republican TRUE ## 5 Ford 1974-08-09 1977-01-20 Republican TRUE ## 6 Carter 1977-01-20 1981-01-20 Democratic TRUE ## 7 Reagan 1981-01-20 1989-01-20 Republican TRUE ## 8 Bush 1989-01-20 1993-01-20 Republican TRUE ## 9 Clinton 1993-01-20 2001-01-20 Democratic FALSE ## 10 Bush 2001-01-20 2009-01-20 Republican FALSE ## 11 Obama 2009-01-20 2017-01-20 Democratic FALSE ``` --- `mutate()` also allows us to **_instantly_** use variables we just created. ```r mutate(dat, CW = start <= '1990-03-11', CW = as.numeric(CW)) ``` ``` ## # A tibble: 11 x 5 ## name start end party CW ## <chr> <date> <date> <chr> <dbl> ## 1 Eisenhower 1953-01-20 1961-01-20 Republican 1 ## 2 Kennedy 1961-01-20 1963-11-22 Democratic 1 ## 3 Johnson 1963-11-22 1969-01-20 Democratic 1 ## 4 Nixon 1969-01-20 1974-08-09 Republican 1 ## 5 Ford 1974-08-09 1977-01-20 Republican 1 ## 6 Carter 1977-01-20 1981-01-20 Democratic 1 ## 7 Reagan 1981-01-20 1989-01-20 Republican 1 ## 8 Bush 1989-01-20 1993-01-20 Republican 1 ## 9 Clinton 1993-01-20 2001-01-20 Democratic 0 ## 10 Bush 2001-01-20 2009-01-20 Republican 0 ## 11 Obama 2009-01-20 2017-01-20 Democratic 0 ``` --- Like `mutate()`, `transmute()` provides a method for generating a new variable, but unlike the former, it **returns only the newly created variable**. ```r transmute(dat,CW = start <= '1990-03-11') ``` ``` ## # A tibble: 11 x 1 ## CW ## <lgl> ## 1 TRUE ## 2 TRUE ## 3 TRUE ## 4 TRUE ## 5 TRUE ## 6 TRUE ## 7 TRUE ## 8 TRUE ## 9 FALSE ## 10 FALSE ## 11 FALSE ``` --- ## `summarize()` ![:space 15] ```r summarize(dat, days_in_office = mean(end-start), max = max(end-start), min = min(end-start)) ``` ``` ## # A tibble: 1 x 3 ## days_in_office max min ## <drtn> <drtn> <drtn> ## 1 2125.091 days 2922 days 895 days ``` --- ![:space 10] There are a number of internal functions that can be used with `mutate()`, `transmute()`, and `summarize()`. - `n()` counts the number of observations - `n_distinct()` counts the number of distinct entries ![:space 10] ```r summarize(dat,N=n(),N_party=n_distinct(party)) ``` ``` ## # A tibble: 1 x 2 ## N N_party ## <int> <int> ## 1 11 2 ``` --- ## `group_by()` ![:space 3] .center[<img src="Figures/group_by.png">] --- ![:space 10] When used in conjunction with some of the other functions, `group_by()` becomes a powerful to perform by cluster/unit/group operations. ![:space 5] ```r # group by party x <- group_by(dat,party) summarize(x,min_in_office = min(end-start)) ``` ``` ## # A tibble: 2 x 2 ## party min_in_office ## <chr> <drtn> ## 1 Democratic 1036 days ## 2 Republican 895 days ``` --- ### Other useful `tidyverse` functions - `count()`/`add_count()` ```r count(dat, party) ``` ``` ## # A tibble: 2 x 2 ## party n ## <chr> <int> ## 1 Democratic 5 ## 2 Republican 6 ``` ```r add_count(dat, party) ``` ``` ## # A tibble: 11 x 5 ## name start end party n ## <chr> <date> <date> <chr> <int> ## 1 Eisenhower 1953-01-20 1961-01-20 Republican 6 ## 2 Kennedy 1961-01-20 1963-11-22 Democratic 5 ## 3 Johnson 1963-11-22 1969-01-20 Democratic 5 ## 4 Nixon 1969-01-20 1974-08-09 Republican 6 ## 5 Ford 1974-08-09 1977-01-20 Republican 6 ## 6 Carter 1977-01-20 1981-01-20 Democratic 5 ## 7 Reagan 1981-01-20 1989-01-20 Republican 6 ## 8 Bush 1989-01-20 1993-01-20 Republican 6 ## 9 Clinton 1993-01-20 2001-01-20 Democratic 5 ## 10 Bush 2001-01-20 2009-01-20 Republican 6 ## 11 Obama 2009-01-20 2017-01-20 Democratic 5 ``` --- ### Other useful `tidyverse` functions - `count()`/`add_count()` - `recode()` ```r mutate(dat,party = recode(party,'Republican'=1,'Democratic'=0)) ``` ``` ## # A tibble: 11 x 4 ## name start end party ## <chr> <date> <date> <dbl> ## 1 Eisenhower 1953-01-20 1961-01-20 1 ## 2 Kennedy 1961-01-20 1963-11-22 0 ## 3 Johnson 1963-11-22 1969-01-20 0 ## 4 Nixon 1969-01-20 1974-08-09 1 ## 5 Ford 1974-08-09 1977-01-20 1 ## 6 Carter 1977-01-20 1981-01-20 0 ## 7 Reagan 1981-01-20 1989-01-20 1 ## 8 Bush 1989-01-20 1993-01-20 1 ## 9 Clinton 1993-01-20 2001-01-20 0 ## 10 Bush 2001-01-20 2009-01-20 1 ## 11 Obama 2009-01-20 2017-01-20 0 ``` --- ### Other useful `tidyverse` functions - `count()`/`add_count()` - `recode()` - `pull()` ```r pull(dat,party) ``` ``` ## [1] "Republican" "Democratic" "Democratic" "Republican" "Republican" ## [6] "Democratic" "Republican" "Republican" "Democratic" "Republican" ## [11] "Democratic" ``` --- ### Other useful `tidyverse` functions - `count()`/`add_count()` - `recode()` - `pull()` - `glimpse()` ```r glimpse(dat) ``` ``` ## Rows: 11 ## Columns: 4 ## $ name <chr> "Eisenhower", "Kennedy", "Johnson", "Nixon", "Ford", "Carter", … ## $ start <date> 1953-01-20, 1961-01-20, 1963-11-22, 1969-01-20, 1974-08-09, 19… ## $ end <date> 1961-01-20, 1963-11-22, 1969-01-20, 1974-08-09, 1977-01-20, 19… ## $ party <chr> "Republican", "Democratic", "Democratic", "Republican", "Republ… ``` --- ### Other useful `tidyverse` functions - `count()`/`add_count()` - `recode()` - `pull()` - `glimpse()` - `slice()` ```r slice(dat,2:3) ``` ``` ## # A tibble: 2 x 4 ## name start end party ## <chr> <date> <date> <chr> ## 1 Kennedy 1961-01-20 1963-11-22 Democratic ## 2 Johnson 1963-11-22 1969-01-20 Democratic ``` --- ### Other useful `tidyverse` functions - `sample_n()`/`sample_frac()` ```r sample_n(dat,2) ``` ``` ## # A tibble: 2 x 4 ## name start end party ## <chr> <date> <date> <chr> ## 1 Kennedy 1961-01-20 1963-11-22 Democratic ## 2 Nixon 1969-01-20 1974-08-09 Republican ``` ```r sample_frac(dat,.25) ``` ``` ## # A tibble: 3 x 4 ## name start end party ## <chr> <date> <date> <chr> ## 1 Obama 2009-01-20 2017-01-20 Democratic ## 2 Reagan 1981-01-20 1989-01-20 Republican ## 3 Nixon 1969-01-20 1974-08-09 Republican ``` --- ### Other useful `tidyverse` functions - `sample_n()`/`sample_frac()` - `case_when()` ```r mutate(dat, pres_type = case_when( (end - start) == 1461 ~ "one-term pres.", (end - start) == 2922 ~ "two-term pres.", TRUE ~ "special case" ) ) ``` ``` ## # A tibble: 11 x 5 ## name start end party pres_type ## <chr> <date> <date> <chr> <chr> ## 1 Eisenhower 1953-01-20 1961-01-20 Republican two-term pres. ## 2 Kennedy 1961-01-20 1963-11-22 Democratic special case ## 3 Johnson 1963-11-22 1969-01-20 Democratic special case ## 4 Nixon 1969-01-20 1974-08-09 Republican special case ## 5 Ford 1974-08-09 1977-01-20 Republican special case ## 6 Carter 1977-01-20 1981-01-20 Democratic one-term pres. ## 7 Reagan 1981-01-20 1989-01-20 Republican two-term pres. ## 8 Bush 1989-01-20 1993-01-20 Republican one-term pres. ## 9 Clinton 1993-01-20 2001-01-20 Democratic two-term pres. ## 10 Bush 2001-01-20 2009-01-20 Republican two-term pres. ## 11 Obama 2009-01-20 2017-01-20 Democratic two-term pres. ``` --- ### Other useful `tidyverse` functions - `sample_n()`/`sample_frac()` - `case_when()` - `lag()`/`lead()` ```r mutate(dat, predecessor = lag(name,n=1,order_by=start) ) ``` ``` ## # A tibble: 11 x 5 ## name start end party predecessor ## <chr> <date> <date> <chr> <chr> ## 1 Eisenhower 1953-01-20 1961-01-20 Republican <NA> ## 2 Kennedy 1961-01-20 1963-11-22 Democratic Eisenhower ## 3 Johnson 1963-11-22 1969-01-20 Democratic Kennedy ## 4 Nixon 1969-01-20 1974-08-09 Republican Johnson ## 5 Ford 1974-08-09 1977-01-20 Republican Nixon ## 6 Carter 1977-01-20 1981-01-20 Democratic Ford ## 7 Reagan 1981-01-20 1989-01-20 Republican Carter ## 8 Bush 1989-01-20 1993-01-20 Republican Reagan ## 9 Clinton 1993-01-20 2001-01-20 Democratic Bush ## 10 Bush 2001-01-20 2009-01-20 Republican Clinton ## 11 Obama 2009-01-20 2017-01-20 Democratic Bush ``` --- ### Other useful `tidyverse` functions - `sample_n()`/`sample_frac()` - `case_when()` - `lag()`/`lead()` - `row_number()` ```r mutate(dat,id = row_number()) ``` ``` ## # A tibble: 11 x 5 ## name start end party id ## <chr> <date> <date> <chr> <int> ## 1 Eisenhower 1953-01-20 1961-01-20 Republican 1 ## 2 Kennedy 1961-01-20 1963-11-22 Democratic 2 ## 3 Johnson 1963-11-22 1969-01-20 Democratic 3 ## 4 Nixon 1969-01-20 1974-08-09 Republican 4 ## 5 Ford 1974-08-09 1977-01-20 Republican 5 ## 6 Carter 1977-01-20 1981-01-20 Democratic 6 ## 7 Reagan 1981-01-20 1989-01-20 Republican 7 ## 8 Bush 1989-01-20 1993-01-20 Republican 8 ## 9 Clinton 1993-01-20 2001-01-20 Democratic 9 ## 10 Bush 2001-01-20 2009-01-20 Republican 10 ## 11 Obama 2009-01-20 2017-01-20 Democratic 11 ``` --- ### Other useful `tidyverse` functions - `sample_n()`/`sample_frac()` - `case_when()` - `lag()`/`lead()` - `row_number()` - `rowname_to_column()`/`column_to_rownames()` ```r rownames_to_column(dat,'id') ``` ``` ## # A tibble: 11 x 5 ## id name start end party ## <chr> <chr> <date> <date> <chr> ## 1 1 Eisenhower 1953-01-20 1961-01-20 Republican ## 2 2 Kennedy 1961-01-20 1963-11-22 Democratic ## 3 3 Johnson 1963-11-22 1969-01-20 Democratic ## 4 4 Nixon 1969-01-20 1974-08-09 Republican ## 5 5 Ford 1974-08-09 1977-01-20 Republican ## 6 6 Carter 1977-01-20 1981-01-20 Democratic ## 7 7 Reagan 1981-01-20 1989-01-20 Republican ## 8 8 Bush 1989-01-20 1993-01-20 Republican ## 9 9 Clinton 1993-01-20 2001-01-20 Democratic ## 10 10 Bush 2001-01-20 2009-01-20 Republican ## 11 11 Obama 2009-01-20 2017-01-20 Democratic ``` --- ### Other useful `tidyverse` functions - `sample_n()`/`sample_frac()` - `case_when()` - `lag()`/`lead()` - `row_number()` - `rowname_to_column()`/`column_to_rownames()` ```r column_to_rownames(head(dat,3),var="name") ``` ``` ## start end party ## Eisenhower 1953-01-20 1961-01-20 Republican ## Kennedy 1961-01-20 1963-11-22 Democratic ## Johnson 1963-11-22 1969-01-20 Democratic ``` --- ### Cheating is allowed ![:space 10] [**Download a cheatsheet**](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) to containing an entire list of the tidyverse `dplyr` verbs. ![:space 5] .center[OR] ![:space 5] Access any cheatsheet for the RStudio ecosystem of packages directly from the IDE. .center[`Help > Cheatsheets`] --- class:newsection # Piping --- ## Combining `dplyr` functions When we need to do a series of manipulations, we can **perform each manipulation individually and save each entry as a new object** that we write over. ```r x <- filter(presidential,party=='Republican') x <- group_by(x,name) x <- transmute(x,t_in_office = end-start) x <- arrange(x,t_in_office) x ``` ``` ## # A tibble: 6 x 2 ## # Groups: name [5] ## name t_in_office ## <chr> <drtn> ## 1 Ford 895 days ## 2 Bush 1461 days ## 3 Nixon 2027 days ## 4 Eisenhower 2922 days ## 5 Reagan 2922 days ## 6 Bush 2922 days ``` --- Or we can **nest** functions _within_ each other. ```r arrange( transmute( group_by( filter(presidential,party=='Republican'),name), t_in_office = end-start), t_in_office) ``` ``` ## # A tibble: 6 x 2 ## # Groups: name [5] ## name t_in_office ## <chr> <drtn> ## 1 Ford 895 days ## 2 Bush 1461 days ## 3 Nixon 2027 days ## 4 Eisenhower 2922 days ## 5 Reagan 2922 days ## 6 Bush 2922 days ``` -- The issue with **nesting functions** is that it is **![:text_color darkred](difficult to [1] read and [2] detect a mistake!)** --- ### Piping Functions The **pipe** is a useful tool that allows us to **pass** output from one function to the next. To pipe we write **`%>%`** _in-between_ each function. .center[`data %>% function1() %>% function2()`] We pass the output to specific locations in the proceeding function using the pointer `.` .center[`data %>% function1(arg = .)`] Piping offers is a clean way of manipulating data that is **intuitive and easy to read**. --- Here we **pass** our `data` to `filter()` then to `group_by()` then to `transmute()` and then finally to `arrange()` which returns our output! ```r presidential %>% filter(party=='Republican') %>% group_by(name) %>% transmute(t_in_office = end-start) %>% arrange(t_in_office) ``` ``` ## # A tibble: 6 x 2 ## # Groups: name [5] ## name t_in_office ## <chr> <drtn> ## 1 Ford 895 days ## 2 Bush 1461 days ## 3 Nixon 2027 days ## 4 Eisenhower 2922 days ## 5 Reagan 2922 days ## 6 Bush 2922 days ``` --- ### Two things to keep in mind when piping... ![:space 15] 1. Functions **_must_** be linked with `%>%` 2. When functions have **multiple arguments**, point to where the data should go with a period (`.`) ```r data %>% function(arg1= ., arg2=TRUE) ``` --- class:newsection # Joining Data --- ## `_join` functions (dplyr) ![:space 5] `dplyr` offers a range of joining/merging functions that are more intuitive to use. These functions provide a **SQL framework** that is easier to read and more efficient. - `left_join()` - `right_join()` - `inner_join()` - `full_join()` - `anti_join()` When joining data, you must have a **unique** identifier on the dimension you're matching on. --- ![:space 10] Consider the following two example datasets... ```r data_A ``` ``` ## country Var1 ## 1 Nigeria 4 ## 2 England 3 ## 3 Botswana 6 ``` ```r data_B ``` ``` ## country Var2 ## 1 Nigeria Low ## 2 United States High ## 3 Botswana Medium ``` --- ## `left_join()` .center[<img src="Figures/left_join.png", height=100px>] .center[<img src="Figures/left-join.gif", height=400px>] --- ## `left_join()` .center[<img src="Figures/left_join.png", height=100px>] ![:space 10] ```r left_join(data_A,data_B,by="country") ``` ``` ## country Var1 Var2 ## 1 Nigeria 4 Low ## 2 England 3 <NA> ## 3 Botswana 6 Medium ``` --- ## `right_join()` .center[<img src="Figures/right_join.png", height=100px>] .center[<img src="Figures/right-join.gif", height=400px>] --- ## `right_join()` .center[<img src="Figures/right_join.png", height=100px>] ![:space 10] ```r right_join(data_A,data_B,by="country") ``` ``` ## country Var1 Var2 ## 1 Nigeria 4 Low ## 2 Botswana 6 Medium ## 3 United States NA High ``` --- ## `inner_join()` .center[<img src="Figures/inner_join.png", height=100px>] .center[<img src="Figures/inner-join.gif", height=400px>] --- ## `inner_join()` .center[<img src="Figures/inner_join.png", height=100px>] ![:space 10] ```r inner_join(data_A,data_B,by="country") ``` ``` ## country Var1 Var2 ## 1 Nigeria 4 Low ## 2 Botswana 6 Medium ``` --- ### `full_join()` .center[<img src="Figures/full_join.png", height=100px>] .center[<img src="Figures/full-join.gif", height=400px>] --- ### `full_join()` <center> <img src="Figures/full_join.png", height=100px> </center> ![:space 10] ```r full_join(data_A,data_B,by="country") ``` ``` ## country Var1 Var2 ## 1 Nigeria 4 Low ## 2 England 3 <NA> ## 3 Botswana 6 Medium ## 4 United States NA High ``` --- ### `anti_join()` .center[<img src="Figures/anti_join_left.png", height=100px>] .center[<img src="Figures/anti-join.gif", height=400px>] --- ### `anti_join()` .center[<img src="Figures/anti_join_left.png", height=100px>] ![:space 10] ```r anti_join(data_A,data_B,by="country") ``` ``` ## country Var1 ## 1 England 3 ``` --- ## `bind_rows()` <center> <img src="Figures/rbind.png"> </center> ![:space 5] ```r bind_rows(data_A,data_B) ``` ``` ## country Var1 Var2 ## 1 Nigeria 4 <NA> ## 2 England 3 <NA> ## 3 Botswana 6 <NA> ## 4 Nigeria NA Low ## 5 United States NA High ## 6 Botswana NA Medium ``` --- ## `bind_cols()` <center> <img src="Figures/cbind.png"> </center> ![:space 5] ```r bind_cols(data_A,data_B) ``` ``` ## country...1 Var1 country...3 Var2 ## 1 Nigeria 4 Nigeria Low ## 2 England 3 United States High ## 3 Botswana 6 Botswana Medium ``` --- ## Disparate column names ![:space 10] Sometimes the naming conventions of two datasets don't perfectly align. When this happens, we can specify how data merges onto one another more explicitly using the `by=` argument. Moreover, we can merge on **_more_ than one dimension** by specifying all relevant column names. --- ![:space 10] Once again, consider the following example data.. ```r data_A ``` ``` ## country year Var1 ## 1 Nigeria 1999 4 ## 2 England 2001 3 ## 3 Botswana 2000 6 ``` ```r data_B ``` ``` ## country_name year Var2 ## 1 Nigeria 1999 Low ## 2 United States 2004 High ## 3 Botswana 2003 Medium ``` --- ![:space 20] ```r full_join(data_A,data_B, by=c('country'='country_name', 'year')) ``` ``` ## country year Var1 Var2 ## 1 Nigeria 1999 4 Low ## 2 England 2001 3 <NA> ## 3 Botswana 2000 6 <NA> ## 4 United States 2004 NA High ## 5 Botswana 2003 NA Medium ``` --- class:newsection # Reshaping Data --- ![:space 10] Often, we need to alter the structure of a `data.frame` from a **wide format**... ![:space 10] .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> 1992 </th> <th style="text-align:center;"> 1993 </th> <th style="text-align:center;"> 1994 </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Nigeria </td> <td style="text-align:center;"> 9.72 </td> <td style="text-align:center;"> 10.06 </td> <td style="text-align:center;"> 9.66 </td> </tr> <tr> <td style="text-align:center;"> Iran </td> <td style="text-align:center;"> 9.88 </td> <td style="text-align:center;"> 10.86 </td> <td style="text-align:center;"> 9.78 </td> </tr> <tr> <td style="text-align:center;"> Cambodia </td> <td style="text-align:center;"> 10.78 </td> <td style="text-align:center;"> 10.23 </td> <td style="text-align:center;"> 10.61 </td> </tr> <tr> <td style="text-align:center;"> Australia </td> <td style="text-align:center;"> 10.04 </td> <td style="text-align:center;"> 9.37 </td> <td style="text-align:center;"> 10.18 </td> </tr> </tbody> </table> ] --- ...into a **long format** .center[ <table> <thead> <tr> <th style="text-align:center;"> country </th> <th style="text-align:center;"> year </th> <th style="text-align:center;"> var </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> Nigeria </td> <td style="text-align:center;"> 1992 </td> <td style="text-align:center;"> 9.72 </td> </tr> <tr> <td style="text-align:center;"> Nigeria </td> <td style="text-align:center;"> 1993 </td> <td style="text-align:center;"> 10.06 </td> </tr> <tr> <td style="text-align:center;"> Nigeria </td> <td style="text-align:center;"> 1994 </td> <td style="text-align:center;"> 9.66 </td> </tr> <tr> <td style="text-align:center;"> Iran </td> <td style="text-align:center;"> 1992 </td> <td style="text-align:center;"> 9.88 </td> </tr> <tr> <td style="text-align:center;"> Iran </td> <td style="text-align:center;"> 1993 </td> <td style="text-align:center;"> 10.86 </td> </tr> <tr> <td style="text-align:center;"> Iran </td> <td style="text-align:center;"> 1994 </td> <td style="text-align:center;"> 9.78 </td> </tr> <tr> <td style="text-align:center;"> Cambodia </td> <td style="text-align:center;"> 1992 </td> <td style="text-align:center;"> 10.78 </td> </tr> <tr> <td style="text-align:center;"> Cambodia </td> <td style="text-align:center;"> 1993 </td> <td style="text-align:center;"> 10.23 </td> </tr> <tr> <td style="text-align:center;"> Cambodia </td> <td style="text-align:center;"> 1994 </td> <td style="text-align:center;"> 10.61 </td> </tr> <tr> <td style="text-align:center;"> Australia </td> <td style="text-align:center;"> 1992 </td> <td style="text-align:center;"> 10.04 </td> </tr> <tr> <td style="text-align:center;"> Australia </td> <td style="text-align:center;"> 1993 </td> <td style="text-align:center;"> 9.37 </td> </tr> <tr> <td style="text-align:center;"> Australia </td> <td style="text-align:center;"> 1994 </td> <td style="text-align:center;"> 10.18 </td> </tr> </tbody> </table> ] --- ![:space 20] .pull-left[ ![:center_img 85](Figures/tidyr_logo.png) ] .pull-right[ `tidyr` is a tidyverse package built to help reshape data. The package contains an array of functions that are all useful cleaning a data construct. `tidyr` eases tasks such as: - dropping missing values - filling missing values - separating a column into two variables or uniting two columns into one ] --- ### `pivot_longer()`: from wide-to-long ![:space 15] ![:center_img 100](Figures/gather.png) --- ### `pivot_longer()`: from wide-to-long ![:space 10] ```r dat ``` ``` ## country 1992 1993 1994 ## 1 Nigeria 9.72 10.06 9.66 ## 2 Iran 9.88 10.86 9.78 ## 3 Cambodia 10.78 10.23 10.61 ## 4 Australia 10.04 9.37 10.18 ``` --- ### `pivot_longer()`: from wide-to-long Specify the columns you want to expand ```r dat %>% pivot_longer(cols="1992":"1994") ``` ``` ## # A tibble: 12 x 3 ## country name value ## <fct> <chr> <dbl> ## 1 Nigeria 1992 9.72 ## 2 Nigeria 1993 10.1 ## 3 Nigeria 1994 9.66 ## 4 Iran 1992 9.88 ## 5 Iran 1993 10.9 ## 6 Iran 1994 9.78 ## 7 Cambodia 1992 10.8 ## 8 Cambodia 1993 10.2 ## 9 Cambodia 1994 10.6 ## 10 Australia 1992 10.0 ## 11 Australia 1993 9.37 ## 12 Australia 1994 10.2 ``` --- ### `pivot_longer()`: from wide-to-long Rename the expanded columns ```r dat %>% pivot_longer(cols="1992":"1994", names_to = "year", values_to = "ln_gdppc") ``` ``` ## # A tibble: 12 x 3 ## country year ln_gdppc ## <fct> <chr> <dbl> ## 1 Nigeria 1992 9.72 ## 2 Nigeria 1993 10.1 ## 3 Nigeria 1994 9.66 ## 4 Iran 1992 9.88 ## 5 Iran 1993 10.9 ## 6 Iran 1994 9.78 ## 7 Cambodia 1992 10.8 ## 8 Cambodia 1993 10.2 ## 9 Cambodia 1994 10.6 ## 10 Australia 1992 10.0 ## 11 Australia 1993 9.37 ## 12 Australia 1994 10.2 ``` --- ### `pivot_longer()`: from wide-to-long ![:space 10] Be selective about which columns are expanded and which are not. ```r # variables can be excluded from the reshape dat %>% pivot_longer(cols="1992", names_to = "year", values_to = "ln_gdppc") ``` ``` ## # A tibble: 4 x 5 ## country `1993` `1994` year ln_gdppc ## <fct> <dbl> <dbl> <chr> <dbl> ## 1 Nigeria 10.1 9.66 1992 9.72 ## 2 Iran 10.9 9.78 1992 9.88 ## 3 Cambodia 10.2 10.6 1992 10.8 ## 4 Australia 9.37 10.2 1992 10.0 ``` --- ## `pivot_wider()`: from long-to-wide ![:space 10] .center[<img src ="Figures/spread.png">] --- ## `pivot_wider()`: from long-to-wide ![:space 10] ```r dat ``` ``` ## country year ln_gdppc ## 1 Nigeria 1992 9.72 ## 3 Cambodia 1992 10.78 ## 4 Australia 1992 10.04 ## 5 Nigeria 1993 10.06 ## 6 Iran 1993 10.86 ## 8 Australia 1993 9.37 ## 9 Nigeria 1994 9.66 ## 10 Iran 1994 9.78 ## 11 Cambodia 1994 10.61 ## 12 Australia 1994 10.18 ``` --- ## `pivot_wider()`: from long-to-wide Main arguments: - `names_from`: name of the variable that will be spread out into columns. - `values_from`: values that will populate the cells of each column ![:space 3] ```r dat %>% pivot_wider(names_from = year, values_from = ln_gdppc) ``` ``` ## # A tibble: 4 x 4 ## country `1992` `1993` `1994` ## <fct> <dbl> <dbl> <dbl> ## 1 Nigeria 9.72 10.1 9.66 ## 2 Cambodia 10.8 NA 10.6 ## 3 Australia 10.0 9.37 10.2 ## 4 Iran NA 10.9 9.78 ``` --- ## `pivot_wider()`: from long-to-wide Main arguments: - `names_from`: name of the variable that will be spread out into columns. - `values_from`: values that will populate the cells of each column. - `values_fill`: specify what value a missing value should take on. ```r dat %>% pivot_wider(names_from = year, values_from = ln_gdppc, values_fill = list(ln_gdppc = -99)) ``` ``` ## # A tibble: 4 x 4 ## country `1992` `1993` `1994` ## <fct> <dbl> <dbl> <dbl> ## 1 Nigeria 9.72 10.1 9.66 ## 2 Cambodia 10.8 -99 10.6 ## 3 Australia 10.0 9.37 10.2 ## 4 Iran -99 10.9 9.78 ``` --- ## Legacy code ![:space 5] Note that `pivot_wider()` and `pivot_longer()` are relatively new wrappers for the older `gather()` and `spread()` functions. ![:space 5] What does this mean for you? - Answers online and in the reading on how to reshape data may differ depending on the preference of the source. - Use whatever versions of the function that work best for you. - Many find that `pivot_wider()` and `pivot_longer()` are easier to use than the older `gather()` and `spread()` functions.