PPOL670 | Introduction to Data Science for Public Policy Week 4 Data Wrangling

# <font class = "title-panel"> PPOL670 | Introduction to Data Science for Public Policy </font> <font size=6, face="bold"> Week 4 </font> <br> <br> <font size=100, face="bold"> Data Wrangling </font>
### <font class = "title-footer">  Prof. Eric Dunford  ◆  Georgetown University  ◆  McCourt School of Public Policy  ◆  <a href="mailto:eric.dunford@georgetown.edu" class="email">eric.dunford@georgetown.edu</a></font>

---

<div class="slide-footer"><span> 
PPOL670 | Introduction to Data Science for Public Policy

&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;

Week 4

&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;

Data Wrangling

</span></div>

---
class: outline

# Outline for Today

![:space 5]

90% of data science is **data wrangling**, which involves...

- **manipulating**

- **piping**

- **joining**

- **reshaping**

...data!

---

# Data Wrangling

---

# What is data wrangling?

![:space 5]

- **raw &rarr; processed**: the process of transforming data from one format to another.

- **converting the structure** to facilitate some analysis
  + **_altering the unit of analysis_**: going from individuals in a state in a given year to state-year by
  
  + changing from a **_"wide"_** (many columns, few rows) **_to a "long" structure_** (few colums, many rows)
  
  + **_summarizing data_** across specific subgroups

---

## Thinking about Data Generation

- **Data rarely works "out of the box"**

+ Constructed for a different purpose than what you're using it for
  
  + Requires adding additional information:
      - merging other data into it
      - creating new variables from old ones
      - changing the unit of analysis
      
--

- **Codebooks offer clues, but not the story**

+ If a codebook exists, it can outline the data generation process, variable names, issues with the data, etc. 
  
  + Codebooks can't won't reveal underlying issues or biases in the data. locating these requires exploration through summarization, visualization, and reshaping the data structure.

---

## Thinking about Data Generation
 
 ![:space 5]
 
- **Data Provenance**

+ Data changes hands, edits are made, variables/observations are accidentally dropped. Without probing, one might overlook flaws.

- **Automated Data**

+ Scraping data can ease collection, but difficult to know where/if the process broke down. 
  
  + Manipulating data can reveal algorithmic failure/limitations. 
  
  + Never assume "everything worked"
 
---

## `tidyverse` approach

Throughout this course, we are going to cover the basics of data manipulation and visualization in `R`. By focusing on a suite of packages known as the "[tidyverse](https://www.tidyverse.org)".

These packages were designed to ease the process of data manipulation and management so that it is more intuitive, efficient, and interpretable.

Specifically, the `tidyverse` is a housing package that holds the following packages:
- [readr](http://readr.tidyverse.org/) - for reading data in 
- [tibble](https://tibble.tidyverse.org/) - for "tidy" data structures
- [dplyr](http://dplyr.tidyverse.org/) - for data manipulation
- [ggplot2](http://ggplot2.tidyverse.org/) - for data visualization
- [tidyr](http://tidyr.tidyverse.org/) - for cleaning
- [purrr](http://purrr.tidyverse.org/) - functional programming toolkit

---

## `tidyverse` advantage

![:space 15]

Most of everything we will do will require in data manipulation just **one** package

```r
# Install tidyverse package
install.packages('tidyverse')

# Load the package
require(tidyverse)
```
  
---

class:newsection

# Tidy Data

---

The same data can be represented in many different ways...

![:space 15]

.center[
<table>
 <thead>
  <tr>
   <th style="text-align:center;"> country </th>
   <th style="text-align:center;"> year </th>
   <th style="text-align:center;"> cases </th>
   <th style="text-align:center;"> population </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Afghanistan </td>
   <td style="text-align:center;"> 1999 </td>
   <td style="text-align:center;"> 745 </td>
   <td style="text-align:center;"> 19987071 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Afghanistan </td>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> 2666 </td>
   <td style="text-align:center;"> 20595360 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Brazil </td>
   <td style="text-align:center;"> 1999 </td>
   <td style="text-align:center;"> 37737 </td>
   <td style="text-align:center;"> 172006362 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Brazil </td>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> 80488 </td>
   <td style="text-align:center;"> 174504898 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> China </td>
   <td style="text-align:center;"> 1999 </td>
   <td style="text-align:center;"> 212258 </td>
   <td style="text-align:center;"> 1272915272 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> China </td>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> 213766 </td>
   <td style="text-align:center;"> 1280428583 </td>
  </tr>
</tbody>
</table>
]

---

<br><br>
.center[
<table>
 <thead>
  <tr>
   <th style="text-align:center;"> country </th>
   <th style="text-align:center;"> year </th>
   <th style="text-align:center;"> type </th>
   <th style="text-align:center;"> count </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Afghanistan </td>
   <td style="text-align:center;"> 1999 </td>
   <td style="text-align:center;"> cases </td>
   <td style="text-align:center;"> 745 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Afghanistan </td>
   <td style="text-align:center;"> 1999 </td>
   <td style="text-align:center;"> population </td>
   <td style="text-align:center;"> 19987071 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Afghanistan </td>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> cases </td>
   <td style="text-align:center;"> 2666 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Afghanistan </td>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> population </td>
   <td style="text-align:center;"> 20595360 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Brazil </td>
   <td style="text-align:center;"> 1999 </td>
   <td style="text-align:center;"> cases </td>
   <td style="text-align:center;"> 37737 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Brazil </td>
   <td style="text-align:center;"> 1999 </td>
   <td style="text-align:center;"> population </td>
   <td style="text-align:center;"> 172006362 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Brazil </td>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> cases </td>
   <td style="text-align:center;"> 80488 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Brazil </td>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> population </td>
   <td style="text-align:center;"> 174504898 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> China </td>
   <td style="text-align:center;"> 1999 </td>
   <td style="text-align:center;"> cases </td>
   <td style="text-align:center;"> 212258 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> China </td>
   <td style="text-align:center;"> 1999 </td>
   <td style="text-align:center;"> population </td>
   <td style="text-align:center;"> 1272915272 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> China </td>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> cases </td>
   <td style="text-align:center;"> 213766 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> China </td>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> population </td>
   <td style="text-align:center;"> 1280428583 </td>
  </tr>
</tbody>
</table>
]

---

<br><br><br>
.center[
<table>
 <thead>
  <tr>
   <th style="text-align:center;"> country </th>
   <th style="text-align:center;"> year </th>
   <th style="text-align:center;"> rate </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Afghanistan </td>
   <td style="text-align:center;"> 1999 </td>
   <td style="text-align:center;"> 745/19987071 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Afghanistan </td>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> 2666/20595360 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Brazil </td>
   <td style="text-align:center;"> 1999 </td>
   <td style="text-align:center;"> 37737/172006362 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Brazil </td>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> 80488/174504898 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> China </td>
   <td style="text-align:center;"> 1999 </td>
   <td style="text-align:center;"> 212258/1272915272 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> China </td>
   <td style="text-align:center;"> 2000 </td>
   <td style="text-align:center;"> 213766/1280428583 </td>
  </tr>
</tbody>
</table>
]

---

<br><br>
.center[
<table>
<caption>Cases</caption>
 <thead>
  <tr>
   <th style="text-align:center;"> country </th>
   <th style="text-align:center;"> 1999 </th>
   <th style="text-align:center;"> 2000 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Afghanistan </td>
   <td style="text-align:center;"> 745 </td>
   <td style="text-align:center;"> 2666 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Brazil </td>
   <td style="text-align:center;"> 37737 </td>
   <td style="text-align:center;"> 80488 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> China </td>
   <td style="text-align:center;"> 212258 </td>
   <td style="text-align:center;"> 213766 </td>
  </tr>
</tbody>
</table>
]

<br>

.center[
<table>
<caption>Population</caption>
 <thead>
  <tr>
   <th style="text-align:center;"> country </th>
   <th style="text-align:center;"> 1999 </th>
   <th style="text-align:center;"> 2000 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Afghanistan </td>
   <td style="text-align:center;"> 19987071 </td>
   <td style="text-align:center;"> 20595360 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Brazil </td>
   <td style="text-align:center;"> 172006362 </td>
   <td style="text-align:center;"> 174504898 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> China </td>
   <td style="text-align:center;"> 1272915272 </td>
   <td style="text-align:center;"> 1280428583 </td>
  </tr>
</tbody>
</table>
]

---

There are three interrelated rules which make a dataset **tidy**:
<br><br>
1. **Each variable must have its own column.**

2. **Each observation must have its own row.**

3. **Each value must have its own cell.**

---

## Tidy!

---

## `tibble()` data frames

![:space 5]

> "Tibbles _are_ data frames, but they tweak some older behaviours to make life a little easier." - Hadley Wickham

![:space 5]

Differences between `tibbles` and `data.frames`

1. Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data.

2. More explicit errors: tibbles are strict. They throw lots of errors, meaning we catch mistakes early.

---

# Manipulating Data<br>

---

![:space 20]

- The `dplyr` package (part of the `tidyverse`) offers an intuitive **_verb based_** approach to data management in `R`.

- The goal of the `dplyr` logic is to provide an easy, intuitive naming convention for ubiquitous to data management tasks.

]

![:space 60]

---

## 6 main `dplyr` verbs

![:space 3]

- **`select()`**: Pick variables by their names.

- **`filter()`**: Pick observations by their values

- **`arrange()`**: Reorder the rows.

- **`mutate()`**: Create new variables with functions of existing variables.

- **`summarise()`**: Collapse many values down to a single summary.

- **`group_by()`**: changes the scope of each function from operating on the entire dataset to operating on it group-by-group. 
---

class:biglist

## How they work...

**All verbs work similarly:**

1. The first argument is a data frame.

2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

3. The result is a new data frame.

---

To walk through the performance of the main `dplyr` verbs, we'll use an internal dataset called `presidential`.

```r
?presidential # for information on the data
```

```r
dat <- presidential
head(dat)
```

```
## # A tibble: 6 x 4
##   name       start      end        party     
##   <chr>      <date>     <date>     <chr>     
## 1 Eisenhower 1953-01-20 1961-01-20 Republican
## 2 Kennedy    1961-01-20 1963-11-22 Democratic
## 3 Johnson    1963-11-22 1969-01-20 Democratic
## 4 Nixon      1969-01-20 1974-08-09 Republican
## 5 Ford       1974-08-09 1977-01-20 Republican
## 6 Carter     1977-01-20 1981-01-20 Democratic
```

---

# `select()`

---

![:space 10]

```r
select(dat,name,party)
```

```
## # A tibble: 11 x 2
##    name       party     
##    <chr>      <chr>     
##  1 Eisenhower Republican
##  2 Kennedy    Democratic
##  3 Johnson    Democratic
##  4 Nixon      Republican
##  5 Ford       Republican
##  6 Carter     Democratic
##  7 Reagan     Republican
##  8 Bush       Republican
##  9 Clinton    Democratic
## 10 Bush       Republican
## 11 Obama      Democratic
```

---

Or variable ranges using `:`

The following will provide all variables in-between `name` and `end`.

```r
select(dat,name:end)
```

```
## # A tibble: 11 x 3
##    name       start      end       
##    <chr>      <date>     <date>    
##  1 Eisenhower 1953-01-20 1961-01-20
##  2 Kennedy    1961-01-20 1963-11-22
##  3 Johnson    1963-11-22 1969-01-20
##  4 Nixon      1969-01-20 1974-08-09
##  5 Ford       1974-08-09 1977-01-20
##  6 Carter     1977-01-20 1981-01-20
##  7 Reagan     1981-01-20 1989-01-20
##  8 Bush       1989-01-20 1993-01-20
##  9 Clinton    1993-01-20 2001-01-20
## 10 Bush       2001-01-20 2009-01-20
## 11 Obama      2009-01-20 2017-01-20
```

---

The **order** in which variables are selected will translate to the output. Thus, one can easily **reorder columns** with `select()`.

```r
select(dat,name,end,start)
```

```
## # A tibble: 11 x 3
##    name       end        start     
##    <chr>      <date>     <date>    
##  1 Eisenhower 1961-01-20 1953-01-20
##  2 Kennedy    1963-11-22 1961-01-20
##  3 Johnson    1969-01-20 1963-11-22
##  4 Nixon      1974-08-09 1969-01-20
##  5 Ford       1977-01-20 1974-08-09
##  6 Carter     1981-01-20 1977-01-20
##  7 Reagan     1989-01-20 1981-01-20
##  8 Bush       1993-01-20 1989-01-20
##  9 Clinton    2001-01-20 1993-01-20
## 10 Bush       2009-01-20 2001-01-20
## 11 Obama      2017-01-20 2009-01-20
```

---

We can also easily **rename** variables by simply providing a new name within the function.

```r
select(dat,president=name,
       startdate=start,
       enddate=end)
```

```
## # A tibble: 11 x 3
##    president  startdate  enddate   
##    <chr>      <date>     <date>    
##  1 Eisenhower 1953-01-20 1961-01-20
##  2 Kennedy    1961-01-20 1963-11-22
##  3 Johnson    1963-11-22 1969-01-20
##  4 Nixon      1969-01-20 1974-08-09
##  5 Ford       1974-08-09 1977-01-20
##  6 Carter     1977-01-20 1981-01-20
##  7 Reagan     1981-01-20 1989-01-20
##  8 Bush       1989-01-20 1993-01-20
##  9 Clinton    1993-01-20 2001-01-20
## 10 Bush       2001-01-20 2009-01-20
## 11 Obama      2009-01-20 2017-01-20
```

---

Lastly, `select()` offers us a convenient way to drop variables by using the same logic that we employed with putting a **negative sign** in front of a dimension. The only difference here is that we can do the same but with a variable name.

Here we **drop** the `start` date variable.

```r
select(dat,-start)
```

```
## # A tibble: 11 x 3
##    name       end        party     
##    <chr>      <date>     <chr>     
##  1 Eisenhower 1961-01-20 Republican
##  2 Kennedy    1963-11-22 Democratic
##  3 Johnson    1969-01-20 Democratic
##  4 Nixon      1974-08-09 Republican
##  5 Ford       1977-01-20 Republican
##  6 Carter     1981-01-20 Democratic
##  7 Reagan     1989-01-20 Republican
##  8 Bush       1993-01-20 Republican
##  9 Clinton    2001-01-20 Democratic
## 10 Bush       2009-01-20 Republican
## 11 Obama      2017-01-20 Democratic
```

---

### Other useful `select()` behavior

---

### Other useful `select()` behavior

- `contains()` - extract columns with a specific naming convention.

```r
select(dat,contains("a"))
```

```
## # A tibble: 11 x 3
##    name       start      party     
##    <chr>      <date>     <chr>     
##  1 Eisenhower 1953-01-20 Republican
##  2 Kennedy    1961-01-20 Democratic
##  3 Johnson    1963-11-22 Democratic
##  4 Nixon      1969-01-20 Republican
##  5 Ford       1974-08-09 Republican
##  6 Carter     1977-01-20 Democratic
##  7 Reagan     1981-01-20 Republican
##  8 Bush       1989-01-20 Republican
##  9 Clinton    1993-01-20 Democratic
## 10 Bush       2001-01-20 Republican
## 11 Obama      2009-01-20 Democratic
```

---

### Other useful `select()` behavior

- `contains()` - extract columns with a specific naming convention.

- `starts_with()` - extract columns that start with a specific naming convention.

```r
select(dat,starts_with("s"))
```

```
## # A tibble: 11 x 1
##    start     
##    <date>    
##  1 1953-01-20
##  2 1961-01-20
##  3 1963-11-22
##  4 1969-01-20
##  5 1974-08-09
##  6 1977-01-20
##  7 1981-01-20
##  8 1989-01-20
##  9 1993-01-20
## 10 2001-01-20
## 11 2009-01-20
```

---

### Other useful `select()` behavior

- `contains()` - extract columns with a specific naming convention.

- `starts_with()` - extract columns that start with a specific naming convention.

- `ends_with()` - extract columns that end with a specific naming convention.

```r
select(dat,ends_with("d"))
```

```
## # A tibble: 11 x 1
##    end       
##    <date>    
##  1 1961-01-20
##  2 1963-11-22
##  3 1969-01-20
##  4 1974-08-09
##  5 1977-01-20
##  6 1981-01-20
##  7 1989-01-20
##  8 1993-01-20
##  9 2001-01-20
## 10 2009-01-20
## 11 2017-01-20
```

---

### Other useful `select()` behavior

- `contains()` - extract columns with a specific naming convention.

- `starts_with()` - extract columns that start with a specific naming convention.

- `ends_with()` - extract columns that end with a specific naming convention.

- `everything()` - extract every remaining column not yet stated in the selection.

```r
select(dat,start,end,everything())
```

```
## # A tibble: 11 x 4
##    start      end        name       party     
##    <date>     <date>     <chr>      <chr>     
##  1 1953-01-20 1961-01-20 Eisenhower Republican
##  2 1961-01-20 1963-11-22 Kennedy    Democratic
##  3 1963-11-22 1969-01-20 Johnson    Democratic
##  4 1969-01-20 1974-08-09 Nixon      Republican
##  5 1974-08-09 1977-01-20 Ford       Republican
##  6 1977-01-20 1981-01-20 Carter     Democratic
##  7 1981-01-20 1989-01-20 Reagan     Republican
##  8 1989-01-20 1993-01-20 Bush       Republican
##  9 1993-01-20 2001-01-20 Clinton    Democratic
## 10 2001-01-20 2009-01-20 Bush       Republican
## 11 2009-01-20 2017-01-20 Obama      Democratic
```

---

### Other useful `select()` behavior

- `contains()` - extract columns with a specific naming convention.

- `starts_with()` - extract columns that start with a specific naming convention.

- `ends_with()` - extract columns that end with a specific naming convention.

- `everything()` - extract every remaining column not yet stated in the selection.

- `matches()` - extract columns using a regular expression.

```r
select(dat,matches("^s"))
```

---

## `filter()`

![:space 3]

---

![:space 10]

```r
filter(dat,party == "Republican")
```

```
## # A tibble: 6 x 4
##   name       start      end        party     
##   <chr>      <date>     <date>     <chr>     
## 1 Eisenhower 1953-01-20 1961-01-20 Republican
## 2 Nixon      1969-01-20 1974-08-09 Republican
## 3 Ford       1974-08-09 1977-01-20 Republican
## 4 Reagan     1981-01-20 1989-01-20 Republican
## 5 Bush       1989-01-20 1993-01-20 Republican
## 6 Bush       2001-01-20 2009-01-20 Republican
```

---

## `arrange()`

![:space 3]

![:space 10]

```r
arrange(dat,party)
```

```
## # A tibble: 11 x 4
##    name       start      end        party     
##    <chr>      <date>     <date>     <chr>     
##  1 Kennedy    1961-01-20 1963-11-22 Democratic
##  2 Johnson    1963-11-22 1969-01-20 Democratic
##  3 Carter     1977-01-20 1981-01-20 Democratic
##  4 Clinton    1993-01-20 2001-01-20 Democratic
##  5 Obama      2009-01-20 2017-01-20 Democratic
##  6 Eisenhower 1953-01-20 1961-01-20 Republican
##  7 Nixon      1969-01-20 1974-08-09 Republican
##  8 Ford       1974-08-09 1977-01-20 Republican
##  9 Reagan     1981-01-20 1989-01-20 Republican
## 10 Bush       1989-01-20 1993-01-20 Republican
## 11 Bush       2001-01-20 2009-01-20 Republican
```

---

`arrange()` with the internal function `desc()` can change to a **descending** ordering.

```r
arrange(dat,desc(start))
```

```
## # A tibble: 11 x 4
##    name       start      end        party     
##    <chr>      <date>     <date>     <chr>     
##  1 Obama      2009-01-20 2017-01-20 Democratic
##  2 Bush       2001-01-20 2009-01-20 Republican
##  3 Clinton    1993-01-20 2001-01-20 Democratic
##  4 Bush       1989-01-20 1993-01-20 Republican
##  5 Reagan     1981-01-20 1989-01-20 Republican
##  6 Carter     1977-01-20 1981-01-20 Democratic
##  7 Ford       1974-08-09 1977-01-20 Republican
##  8 Nixon      1969-01-20 1974-08-09 Republican
##  9 Johnson    1963-11-22 1969-01-20 Democratic
## 10 Kennedy    1961-01-20 1963-11-22 Democratic
## 11 Eisenhower 1953-01-20 1961-01-20 Republican
```

---

## `mutate()`

![:space 3]

---

![:space 10]

```r
mutate(dat,
       # in office during cold war
       CW = start <= '1990-03-11')
```

```
## # A tibble: 11 x 5
##    name       start      end        party      CW   
##    <chr>      <date>     <date>     <chr>      <lgl>
##  1 Eisenhower 1953-01-20 1961-01-20 Republican TRUE 
##  2 Kennedy    1961-01-20 1963-11-22 Democratic TRUE 
##  3 Johnson    1963-11-22 1969-01-20 Democratic TRUE 
##  4 Nixon      1969-01-20 1974-08-09 Republican TRUE 
##  5 Ford       1974-08-09 1977-01-20 Republican TRUE 
##  6 Carter     1977-01-20 1981-01-20 Democratic TRUE 
##  7 Reagan     1981-01-20 1989-01-20 Republican TRUE 
##  8 Bush       1989-01-20 1993-01-20 Republican TRUE 
##  9 Clinton    1993-01-20 2001-01-20 Democratic FALSE
## 10 Bush       2001-01-20 2009-01-20 Republican FALSE
## 11 Obama      2009-01-20 2017-01-20 Democratic FALSE
```

---

`mutate()` also allows us to **_instantly_** use variables we just created.

```r
mutate(dat,
       CW = start <= '1990-03-11',
       CW = as.numeric(CW))
```

```
## # A tibble: 11 x 5
##    name       start      end        party         CW
##    <chr>      <date>     <date>     <chr>      <dbl>
##  1 Eisenhower 1953-01-20 1961-01-20 Republican     1
##  2 Kennedy    1961-01-20 1963-11-22 Democratic     1
##  3 Johnson    1963-11-22 1969-01-20 Democratic     1
##  4 Nixon      1969-01-20 1974-08-09 Republican     1
##  5 Ford       1974-08-09 1977-01-20 Republican     1
##  6 Carter     1977-01-20 1981-01-20 Democratic     1
##  7 Reagan     1981-01-20 1989-01-20 Republican     1
##  8 Bush       1989-01-20 1993-01-20 Republican     1
##  9 Clinton    1993-01-20 2001-01-20 Democratic     0
## 10 Bush       2001-01-20 2009-01-20 Republican     0
## 11 Obama      2009-01-20 2017-01-20 Democratic     0
```

---

Like `mutate()`, `transmute()` provides a method for generating a new variable, but unlike the former, it **returns only the newly created variable**.

```r
transmute(dat,CW = start <= '1990-03-11')
```

```
## # A tibble: 11 x 1
##    CW   
##    <lgl>
##  1 TRUE 
##  2 TRUE 
##  3 TRUE 
##  4 TRUE 
##  5 TRUE 
##  6 TRUE 
##  7 TRUE 
##  8 TRUE 
##  9 FALSE
## 10 FALSE
## 11 FALSE
```

---

## `summarize()`

![:space 15]

```r
summarize(dat,
          days_in_office = mean(end-start),
          max = max(end-start),
          min = min(end-start))
```

```
## # A tibble: 1 x 3
##   days_in_office max       min     
##   <drtn>         <drtn>    <drtn>  
## 1 2125.091 days  2922 days 895 days
```

---

![:space 10]

There are a number of internal functions that can be used with `mutate()`, `transmute()`,
and `summarize()`.

- `n()` counts the number of observations
- `n_distinct()` counts the number of distinct entries

![:space 10]

```r
summarize(dat,N=n(),N_party=n_distinct(party))
```

```
## # A tibble: 1 x 2
##       N N_party
##   <int>   <int>
## 1    11       2
```

---

## `group_by()`

![:space 3]

---

![:space 10]

When used in conjunction with some of the other functions, `group_by()` becomes a powerful to perform by cluster/unit/group operations.

![:space 5]

```r
# group by party
x <- group_by(dat,party)
summarize(x,min_in_office = min(end-start))
```

```
## # A tibble: 2 x 2
##   party      min_in_office
##   <chr>      <drtn>       
## 1 Democratic 1036 days    
## 2 Republican  895 days
```

---

### Other useful `tidyverse` functions

- `count()`/`add_count()`

```r
count(dat, party)
```

```
## # A tibble: 2 x 2
##   party          n
##   <chr>      <int>
## 1 Democratic     5
## 2 Republican     6
```

```r
add_count(dat, party)
```

```
## # A tibble: 11 x 5
##    name       start      end        party          n
##    <chr>      <date>     <date>     <chr>      <int>
##  1 Eisenhower 1953-01-20 1961-01-20 Republican     6
##  2 Kennedy    1961-01-20 1963-11-22 Democratic     5
##  3 Johnson    1963-11-22 1969-01-20 Democratic     5
##  4 Nixon      1969-01-20 1974-08-09 Republican     6
##  5 Ford       1974-08-09 1977-01-20 Republican     6
##  6 Carter     1977-01-20 1981-01-20 Democratic     5
##  7 Reagan     1981-01-20 1989-01-20 Republican     6
##  8 Bush       1989-01-20 1993-01-20 Republican     6
##  9 Clinton    1993-01-20 2001-01-20 Democratic     5
## 10 Bush       2001-01-20 2009-01-20 Republican     6
## 11 Obama      2009-01-20 2017-01-20 Democratic     5
```

---

### Other useful `tidyverse` functions

- `count()`/`add_count()`

- `recode()`

```r
mutate(dat,party = recode(party,'Republican'=1,'Democratic'=0))
```

```
## # A tibble: 11 x 4
##    name       start      end        party
##    <chr>      <date>     <date>     <dbl>
##  1 Eisenhower 1953-01-20 1961-01-20     1
##  2 Kennedy    1961-01-20 1963-11-22     0
##  3 Johnson    1963-11-22 1969-01-20     0
##  4 Nixon      1969-01-20 1974-08-09     1
##  5 Ford       1974-08-09 1977-01-20     1
##  6 Carter     1977-01-20 1981-01-20     0
##  7 Reagan     1981-01-20 1989-01-20     1
##  8 Bush       1989-01-20 1993-01-20     1
##  9 Clinton    1993-01-20 2001-01-20     0
## 10 Bush       2001-01-20 2009-01-20     1
## 11 Obama      2009-01-20 2017-01-20     0
```

---

### Other useful `tidyverse` functions

- `count()`/`add_count()`

- `recode()`

- `pull()`

```r
pull(dat,party)
```

```
##  [1] "Republican" "Democratic" "Democratic" "Republican" "Republican"
##  [6] "Democratic" "Republican" "Republican" "Democratic" "Republican"
## [11] "Democratic"
```

---

### Other useful `tidyverse` functions

- `count()`/`add_count()`

- `recode()`

- `pull()`

- `glimpse()`

```r
glimpse(dat)
```

```
## Rows: 11
## Columns: 4
## $ name  <chr> "Eisenhower", "Kennedy", "Johnson", "Nixon", "Ford", "Carter", …
## $ start <date> 1953-01-20, 1961-01-20, 1963-11-22, 1969-01-20, 1974-08-09, 19…
## $ end   <date> 1961-01-20, 1963-11-22, 1969-01-20, 1974-08-09, 1977-01-20, 19…
## $ party <chr> "Republican", "Democratic", "Democratic", "Republican", "Republ…
```

---

### Other useful `tidyverse` functions

- `count()`/`add_count()`

- `recode()`

- `pull()`

- `glimpse()`

- `slice()`

```r
slice(dat,2:3)
```

```
## # A tibble: 2 x 4
##   name    start      end        party     
##   <chr>   <date>     <date>     <chr>     
## 1 Kennedy 1961-01-20 1963-11-22 Democratic
## 2 Johnson 1963-11-22 1969-01-20 Democratic
```

---

### Other useful `tidyverse` functions

- `sample_n()`/`sample_frac()`

```r
sample_n(dat,2)
```

```
## # A tibble: 2 x 4
##   name    start      end        party     
##   <chr>   <date>     <date>     <chr>     
## 1 Kennedy 1961-01-20 1963-11-22 Democratic
## 2 Nixon   1969-01-20 1974-08-09 Republican
```

```r
sample_frac(dat,.25)
```

```
## # A tibble: 3 x 4
##   name   start      end        party     
##   <chr>  <date>     <date>     <chr>     
## 1 Obama  2009-01-20 2017-01-20 Democratic
## 2 Reagan 1981-01-20 1989-01-20 Republican
## 3 Nixon  1969-01-20 1974-08-09 Republican
```

---

### Other useful `tidyverse` functions

- `sample_n()`/`sample_frac()`

- `case_when()`

```r
mutate(dat, 
       pres_type = case_when(
         (end - start) == 1461 ~ "one-term pres.",
         (end - start) == 2922 ~ "two-term pres.",
         TRUE ~ "special case"
       )
)
```

```
## # A tibble: 11 x 5
##    name       start      end        party      pres_type     
##    <chr>      <date>     <date>     <chr>      <chr>         
##  1 Eisenhower 1953-01-20 1961-01-20 Republican two-term pres.
##  2 Kennedy    1961-01-20 1963-11-22 Democratic special case  
##  3 Johnson    1963-11-22 1969-01-20 Democratic special case  
##  4 Nixon      1969-01-20 1974-08-09 Republican special case  
##  5 Ford       1974-08-09 1977-01-20 Republican special case  
##  6 Carter     1977-01-20 1981-01-20 Democratic one-term pres.
##  7 Reagan     1981-01-20 1989-01-20 Republican two-term pres.
##  8 Bush       1989-01-20 1993-01-20 Republican one-term pres.
##  9 Clinton    1993-01-20 2001-01-20 Democratic two-term pres.
## 10 Bush       2001-01-20 2009-01-20 Republican two-term pres.
## 11 Obama      2009-01-20 2017-01-20 Democratic two-term pres.
```

---

### Other useful `tidyverse` functions

- `sample_n()`/`sample_frac()`

- `case_when()`

- `lag()`/`lead()`

```r
mutate(dat,
       predecessor = lag(name,n=1,order_by=start)
)
```

```
## # A tibble: 11 x 5
##    name       start      end        party      predecessor
##    <chr>      <date>     <date>     <chr>      <chr>      
##  1 Eisenhower 1953-01-20 1961-01-20 Republican <NA>       
##  2 Kennedy    1961-01-20 1963-11-22 Democratic Eisenhower 
##  3 Johnson    1963-11-22 1969-01-20 Democratic Kennedy    
##  4 Nixon      1969-01-20 1974-08-09 Republican Johnson    
##  5 Ford       1974-08-09 1977-01-20 Republican Nixon      
##  6 Carter     1977-01-20 1981-01-20 Democratic Ford       
##  7 Reagan     1981-01-20 1989-01-20 Republican Carter     
##  8 Bush       1989-01-20 1993-01-20 Republican Reagan     
##  9 Clinton    1993-01-20 2001-01-20 Democratic Bush       
## 10 Bush       2001-01-20 2009-01-20 Republican Clinton    
## 11 Obama      2009-01-20 2017-01-20 Democratic Bush
```

---

### Other useful `tidyverse` functions

- `sample_n()`/`sample_frac()`

- `case_when()`

- `lag()`/`lead()`

- `row_number()`

```r
mutate(dat,id = row_number())
```

```
## # A tibble: 11 x 5
##    name       start      end        party         id
##    <chr>      <date>     <date>     <chr>      <int>
##  1 Eisenhower 1953-01-20 1961-01-20 Republican     1
##  2 Kennedy    1961-01-20 1963-11-22 Democratic     2
##  3 Johnson    1963-11-22 1969-01-20 Democratic     3
##  4 Nixon      1969-01-20 1974-08-09 Republican     4
##  5 Ford       1974-08-09 1977-01-20 Republican     5
##  6 Carter     1977-01-20 1981-01-20 Democratic     6
##  7 Reagan     1981-01-20 1989-01-20 Republican     7
##  8 Bush       1989-01-20 1993-01-20 Republican     8
##  9 Clinton    1993-01-20 2001-01-20 Democratic     9
## 10 Bush       2001-01-20 2009-01-20 Republican    10
## 11 Obama      2009-01-20 2017-01-20 Democratic    11
```

---

### Other useful `tidyverse` functions

- `sample_n()`/`sample_frac()`

- `case_when()`

- `lag()`/`lead()`

- `row_number()`

- `rowname_to_column()`/`column_to_rownames()`

```r
rownames_to_column(dat,'id')
```

```
## # A tibble: 11 x 5
##    id    name       start      end        party     
##    <chr> <chr>      <date>     <date>     <chr>     
##  1 1     Eisenhower 1953-01-20 1961-01-20 Republican
##  2 2     Kennedy    1961-01-20 1963-11-22 Democratic
##  3 3     Johnson    1963-11-22 1969-01-20 Democratic
##  4 4     Nixon      1969-01-20 1974-08-09 Republican
##  5 5     Ford       1974-08-09 1977-01-20 Republican
##  6 6     Carter     1977-01-20 1981-01-20 Democratic
##  7 7     Reagan     1981-01-20 1989-01-20 Republican
##  8 8     Bush       1989-01-20 1993-01-20 Republican
##  9 9     Clinton    1993-01-20 2001-01-20 Democratic
## 10 10    Bush       2001-01-20 2009-01-20 Republican
## 11 11    Obama      2009-01-20 2017-01-20 Democratic
```

---

### Other useful `tidyverse` functions

- `sample_n()`/`sample_frac()`

- `case_when()`

- `lag()`/`lead()`

- `row_number()`

- `rowname_to_column()`/`column_to_rownames()`

```r
column_to_rownames(head(dat,3),var="name")
```

```
##                 start        end      party
## Eisenhower 1953-01-20 1961-01-20 Republican
## Kennedy    1961-01-20 1963-11-22 Democratic
## Johnson    1963-11-22 1969-01-20 Democratic
```

---

### Cheating is allowed

![:space 10]

[**Download a cheatsheet**](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) to  containing an entire list of the tidyverse `dplyr` verbs.

![:space 5]

Access any cheatsheet for the RStudio ecosystem of packages directly from the IDE.

---

class:newsection

# Piping

---

## Combining `dplyr` functions

When we need to do a series of manipulations, we can **perform each manipulation individually and save each entry as a new object** that we write over.

```r
x <- filter(presidential,party=='Republican')
x <- group_by(x,name)
x <- transmute(x,t_in_office = end-start)
x <- arrange(x,t_in_office)
x
```

```
## # A tibble: 6 x 2
## # Groups:   name [5]
##   name       t_in_office
##   <chr>      <drtn>     
## 1 Ford        895 days  
## 2 Bush       1461 days  
## 3 Nixon      2027 days  
## 4 Eisenhower 2922 days  
## 5 Reagan     2922 days  
## 6 Bush       2922 days
```

---

Or we can **nest** functions _within_ each other.

```r
arrange(
  transmute(
    group_by(
      filter(presidential,party=='Republican'),name),
    t_in_office = end-start),
  t_in_office)
```

The issue with **nesting functions** is that it is **![:text_color darkred](difficult to [1] read and [2] detect a mistake!)**

---

### Piping Functions

The **pipe** is a useful tool that allows us to **pass** output from one function to the next.

To pipe we write **`%>%`** _in-between_ each function.

We pass the output to specific locations in the proceeding function using the pointer `.`

Piping offers is a clean way of manipulating data that is **intuitive and easy to read**.

---

Here we **pass** our `data` to `filter()` then to `group_by()` then to `transmute()` and then finally to `arrange()` which returns our output!

```r
presidential %>%
  filter(party=='Republican') %>%
  group_by(name) %>%
  transmute(t_in_office = end-start) %>%
  arrange(t_in_office)
```

---

### Two things to keep in mind when piping...

![:space 15]

1. Functions **_must_** be linked with `%>%`

2. When functions have **multiple arguments**, point to where the data should go with a period (`.`)

```r
data %>% function(arg1= ., arg2=TRUE)
```

---

class:newsection

# Joining Data

---

## `_join` functions (dplyr)

![:space 5]

`dplyr` offers a range of joining/merging functions that are more intuitive to use. These functions provide a **SQL framework** that is easier to read and more efficient.
- `left_join()`
- `right_join()`
- `inner_join()`
- `full_join()`
- `anti_join()`

When joining data, you must have a **unique** identifier on the dimension you're matching on.

---

![:space 10]

Consider the following two example datasets...

```r
data_A
```

```
##    country Var1
## 1  Nigeria    4
## 2  England    3
## 3 Botswana    6
```

```r
data_B
```

```
##         country   Var2
## 1       Nigeria    Low
## 2 United States   High
## 3      Botswana Medium
```

---

## `left_join()`

---

## `left_join()`

![:space 10]

```r
left_join(data_A,data_B,by="country")
```

```
##    country Var1   Var2
## 1  Nigeria    4    Low
## 2  England    3   <NA>
## 3 Botswana    6 Medium
```

---

## `right_join()`

---

## `right_join()`

![:space 10]

```r
right_join(data_A,data_B,by="country")
```

```
##         country Var1   Var2
## 1       Nigeria    4    Low
## 2      Botswana    6 Medium
## 3 United States   NA   High
```

---

## `inner_join()`

---

## `inner_join()`

![:space 10]

```r
inner_join(data_A,data_B,by="country")
```

```
##    country Var1   Var2
## 1  Nigeria    4    Low
## 2 Botswana    6 Medium
```

---

### `full_join()` 
 
.center[<img src="Figures/full_join.png", height=100px>]

---

### `full_join()`

![:space 10]

```r
full_join(data_A,data_B,by="country")
```

```
##         country Var1   Var2
## 1       Nigeria    4    Low
## 2       England    3   <NA>
## 3      Botswana    6 Medium
## 4 United States   NA   High
```

---

### `anti_join()` 
 
.center[<img src="Figures/anti_join_left.png", height=100px>]

---

### `anti_join()`

![:space 10]

```r
anti_join(data_A,data_B,by="country")
```

```
##   country Var1
## 1 England    3
```

---

## `bind_rows()`

![:space 5]

```r
bind_rows(data_A,data_B)
```

```
##         country Var1   Var2
## 1       Nigeria    4   <NA>
## 2       England    3   <NA>
## 3      Botswana    6   <NA>
## 4       Nigeria   NA    Low
## 5 United States   NA   High
## 6      Botswana   NA Medium
```

---

## `bind_cols()`

![:space 5]

```r
bind_cols(data_A,data_B)
```

```
##   country...1 Var1   country...3   Var2
## 1     Nigeria    4       Nigeria    Low
## 2     England    3 United States   High
## 3    Botswana    6      Botswana Medium
```

---

## Disparate column names

![:space 10]

Sometimes the naming conventions of two datasets don't perfectly align. When this happens, we can specify how data merges onto one another more explicitly using the `by=` argument.

Moreover, we can merge on **_more_ than one dimension** by specifying all relevant column names.

---

![:space 10]

Once again, consider the following example data..

```r
data_A
```

```
##    country year Var1
## 1  Nigeria 1999    4
## 2  England 2001    3
## 3 Botswana 2000    6
```

```r
data_B
```

```
##    country_name year   Var2
## 1       Nigeria 1999    Low
## 2 United States 2004   High
## 3      Botswana 2003 Medium
```

---

![:space 20]

```r
full_join(data_A,data_B,
          by=c('country'='country_name',
               'year'))
```

```
##         country year Var1   Var2
## 1       Nigeria 1999    4    Low
## 2       England 2001    3   <NA>
## 3      Botswana 2000    6   <NA>
## 4 United States 2004   NA   High
## 5      Botswana 2003   NA Medium
```

---

class:newsection

# Reshaping Data

---

![:space 10]

Often, we need to alter the structure of a `data.frame` from a **wide format**...

![:space 10]

.center[
<table>
 <thead>
  <tr>
   <th style="text-align:center;"> country </th>
   <th style="text-align:center;"> 1992 </th>
   <th style="text-align:center;"> 1993 </th>
   <th style="text-align:center;"> 1994 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Nigeria </td>
   <td style="text-align:center;"> 9.72 </td>
   <td style="text-align:center;"> 10.06 </td>
   <td style="text-align:center;"> 9.66 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Iran </td>
   <td style="text-align:center;"> 9.88 </td>
   <td style="text-align:center;"> 10.86 </td>
   <td style="text-align:center;"> 9.78 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Cambodia </td>
   <td style="text-align:center;"> 10.78 </td>
   <td style="text-align:center;"> 10.23 </td>
   <td style="text-align:center;"> 10.61 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Australia </td>
   <td style="text-align:center;"> 10.04 </td>
   <td style="text-align:center;"> 9.37 </td>
   <td style="text-align:center;"> 10.18 </td>
  </tr>
</tbody>
</table>
]

---

...into a **long format**

.center[
<table>
 <thead>
  <tr>
   <th style="text-align:center;"> country </th>
   <th style="text-align:center;"> year </th>
   <th style="text-align:center;"> var </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Nigeria </td>
   <td style="text-align:center;"> 1992 </td>
   <td style="text-align:center;"> 9.72 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Nigeria </td>
   <td style="text-align:center;"> 1993 </td>
   <td style="text-align:center;"> 10.06 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Nigeria </td>
   <td style="text-align:center;"> 1994 </td>
   <td style="text-align:center;"> 9.66 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Iran </td>
   <td style="text-align:center;"> 1992 </td>
   <td style="text-align:center;"> 9.88 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Iran </td>
   <td style="text-align:center;"> 1993 </td>
   <td style="text-align:center;"> 10.86 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Iran </td>
   <td style="text-align:center;"> 1994 </td>
   <td style="text-align:center;"> 9.78 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Cambodia </td>
   <td style="text-align:center;"> 1992 </td>
   <td style="text-align:center;"> 10.78 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Cambodia </td>
   <td style="text-align:center;"> 1993 </td>
   <td style="text-align:center;"> 10.23 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Cambodia </td>
   <td style="text-align:center;"> 1994 </td>
   <td style="text-align:center;"> 10.61 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Australia </td>
   <td style="text-align:center;"> 1992 </td>
   <td style="text-align:center;"> 10.04 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Australia </td>
   <td style="text-align:center;"> 1993 </td>
   <td style="text-align:center;"> 9.37 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Australia </td>
   <td style="text-align:center;"> 1994 </td>
   <td style="text-align:center;"> 10.18 </td>
  </tr>
</tbody>
</table>
]

---

![:space 20]

`tidyr` is a tidyverse package built to help reshape data. The package contains an array of functions that are all useful cleaning a data construct.

`tidyr` eases tasks such as:

- dropping missing values
- filling missing values
- separating a column into two variables or uniting two columns into one

]

---

### `pivot_longer()`: from wide-to-long

![:space 15]

![:center_img 100](Figures/gather.png)

---

### `pivot_longer()`: from wide-to-long

![:space 10]

```r
dat
```

```
##     country  1992  1993  1994
## 1   Nigeria  9.72 10.06  9.66
## 2      Iran  9.88 10.86  9.78
## 3  Cambodia 10.78 10.23 10.61
## 4 Australia 10.04  9.37 10.18
```

---

### `pivot_longer()`: from wide-to-long

Specify the columns you want to expand

```r
dat %>% pivot_longer(cols="1992":"1994")
```

```
## # A tibble: 12 x 3
##    country   name  value
##    <fct>     <chr> <dbl>
##  1 Nigeria   1992   9.72
##  2 Nigeria   1993  10.1 
##  3 Nigeria   1994   9.66
##  4 Iran      1992   9.88
##  5 Iran      1993  10.9 
##  6 Iran      1994   9.78
##  7 Cambodia  1992  10.8 
##  8 Cambodia  1993  10.2 
##  9 Cambodia  1994  10.6 
## 10 Australia 1992  10.0 
## 11 Australia 1993   9.37
## 12 Australia 1994  10.2
```

---

### `pivot_longer()`: from wide-to-long

Rename the expanded columns

```r
dat %>% pivot_longer(cols="1992":"1994",
                     names_to = "year",
                     values_to = "ln_gdppc")
```

```
## # A tibble: 12 x 3
##    country   year  ln_gdppc
##    <fct>     <chr>    <dbl>
##  1 Nigeria   1992      9.72
##  2 Nigeria   1993     10.1 
##  3 Nigeria   1994      9.66
##  4 Iran      1992      9.88
##  5 Iran      1993     10.9 
##  6 Iran      1994      9.78
##  7 Cambodia  1992     10.8 
##  8 Cambodia  1993     10.2 
##  9 Cambodia  1994     10.6 
## 10 Australia 1992     10.0 
## 11 Australia 1993      9.37
## 12 Australia 1994     10.2
```

---

### `pivot_longer()`: from wide-to-long

![:space 10]
Be selective about which columns are expanded and which are not.

```r
# variables can be excluded from the reshape
dat %>% pivot_longer(cols="1992",
                     names_to = "year",
                     values_to = "ln_gdppc") 
```

```
## # A tibble: 4 x 5
##   country   `1993` `1994` year  ln_gdppc
##   <fct>      <dbl>  <dbl> <chr>    <dbl>
## 1 Nigeria    10.1    9.66 1992      9.72
## 2 Iran       10.9    9.78 1992      9.88
## 3 Cambodia   10.2   10.6  1992     10.8 
## 4 Australia   9.37  10.2  1992     10.0
```

---

## `pivot_wider()`: from long-to-wide

![:space 10]

---

## `pivot_wider()`: from long-to-wide

![:space 10]

```r
dat 
```

```
##      country year ln_gdppc
## 1    Nigeria 1992     9.72
## 3   Cambodia 1992    10.78
## 4  Australia 1992    10.04
## 5    Nigeria 1993    10.06
## 6       Iran 1993    10.86
## 8  Australia 1993     9.37
## 9    Nigeria 1994     9.66
## 10      Iran 1994     9.78
## 11  Cambodia 1994    10.61
## 12 Australia 1994    10.18
```

---

## `pivot_wider()`: from long-to-wide

Main arguments:

- `names_from`: name of the variable that will be spread out into columns. 
- `values_from`: values that will populate the cells of each column

![:space 3]

```r
dat %>% pivot_wider(names_from = year,
                    values_from = ln_gdppc)
```

```
## # A tibble: 4 x 4
##   country   `1992` `1993` `1994`
##   <fct>      <dbl>  <dbl>  <dbl>
## 1 Nigeria     9.72  10.1    9.66
## 2 Cambodia   10.8   NA     10.6 
## 3 Australia  10.0    9.37  10.2 
## 4 Iran       NA     10.9    9.78
```

---

## `pivot_wider()`: from long-to-wide

Main arguments:

- `names_from`: name of the variable that will be spread out into columns. 
- `values_from`: values that will populate the cells of each column.
- `values_fill`: specify what value a missing value should take on.

```r
dat %>% pivot_wider(names_from = year,
                    values_from = ln_gdppc,
                    values_fill = list(ln_gdppc = -99))
```

```
## # A tibble: 4 x 4
##   country   `1992` `1993` `1994`
##   <fct>      <dbl>  <dbl>  <dbl>
## 1 Nigeria     9.72  10.1    9.66
## 2 Cambodia   10.8  -99     10.6 
## 3 Australia  10.0    9.37  10.2 
## 4 Iran      -99     10.9    9.78
```

---

## Legacy code

![:space 5]

Note that `pivot_wider()` and `pivot_longer()` are relatively new wrappers for the older `gather()` and `spread()` functions.

![:space 5]

What does this mean for you?

- Answers online and in the reading on how to reshape data may differ depending on the preference of the source.

- Use whatever versions of the function that work best for you.

- Many find that  `pivot_wider()` and `pivot_longer()` are easier to use than the older `gather()` and `spread()` functions.