PPOL670 | Introduction to Data Science for Public Policy Week 2 Introduction to Programming in R

# PPOL670 | Introduction to Data Science for Public Policy Week 2 Introduction to Programming in <code>R</code> 
###  Prof. Eric Dunford  ◆  Georgetown University  ◆  McCourt School of Public Policy  ◆  <a href="mailto:eric.dunford@georgetown.edu">eric.dunford@georgetown.edu</a>

---

<div class="slide-footer"> 
PPOL670 | Introduction to Data Science for Public Policy

&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;

Week 2

&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;

Introduction to R

</div>

---
class: outline

# Outline for Today

![:space 3]

- **_Objects_**

- **_Data Structures_** and **_Accessing_** Data Points

- Mathematical and Logical **_Operators_**

- **_Functions_** and **_Packages_**

- **_Importing_** and **_Exporting_** data

- **_Working directory_** and **_Projects_**

---

# Objects

---

# Objects

![:space 10]

R uses a specific set of rules to govern how it looks up values in the environment.

We manage data by assigning it a name, and referencing that name when we need to use the information again.

Officially, these lookup rules are called `lexical scoping`. The "lexical" part comes from the computer science term "[lexing](https://en.wikipedia.org/wiki/Lexical_analysis)": the process that converts code written as text into meaningful pieces that the programming language understands.

---

# Assigning an Object

![:space 10]

In simple terms, an `object` is a bit of text that represents a specific value.

```r
x <- 3
x
```

```
## [1] 3
```

Here we've assigned the value `3` to the letter `x`. Whenever we type `x`, `R` understands that we really mean `3`.

---

# Assigning an Object

![:space 5]

There are three standard ways to assign a value to a name:
- `<-`
- `=`
- `assign()`

"Best practice" is to use the `<-` assignment operator.

```r
x1 <- 3
x2 = 3
assign("x3",3)
c(x1, x2, x3)
```

```
## [1] 3 3 3
```

---

# Assigning an Object

![:space 5]

Note that assignment is flexible. Objects can be overwritten whenever necessary, even with a value of a different class.

```r
object <- 5
object
```

```
## [1] 5
```

```r
object <- "A Very Vibrant Shade of Purple"
object
```

```
## [1] "A Very Vibrant Shade of Purple"
```

> Down the road it will help to give objects meaningful names!

---

# Objects

![:space 10]

One can see all the objects in the environment by either looking at the user interface in RStudio (specifically, the **environment tab**)... or by typing `ls()` in the console.

```r
ls()
```

```
## [1] "binary"  "message" "object"  "owd"     "x"       "x1"      "x2"      "x3"
```

---

# Object Classes

![:space 10]

Once assigned, an object has a **class**. A class describes the properties of the **data type** or **data structure** assigned to an object.

We can use the function `class()` to find out what kind of data type or structure our object is.

```r
class(x) 
```

```
## [1] "numeric"
```

The object `x` is of class `numeric`, i.e. a number.

---

# Object Classes

There are [many classes](https://www.tutorialspoint.com/r/r_data_types.htm) that an object can take.

```r
obj1 <- "This is a sentence"
obj2 <- TRUE
obj3 <- factor("This is a sentence")
c(class(obj1),class(obj2),class(obj3))
```

```
## [1] "character" "logical" "factor"
```

> Understanding what class of object one is dealing with is important --- as it will determine what kind of manipulations one can do or what functions an object will work with.

---

# Object Classes

![:space 10]

As noted, there are many different **data types** in `R`. We will primarily run into the following types:
 
 
.center[
| Type | Example |
|---|---|
| Integer | `7L` |
| Numeric/Double | `4.56` |
| Character | `"Hello!"` |
| Logical | `TRUE` |
| Factor | `"cat" (1)` |
]
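---

# Object Classes

These types can be checked at the console with `class()`. The sketch below is illustrative: the `L` suffix is how `R` marks an integer literal (a bare `7` is stored as a double), and the object `pet` is a made-up example.

```r
class(7L)         # "integer": the L suffix requests an integer
class(7)          # "numeric": a bare number is stored as a double
class(4.56)       # "numeric"
class("Hello!")   # "character"
class(TRUE)       # "logical"

# A factor stores labels as integer codes behind the scenes
pet <- factor(c("cat", "dog", "cat"))
class(pet)        # "factor"
levels(pet)       # "cat" "dog"
as.integer(pet)   # 1 2 1: the codes that point to the levels
```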

---

# Object Coercion

![:space 10]

When need be, an object can be **coerced** to be a different class.

```r
x
```

```
## [1] 3
```

```r
as.character(x)
```

```
## [1] "3"
```
Here we converted the value of `x` (the number `3`) into a character string with the text "3". Note that `x` itself is unchanged: to keep the converted value, assign it to a name, e.g. `x <- as.character(x)`.
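---

# Object Coercion

Coercion works in several directions, and it can fail. A minimal sketch follows; the name `x_chr` is introduced here only for illustration.

```r
x <- 3
x_chr <- as.character(x)   # coerce and store the result under a new name
class(x)                   # "numeric": the original object is untouched
class(x_chr)               # "character"

as.numeric("4.5")                       # 4.5
suppressWarnings(as.numeric("four"))    # NA: this text is not a number
as.logical(c("TRUE", "false", "yes"))   # TRUE FALSE NA
```

When a value cannot be converted, `R` returns `NA` (and normally prints a warning) rather than stopping.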

---

# Removing objects from the Environment

![:space 10]

We often want to get rid of objects after creating them. To **delete** (or drop) an object from the environment, use the function `rm()`, which stands for "remove".

```r
ls()
```

```
##  [1] "binary"  "message" "obj1"    "obj2"    "obj3"    "object"  "owd"     "x"       "x1"      "x2"     
## [11] "x3"
```

```r
rm(x,x1,x2,x3)
ls()
```

```
## [1] "binary"  "message" "obj1"    "obj2"    "obj3"    "object"  "owd"
```

---

# Clearing the Environment

We can also remove **_all_** objects from the environment at once by typing the following command.

```r
rm(list = ls(all.names = TRUE))
```

Or we can do so from RStudio by clicking on the `broom icon`.

---

# Objects: So what's the point?

![:space 5]

**Objects** offer a way to **reference different data**. This means that we can play around with _a lot_ of different data types simultaneously.

This makes it easier to:

--
  + **manage** and use multiple datasets at the same time
  + **extract** and manipulate single variables
  + **work** with little bits of data at a time to make sure your calculations work.

--
> Note that the only way to hold onto information is to assign it to an object! Otherwise the information is printed but instantly forgotten by `R`.

---

class:newsection

# Data Structures

---

## Data Structures

![:space 10]

There are also many ways data can be **organized** in `R`.

The same data can be organized in different ways depending on the needs of the user. Some commonly used data structures include:

- `vector`
- `matrix`
- `data.frame`
- `list`
- `array`

---

# Vector

![:space 10]

```r
X <- c(1, 2, 4, 5, 44, 6, 10)
X
```

```
## [1]  1  2  4  5 44  6 10
```

```r
class(X)
```

```
## [1] "numeric"
```

```r
length(X)
```

```
## [1] 7
```
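---

# Vector

A vector holds values of a single type. If types are mixed, `R` silently coerces everything to the most flexible one. The object `mixed` below is a made-up example to show this.

```r
mixed <- c(1, "a", TRUE)
mixed          # "1" "a" "TRUE": everything became text
class(mixed)   # "character"

c(1, TRUE, FALSE)   # 1 1 0: logicals become numbers
```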

---

# Data Frame

![:space 10]

```r
data.frame(X)
```

```
##    X
## 1  1
## 2  2
## 3  4
## 4  5
## 5 44
## 6  6
## 7 10
```

---

# Matrix

![:space 10]

```r
matrix(X)
```

```
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    4
## [4,]    5
## [5,]   44
## [6,]    6
## [7,]   10
```

---

# List

![:space 10]

```r
list(X)
```

```
## [[1]]
## [1]  1  2  4  5 44  6 10
```
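---

# List

Unlike a vector, a list can hold elements of different types and lengths, and the elements can be named. A small sketch with an invented `record` object:

```r
record <- list(name = "Ada", scores = c(90, 85), passed = TRUE)
str(record)      # shows the type and content of each element
length(record)   # 3: one entry per element, not per value
record$scores    # 90 85
```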

---

# Array

![:space 10]

```r
# X has 7 values but the array has 8 cells, so R recycles X[1] to fill the last one
array(X,dim = c(2,2,2))
```

```
## , , 1
## 
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## 
## , , 2
## 
##      [,1] [,2]
## [1,]   44   10
## [2,]    6    1
```

---

# The point...

![:space 5]

- **Many ways to organize the same piece of information in `R`**
  + different data structures afford us different advantages and bring with them different limitations.
  
-  We need to understand the **_structure_** of a data object to understand how to **_access_** the information inside.

- As you become more acquainted with `R`, you'll see and use other types of data structures more often.

- We'll rely mainly on a special type of `data.frame` called a `tibble` data frame. (More on this next time!)

---

class:newsection

# Indexing

---

## Structure Matters

![:space 5]

One must understand the **structure** of a data object in order to systematically access the material contained within it.

![:space 5]

As we saw, `R` allows for **_many different ways of organizing information_**.

![:space 5]

To **_access information_** in a data structure, we'll need to rely on an **index** or **key**.

---

### Index

An **_index_** is a positive integer that denotes the position of a data value in a data structure.

![:space 2]

`R` uses a **_`1`-based_** index system, which means we start counting at 1 as the first position in the data structure.

![:space 2]

We use **_brackets_**, `[ ]`, to index data values. The brackets operator goes alongside the object name, `x[]`.

![:space 2]

We then **_insert an integer (or a vector of integers)_** denoting the position of the value(s) we want into the bracket.

---

### Index

Consider the following `vector` which contains five character data values.

Under the hood, there lies the _`1`-based index that references the position of each data point_.

```
vec <- c("A","B","C","D","E")
          ^   ^   ^   ^   ^
          1   2   3   4   5
```

To access any **_individual_** value, we simply need to reference its position.

```r
vec[3]
```

```
## [1] "C"
```

---

### Index

Consider the following `vector` which contains five character data values.

Under the hood, there lies the _`1`-based index that references the position of each data point_.

```
vec <- c("A","B","C","D","E")
          ^   ^   ^   ^   ^
          1   2   3   4   5
```

To access **_multiple_** values, we need to supply a vector of index positions.

```r
vec[c(1,3,5)]
```

```
## [1] "A" "C" "E"
```

---

### Index

Consider the following `vector` which contains five character data values.

Under the hood, there lies the _`1`-based index that references the position of each data point_.

```
vec <- c("A","B","C","D","E")
          ^   ^   ^   ^   ^
          1   2   3   4   5
```

If we reference a position that **_exceeds the bounds_**, `R` returns a missing value (`NA`).

```r
vec[6]
```

```
## [1] NA
```

---

### Index

Consider the following `vector` which contains five character data values.

Under the hood, there lies the _`1`-based index that references the position of each data point_.

```
vec <- c("A","B","C","D","E")
          ^   ^   ^   ^   ^
          1   2   3   4   5
```

A **_negative index_** tells `R` to **_exclude_** that data value while returning the rest.

```r
vec[-4]
```

```
## [1] "A" "B" "C" "E"
```
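---

### Index

The same bracket syntax accepts ranges, several exclusions at once, and repeated positions. A short sketch using the `vec` object from the previous slides:

```r
vec <- c("A", "B", "C", "D", "E")

vec[2:4]           # "B" "C" "D": a range of positions
vec[-c(1, 2)]      # "C" "D" "E": drop the first two values
vec[length(vec)]   # "E": the last value, whatever the length
vec[c(5, 1, 1)]    # "E" "A" "A": positions can be reordered and repeated
```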

---

### Indexing in two-dimensions

A `vector` is a **_one-dimensional_** data structure, so there is only 1 index to keep track of.

When accessing data points in a `data.frame` or `matrix`, we need to keep track of **_2 dimensions_**.

```
            1       2
            ^       ^
          var_1   var_2
 1 <       "a"     2.3
 2 <       "b"     1.2
 3 <       "c"     3.4
```

We need to keep track of two indices: one for the rows, and one for the columns.

---

<img src="lecture-week-02_intro-to-R_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
 
---

### Indexing in two-dimensions

Let's use a dataset built into `R` called `cars`. `R` ships with a number of such datasets for demonstration purposes.

Note that these data will not appear in the environment **_until we assign them to an object_**.

```r
data <- cars
class(data)
```

```
## [1] "data.frame"
```

---

### Indexing in two-dimensions

We can look at the **structure** of a data object by using the `str()` function.

```r
str(data)
```

```
## 'data.frame':	50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
```

The function `dim()` can tell us about the dimensions of a data object.

```r
dim(data)
```

```
## [1] 50  2
```

---

### Indexing in two-dimensions

We can look at the **structure** of a data object by using the `str()` function.

```r
str(data)
```

```
## 'data.frame':	50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
```

Or query the number of columns or rows directly with `ncol()`/`nrow()`.

```r
ncol(data)
```

```
## [1] 2
```

```r
nrow(data)
```

```
## [1] 50
```

---

### Indexing in two-dimensions

We can look at the **structure** of a data object by using the `str()` function.

```r
str(data)
```

```
## 'data.frame':	50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
```

![:space 5]

**The Point**: _one needs to know the dimensions of a relational data structure to look up values_.
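---

### Indexing in two-dimensions

With the dimensions known, values can be looked up as `data[row, column]`. Leaving one side of the comma empty means "all rows" or "all columns". A minimal sketch using the built-in `cars` data:

```r
data <- cars

data[1, 2]         # row 1, column 2: 2
data[3, ]          # every column of row 3
data[1:3, 2]       # rows 1 to 3 of column 2: 2 10 4
data[, 1]          # every row of column 1 (the speed values)
dim(data[1:5, ])   # 5 2: the first five rows, both columns
```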

---

## Keys

Some data structures (`data.frame`s and named `list`s) have **_keys_** that allow us to look up data values.

A **_key_** is a unique identifier (usually a character value) that we can use to look up data values.

For a `data.frame` these keys take the form of **_variable names_** that provide a unique identifier for each column.

We can look up these variable names using the `colnames()` function.

```r
colnames(data)
```

```
## [1] "speed" "dist"
```

---

## Looking up data values with keys

We can access a data object's keys using the `$` operator.

`$` acts as a **_handle_** by which we can look up all available keys and extract a specific data feature.

If we hit **Tab** after typing `$` after our data object's name, RStudio will offer a list of all available variables.

---

## Looking up data values with keys

![:space 10]

Here we call the `speed` variable from our dataset using the `$` and the variable name (key).

```r
data$speed
```

```
##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17 17 17 18 18
## [34] 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
```

---

## Looking up data values with keys

![:space 10]

We can also reference the key directly using the `[]` brackets operator and the key name.

```r
data[ , "speed"]
```

```
##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17 17 17 18 18
## [34] 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
```
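---

## Looking up data values with keys

Keys also work with double brackets, `[[ ]]`, and several keys can be supplied at once. The sketch below compares the forms; note that single brackets without a comma keep the `data.frame` structure.

```r
data <- cars

identical(data$speed, data[["speed"]])   # TRUE: both return the column as a vector
identical(data$speed, data[, "speed"])   # TRUE

class(data["speed"])     # "data.frame": single brackets keep the data frame
class(data[["speed"]])   # "numeric"

data[1:3, c("speed", "dist")]   # two keys at once, first three rows
```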

---

## Looking up data values with keys

We'll come across specialized data types, like the output from a **_linear model_**.

```r
m <- lm(dist ~ speed, data = data)
str(m)
```

```
## List of 12
## $ coefficients : Named num [1:2] -17.58 3.93
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "speed"
## $ residuals : Named num [1:50] 3.85 11.85 -5.95 12.05 2.12 ...
## ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
## $ effects : Named num [1:50] -303.914 145.552 -8.115 9.885 0.194 ...
## ..- attr(*, "names")= chr [1:50] "(Intercept)" "speed" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:50] -1.85 -1.85 9.95 9.95 13.88 ...
## ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:50, 1:2] -7.071 0.141 0.141 0.141 0.141 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:50] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "speed"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.14 1.27
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 48
## $ xlevels : Named list()
## $ call : language lm(formula = dist ~ speed, data = data)
## $ terms :Classes 'terms', 'formula' language dist ~ speed
## .. ..- attr(*, "variables")= language list(dist, speed)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "dist" "speed"
## .. .. .. ..$ : chr "speed"
## .. ..- attr(*, "term.labels")= chr "speed"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: 0x7fec4e6d43b0> 
## .. ..- attr(*, "predvars")= language list(dist, speed)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "dist" "speed"
## $ model :'data.frame':	50 obs. of 2 variables:
## ..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
## ..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language dist ~ speed
## .. .. ..- attr(*, "variables")= language list(dist, speed)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "dist" "speed"
## .. .. .. .. ..$ : chr "speed"
## .. .. ..- attr(*, "term.labels")= chr "speed"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: 0x7fec4e6d43b0> 
## .. .. ..- attr(*, "predvars")= language list(dist, speed)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "dist" "speed"
## - attr(*, "class")= chr "lm"
```

---

## Looking up data values with keys

We'll come across specialized data types, like the output from a **_linear model_**.

But don't worry, these are just **_named lists_** that use keys as indices.

```r
names(m)
```

```
##  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"       
##  [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"
```

We can use those names to look up specific output from what looks like a complex object. It's that easy.

```r
m$coefficients
```

```
## (Intercept)       speed 
##  -17.579095    3.932409
```
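---

## Looking up data values with keys

The same keys work with double brackets, and many model objects also offer accessor functions such as `coef()`. A brief sketch that refits the model on the built-in `cars` data so it runs on its own:

```r
m <- lm(dist ~ speed, data = cars)

m[["coefficients"]]["speed"]   # look up a single coefficient by key
coef(m)                        # accessor function, same values as m$coefficients
head(m$residuals, 3)           # first three residuals
```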

---

class:newsection

# Operators

---

## Mathematical Operators

Broadly speaking, `R` functions as a general calculator that can process a variety of data types.

As we can see, most operators in `R` are the usual suspects, but some forms are particular to `R`.

.center[
| Operation       | Calc      | Out    |
|-----------------|-----------|--------|
| Addition        | `3 + 4`   | `7`    |
| Subtraction     | `3 - 4`   | `-1`   |
| Multiplication  | `3 * 4`   | `12`   |
| Division        | `3 / 4`   | `0.75` |
| Exponentiation  | `3 ^ 4`   | `81`   |
| Modulo          | `4 %% 3`  | `1`    |
]

In the example, we'll walk through a few more operators.
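---

## Mathematical Operators

A few more operators and the order in which `R` evaluates them. This selection is chosen here for illustration and is not an exhaustive list.

```r
7 %/% 2       # 3: integer division
7 %% 2        # 1: the remainder of that division
2 + 3 * 4     # 14: multiplication happens before addition
(2 + 3) * 4   # 20: parentheses set the order explicitly
-2 ^ 2        # -4: exponentiation binds tighter than the minus sign
sqrt(16)      # 4
```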

---

# Mathematical Functions

![:space 10]

There is a range of functions designed to ease mathematical calculations. Some of these functions calculate specific values, such as the **natural log** or the **exponential function** ($e^a$, where $e$ is Euler's number).

```r
log(4)
```

```
## [1] 1.386294
```

```r
exp(5)
```

```
## [1] 148.4132
```

---

![:space 10]

There is a range of functions designed to ease mathematical calculations. Others can be used to find the **sum** of a numerical vector, the **mean**, or the **median**.

```r
x <- c(1,3,7,100)
sum(x)
```

```
## [1] 111
```

```r
mean(x)
```

```
## [1] 27.75
```

```r
median(x)
```

```
## [1] 5
```

---

# Logical Operators

Boolean statements (i.e. true/false statements) are central to any computer programming environment. Boolean statements allow us to make quick conditional evaluations, which are key to **subsetting** data.

The following outlines the various types of boolean statements available.

```r
x <- 1; y <- 5 # example values so the comparisons below run

x == y # equal to
x != y # not equal to
x >= y # greater than or equal to
x <= y # less than or equal to
x > y # greater than
x < y # less than

Statements can be combined using the **and** (`&`) and **or** (`|`) operators to make more specific queries.

```r
x==1 & y==5 # "and" conditional statements
x==1 | y==5 # "or" conditional statements
```

---

![:space 10]

Boolean statements can be fed directly into data objects via the brackets method `[]`. This offers a powerful and simple way to subset data.

```r
x <- c(1,33,100,.6,5,77)
x
```

```
## [1]   1.0  33.0 100.0   0.6   5.0  77.0
```

```r
x[x > 30]
```

```
## [1]  33 100  77
```
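---

Conditions can be combined or negated inside the brackets, and `which()` reports the positions that match instead of the values. A short sketch using the same `x`:

```r
x <- c(1, 33, 100, .6, 5, 77)

x[x > 30 & x < 90]   # 33 77: both conditions must hold
x[x < 1 | x > 90]    # 100.0 0.6: either condition may hold
x[!(x > 30)]         # 1.0 0.6 5.0: ! negates the condition
which(x > 30)        # 2 3 6: positions of the matches
```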

---

![:space 10]

There are also a number of base functions that provide useful boolean evaluations. Here are just a few examples...

```r
is.character("hello") # for class
```

```
## [1] TRUE
```

```r
all(c(T,F,F)) # are all entries True?
```

```
## [1] FALSE
```

```r
identical(1+1,2) # are these entries the same?
```

```
## [1] TRUE
```

---

![:space 10]

Finally, boolean statements have a nice property in `R`. If we convert a boolean statement to a **numeric class**, `TRUE` values convert to `1` and `FALSE` values convert to `0`.

This offers us a quick way of generating **dichotomous** values.

```r
x <- 1:10
x >= 5
```

```
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
```

```r
as.numeric(x >= 5)
```

```
##  [1] 0 0 0 0 1 1 1 1 1 1
```
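---

Because `TRUE` counts as `1` and `FALSE` as `0`, summing or averaging a boolean statement gives a count or a share directly. A minimal sketch:

```r
x <- 1:10

sum(x >= 5)    # 6: how many values satisfy the condition
mean(x >= 5)   # 0.6: the share of values that satisfy it
```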

---

## Subsetting with logical operators

We can combine what we know about logical operators with accessing the rows and columns of a `data.frame` to powerful effect.

```r
d <- data.frame(x = c(100,200,300,400),
                y = c("a","b","b","a"))
d
```

```
##     x y
## 1 100 a
## 2 200 b
## 3 300 b
## 4 400 a
```

```r
d[ d$x > 200, ]
```

```
##     x y
## 3 300 b
## 4 400 a
```
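---

## Subsetting with logical operators

The condition goes in the row position, and a column index or key can follow the comma. The sketch below rebuilds the same small `d` so that it runs on its own.

```r
d <- data.frame(x = c(100, 200, 300, 400),
                y = c("a", "b", "b", "a"))

d[d$y == "a", ]               # rows 1 and 4
d[d$x > 100 & d$y == "a", ]   # row 4 only: both conditions must hold
d[d$y == "b", "x"]            # 200 300: matching rows, one column
```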

---

# Functions

---

# What are functions?

![:space 10]

A **function** is a type of object in `R` that can perform a specific task. Unlike objects that hold data, functions take **arguments** as input and output some manipulated form of the input data.

A function is specified first with the object name and then parentheses. For example, the function `log()` calculates the natural log of any number placed inside the parentheses.

```r
log(4)
```

```
## [1] 1.386294
```
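---

# What are functions?

Most functions accept more than one argument. Arguments can be matched by name or by position, and many have defaults (`log()` uses the natural log unless told otherwise). A small sketch:

```r
log(8, base = 2)      # 3: the base is set by a named argument
log(100, base = 10)   # 2

round(3.14159, digits = 2)   # 3.14
round(3.14159, 2)            # 3.14: same call, argument matched by position
```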

---

# Where are functions exactly?

![:space 10]

Functions operate in the **background**.

There are a number of functions in `R`, known as **base functions**, that are available as soon as you start `R`.

When we need to do things that are **not** a part of the base functionality, we can import new functions by installing **packages**. But more on this later.

---

# Some common functions

![:space 5]

We've already come across a few functions, and we'll learn a lot more moving forward. Just keep in mind that whenever a name is followed by parentheses `()`, it's a function call.

Here are examples of a few common base functions that we'll see.
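---

# Some common functions

A handful of base functions that come up often. The selection and the example vector `x` are chosen here for illustration only.

```r
x <- c(4, 8, 15, 16, 23, 42, 8)

length(x)            # 7
sort(x)              # 4 8 8 15 16 23 42
rev(x)               # 8 42 23 16 15 8 4
unique(x)            # 4 8 15 16 23 42
range(x)             # 4 42
round(mean(x), 2)    # 16.57
seq(1, 10, by = 3)   # 1 4 7 10
paste("Week", 2)     # "Week 2"
```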

---

## Figuring out what a function does...

Functions in `R` come with documentation describing how the function works, the inputs it requires, and example code. We can access this documentation by placing `?` in front of the function's name.

```r
?c
help("c") # equivalent
```

---

class:newsection

# Packages

---

## R Packages

![:space 10]

There are a number of `packages` that are supplied with the R distribution. These are known as "[base packages](https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html)" and they are available the moment one starts an R session.

- A **`package`** is a set of functions and programs that perform specific tasks.

- By installing packages, **we introduce new forms of functionality to the R environment**.

---

## R Packages

![:space 10]

To use the content in a package, one first needs to **install it**. One can do this with the function `install.packages()`. By inserting the name of a specific package, we connect to a CRAN "mirror" and download the package.

```r
install.packages("tidyverse")
```

The version of that package is then saved on your computer and can be called at any time (on or offline).

---

## R Packages

![:space 10]

Once installed, it's on the system for good. You can then reference or load the package any time you wish to use a function from it.

There are two functions we can use to load a package: `library()` and `require()`.

```r
library(tidyverse)

# or

require(tidyverse)
```

> You must load the package before you can call its functions by name alone.
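---

## R Packages

A function can also be called without loading its package by using the `package::function()` form. The sketch below uses `stats`, which ships with `R`, so it runs on any installation; `notARealPackage` is a made-up name used only to show the failure case.

```r
stats::median(c(1, 3, 7, 100))   # 5: no library() call needed

# library() stops with an error if a package is missing, while require()
# returns FALSE with a warning. requireNamespace() checks quietly.
requireNamespace("stats", quietly = TRUE)             # TRUE
requireNamespace("notARealPackage", quietly = TRUE)   # FALSE
```

The `::` form is handy when only one function from a package is needed, or when two loaded packages define a function with the same name.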

---

`R Studio` also offers us a way to install packages through the interface.

If we click on the `Packages` tab and then click `Install`, we can download a package by typing its name.

---

We can then **load** the package from RStudio by checking the box beside the package's name.

---

Sometimes one has _a lot_ of packages running simultaneously.

No problem: we can see what packages are up and running by typing `sessionInfo()` into the console.

This will tell us everything about the version of R and the packages we are using to run our analysis.

```r
sessionInfo()
```

```
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.5
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## Random number generation:
##  RNG:     Mersenne-Twister 
##  Normal:  Inversion 
##  Sample:  Rounding 
##  
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] shiny_1.4.0     bindrcpp_0.2.2  forcats_0.5.0   stringr_1.4.0   dplyr_1.0.0     purrr_0.3.4    
##  [7] readr_1.3.1     tidyr_1.1.0     tibble_3.0.3    ggplot2_3.3.2   tidyverse_1.3.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5       lubridate_1.7.4  assertthat_0.2.1 digest_0.6.25    packrat_0.5.0   
##  [6] mime_0.9         R6_2.4.1         cellranger_1.1.0 backports_1.1.7  reprex_0.3.0    
## [11] evaluate_0.14    httr_1.4.1       xaringan_0.14    pillar_1.4.4     rlang_0.4.7     
## [16] readxl_1.3.1     rstudioapi_0.11  rmarkdown_2.3    servr_0.15       munsell_0.5.0   
## [21] broom_0.7.0      compiler_3.6.2   httpuv_1.5.2     modelr_0.1.6     xfun_0.12       
## [26] pkgconfig_2.0.3  htmltools_0.4.0  tidyselect_1.1.0 codetools_0.2-16 fansi_0.4.1     
## [31] crayon_1.3.4     dbplyr_1.4.2     withr_2.2.0      later_1.1.0.1    grid_3.6.2      
## [36] jsonlite_1.7.0   xtable_1.8-4     gtable_0.3.0     lifecycle_0.2.0  DBI_1.1.0       
## [41] magrittr_1.5     scales_1.1.0     cli_2.0.2        stringi_1.4.6    farver_2.0.3    
## [46] fs_1.3.2         promises_1.1.0   xml2_1.2.2       ellipsis_0.3.1   generics_0.0.2  
## [51] vctrs_0.3.1      tools_3.6.2      glue_1.4.1       hms_0.5.3        rsconnect_0.8.16
## [56] fastmap_1.0.1    yaml_2.2.1       colorspace_1.4-1 rvest_0.3.5      knitr_1.28      
## [61] bindr_0.1.1      haven_2.3.1
```

---

## Remember to Load Your Package!

If you ever try to run a function and you get the following error...

`Error: could not find function "qplot"`

...it's likely you forgot to load the package.

```r
require(ggplot2) # First load the package
qplot() # Then run the function
# Voilà!
```

---

class:newsection

# Importing & Exporting Data

---

`R` allows you to import a large variety of datasets into the environment. However, `R`'s base packages only support a few file formats.

No fear: there is almost always an **external package** that can do the job!

We are going to focus on **three packages** to import different data types:

- `readr` --- fast functions for reading delimited text files (e.g. `.csv`, `.tsv`)
- `readxl` --- for Excel spreadsheets
- `haven` --- for SPSS, SAS, and Stata (`.dta`) files

---

First, we need to **install** these packages onto our computer.

```r
install.packages("readr")
install.packages("readxl")
install.packages("haven")
```

And then **load** them into our current `R` Session.

```r
require(readr)
require(readxl)
require(haven)
```

---

# Importing data

Here we will review how to import five separate data types:
- `.dta` --- Stata file format
- `.csv` --- comma-separated file format
- `.sav` --- SPSS file format
- `.xlsx` --- standard Excel file format
- `.Rdata` --- R's file format

---

# .dta

![:space 10]

For all versions of STATA

```r
require(haven)
data <- read_dta(file = "data.dta")
```
 
Other packages:
- `readstata13`
- `foreign`

---

# .csv

`read.csv()` and `read.table()` are both **base functions** in `R`.

```r
data <- read.csv(file = "data.csv",
                 stringsAsFactors = FALSE)

# Or
data <- read.table(file = "data.csv",
                   header = TRUE,
                   sep = ",",
                   stringsAsFactors = FALSE)
```

These functions have specific **arguments** that we are referencing:
- `stringsAsFactors`: when false, `character` columns are kept as text instead of being converted to `factors` (false is the default from `R` 4.0.0 onward).
- `header`: when true, the first row of the file is read as the column names.
- `sep`: the character that separates entries, here a comma.
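---

# .csv

To see what these arguments do without needing a data file, one can write a small table to a temporary file and read it back. This is only a sketch: `toy`, `back`, and `back_f` are made-up names, and `tempfile()` creates a throwaway path.

```r
tmp <- tempfile(fileext = ".csv")   # a throwaway file for the demo
toy <- data.frame(id = 1:3, group = c("a", "b", "a"))
write.csv(toy, file = tmp, row.names = FALSE)

back <- read.csv(file = tmp, stringsAsFactors = FALSE)
str(back)             # id is an integer column, group a character column
class(back$group)     # "character"

back_f <- read.csv(file = tmp, stringsAsFactors = TRUE)
class(back_f$group)   # "factor"
```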

---

# .csv

![:space 10]

The `readr` package provides a much simpler approach.

```r
require(readr)
data <- read_csv("data.csv")
```
- `character` columns aren't converted to `factors`.
- More efficient as $N$ (the number of rows) increases.

---

# .sav

![:space 10]

For `SPSS` and `SAS` file formats, the `haven` package offers a simple way of reading in data.

```r
require(haven)
data <- read_sav(file = "data.sav") # SPSS
```

---

# .xlsx

![:space 10]

```r
require(readxl)
data <- read_excel("data.xlsx")
```

We can even read from a specific sheet.

```r
excel_sheets("data.xlsx") # list avail. sheets
```
    [1] "Sheet1" "Sheet2"

```r
data <- read_excel("data.xlsx",
                   sheet = 'Sheet1')
```

---

# .Rdata

![:space 10]

`.Rdata` is `R`'s native file format. It saves `R` `objects` and loads them back under their original names.

```r
load(file='data.Rdata')
```

---

# Importing Data Using R Studio

There is also a point-and-click option for importing data in RStudio.

Go to the `Environment` tab and click `Import Dataset`.

---

# Exporting data

Exporting data is the same process in reverse. Instead of **reading** the data, we want to **write** a new version of it.

There are a series of functions (each provided by their respective packages) that allow us to do just that.

Each requires that you input the **data** that you're looking to export and specify the **file name** and **path** to tell the computer where the file is going.

---

# Exporting data

![:space 10]

```r
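# The file path is passed by position, as the second argument
write_dta(data, "data.dta")        # haven: Stata

write_csv(data, "data.csv")        # readr: comma-separated

write_sav(data, "data.sav")        # haven: SPSS

write_sas(data, "data.sas7bdat")   # haven: SAS

write_tsv(data, "data.tab")        # readr: tab-separated

# etc.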
```

---

# .Rdata

![:space 10]

`.Rdata` offers two options to save data. We can either save a single data object, or save the entire workspace.

```r
# Save just an object
save(data, file="data.Rdata")

# Save the entire workspace
save.image(file="workspace.Rdata")
```
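---

# .Rdata

A save-and-load round trip shows that `load()` restores the object under the name it had when it was saved. This is a sketch using a temporary file; `scores` is a made-up object.

```r
tmp <- tempfile(fileext = ".Rdata")   # a throwaway file for the demo
scores <- c(90, 85, 77)
save(scores, file = tmp)

rm(scores)          # drop the object from the environment
load(file = tmp)    # restores it under its original name
scores              # 90 85 77
```

Because `load()` reuses the saved names, it will overwrite any object in the environment that happens to share one of them.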

---

# But where is my data exactly?

---

# But where is my data exactly?

![:space 10]

`R` doesn't intuitively know where your data is. If the data is in a special folder entitled "`my_data`", we have to tell `R` how to get there.

We can do this three ways:

1. Set the **working directory** to that folder
2. Set the directory via a point-and-click option in RStudio
3. Establish the **path** directly to the file

---

# Setting the Working Directory

![:space 10]

Every time `R` boots up, it does so in the same place, unless we tell it to go somewhere else.

We can find out which directory we are in by using the `getwd()` function.

```r
getwd() # Get the current working directory
```
    /Users/edunford/

---

# Setting the Working Directory

![:space 10]

Every time `R` boots up, it does so in the same place, unless we tell it to go somewhere else.

We can then set a new working directory by establishing the path to the folder we want to work in as a **string** in the function `setwd()`.

```r
setwd("/Users/edunford/Desktop/my_data")
getwd()
```
     /Users/edunford/Desktop/my_data/

---

# Setting the WD via R Studio

![:space 10]

R Studio also makes setting the working directory really easy.

Click: `Session` &rarr; `Set Working Directory` &rarr; `Choose Directory...`

This will allow you to set the working directory quickly. The downside is that you have to do it **manually every time you return to this project**. If you instead write the step into a script, your work is easier to replicate (for you and for others).

---

# Establishing the Path

![:space 10]

Finally, we can also just point directly to the data by outlining the specific path.

Here we are assigning a string containing our **path** to the object `path`.

```r
path <- "~/Desktop/my_data/data.csv"
```

We then load the data using that path.

```r
data <- read.csv(path) # assign the result so it is kept
```
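---

# Establishing the Path

Paths can also be assembled piece by piece with `file.path()`, and `file.exists()` can confirm the file is there before reading it. A minimal sketch that reuses the example folder from these slides (the folder itself is assumed, so the read step is guarded):

```r
folder <- file.path("~", "Desktop", "my_data")
path <- file.path(folder, "data.csv")
path   # "~/Desktop/my_data/data.csv"

if (file.exists(path)) {
  data <- read.csv(path)
} else {
  message("No file found at ", path)
}
```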

---

# Beyond Working Directories

![:space 10]

Working directories are limiting:

- If files are **moved** or **renamed**, a script won't run.

- Analyses cannot be easily transported across computers or users.

---

# Beyond Working Directories

The solutions:

1. **R Projects**

2. **R Projects** + the package [`here`](https://github.com/jennybc/here_here)
  - To easily move around the project's subfolders and files
  - `here()` works like `file.path()`, but where the path root is implicitly set to “the path to the top-level of my current project”.
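---

# Beyond Working Directories

A rough sketch of how this might look in practice. It assumes the `here` package is installed and that the project contains a `my_data` folder (both are assumptions); if the package is missing, the code falls back to a path built from the working directory.

```r
if (requireNamespace("here", quietly = TRUE)) {
  # Path is built from the project's top-level folder
  csv_path <- here::here("my_data", "data.csv")
} else {
  # Fallback: build the path from the current working directory
  csv_path <- file.path(getwd(), "my_data", "data.csv")
}

csv_path                # full path ending in my_data/data.csv
file.exists(csv_path)   # check before calling read.csv(csv_path)
```

Because the path is rebuilt from the project root each time, the same script keeps working when the whole project folder is moved or shared.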