class: center, middle, inverse, title-slide #
PPOL670 | Introduction to Data Science for Public Policy
Week 2
Introduction to Programming in
R
###
Prof. Eric Dunford ◆ Georgetown University ◆ McCourt School of Public Policy ◆
eric.dunford@georgetown.edu
--- layout: true <div class="slide-footer"><span> PPOL670 | Introduction to Data Science for Public Policy           Week 2 <!-- Week of the Footer Here -->              Introduction to R <!-- Title of the lecture here --> </span></div> --- class: outline # Outline for Today ![:space 3] - **_Objects_** - **_Data Structures_** and **_Accessing_** Data Points - Mathematical and Logical **_Operators_** - **_Functions_** and **_Packages_** - **_Importing_** and **_Exporting_** data - **_Working directory_** and **_Projects_** --- class: newsection # Objects --- # Objects ![:space 10] R uses a specific set of rules to govern how it looks up values in the environment. We manage data by assigning it a name, and referencing that name when we need to use the information again. Officially, this is called `lexical scoping`, which comes from the computer science term "[lexing](https://en.wikipedia.org/wiki/Lexical_analysis)". Lexing is the process by which text represents meaningful pieces of information that the programming language understands. --- # Assigning an Object ![:space 10] In simple terms, an `object` is a bit of text that represents a specific value. ```r x <- 3 x ``` ``` ## [1] 3 ``` Here we've assigned the value `3` to the letter `x`. Whenever we type `x`, `R` understands that we really mean `3`. --- # Assigning an Object ![:space 5] There are three standard assignment operators: - `<-` - `=` - `assign()` "Best practice" is to use the `<-` assignment operator. ```r x1 <- 3 x2 = 3 assign("x3",3) c(x1, x2, x3) ``` ``` ## [1] 3 3 3 ``` --- # Assigning an Object ![:space 5] Note that lexical scoping is flexible. Objects can be written and re-written when necessary. ```r object <- 5 object ``` ``` ## [1] 5 ``` ```r object <- "A Very Vibrant Shade of Purple" object ``` ``` ## [1] "A Very Vibrant Shade of Purple" ``` <br> > Down the road it will help to give objects <u>meaningful names</u>! 
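---

# Assigning an Object

![:space 5]

As a small sketch of the points above, the chunk below uses the three assignment forms and then re-assigns an object. The object names (`budget`, `n_staff`, `agency`) and their values are made up for illustration and are not part of the course data.

```r
# Hypothetical object names, used only for illustration
budget <- 250              # the preferred assignment operator
n_staff = 12               # also works, but `<-` is the convention
assign("agency", "HUD")    # assign() takes the object name as a string

budget / n_staff           # objects stand in for their values

# Re-assigning overwrites the old value (and can change the class)
budget <- "two hundred fifty"
class(budget)
```

Note that `assign()` takes the object name as a quoted string, and that re-assigning `budget` replaces both its value and its class without any warning.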
---

# Objects

One can see all the objects in the environment by either looking at the user interface in RStudio (specifically, the **environment tab**)...

.center[<img src="Figures/environment.png">]

---

# Objects

![:space 10]

One can see all the objects in the environment by either looking at the user interface in RStudio (specifically, the **environment tab**)... or by typing `ls()` in the console.

```r
ls()
```

```
## [1] "binary"  "message" "object"  "owd"     "x"       "x1"      "x2"      "x3"
```

---

# Object Classes

![:space 10]

Once assigned, an object has a **class**. A class describes the properties of the **data type** or **data structure** assigned to an object.

We can use the function `class()` to find out what kind of data type or structure our object is.

```r
class(x)
```

```
## [1] "numeric"
```

The object `x` is of class `numeric`, i.e. a number.

---

# Object Classes

There are [many classes](https://www.tutorialspoint.com/r/r_data_types.htm) that an object can take.

```r
obj1 <- "This is a sentence"
obj2 <- TRUE
obj3 <- factor("This is a sentence")
c(class(obj1),class(obj2),class(obj3))
```

```
## [1] "character" "logical"   "factor"
```

<br>
<br>

--

> Understanding what class of object one is dealing with is important --- as it will determine what kind of manipulations one can do or what functions an object will work with.

---

# Object Classes

![:space 10]

As noted, there are many different **data types** in `R`. We will primarily run into the following types:

<br>
<br>

.center[
| Type | Example |
|---|---|
| Integer | `7L` |
| Numeric/Double | `4.56` |
| Character | `"Hello!"` |
| Logical | `TRUE` |
| Factor | `"cat" (1)` |
]

---

# Object Coercion

![:space 10]

When need be, an object can be **coerced** to be a different class.

```r
x
```

```
## [1] 3
```

```r
as.character(x)
```

```
## [1] "3"
```

Here we coerced the value stored in `x` (the number `3`) into a character: the result is a string with the text "3". Note that `x` itself stays numeric unless we assign the result back, e.g. `x <- as.character(x)`.
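---

# Object Coercion

![:space 10]

The sketch below illustrates a detail that is easy to miss: a coercion function returns a new value and leaves the original object alone, so the result has to be assigned to be kept. The name `x_chr` is a hypothetical one chosen for this example.

```r
x <- 3
as.character(x)            # prints "3", but x itself is unchanged
class(x)                   # still "numeric"

x_chr <- as.character(x)   # hypothetical name; assigning keeps the result
class(x_chr)               # "character"

as.numeric("3")            # a character can be turned back into a number
as.integer(4.56)           # drops the decimal part, giving 4

# Text that is not a number cannot be coerced and becomes NA
suppressWarnings(as.numeric("three"))
```

Coercions that cannot succeed do not stop the script; they return `NA` with a warning, so it is worth checking for missing values afterwards.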
---

# Removing objects from the Environment

![:space 10]

We often want to get rid of objects after creating them. To **delete** (or drop) an object from the environment, use the function `rm()` -- which stands for "remove".

```r
ls()
```

```
##  [1] "binary"  "message" "obj1"    "obj2"    "obj3"    "object"  "owd"     "x"       "x1"      "x2"
## [11] "x3"
```

```r
rm(x, x1, x2, x3)
ls()
```

```
## [1] "binary"  "message" "obj1"    "obj2"    "obj3"    "object"  "owd"
```

---

# Clearing the Environment

We can also remove **<u>_all_</u>** objects from the environment at once by typing the following command.

```r
rm(list = ls(all.names = TRUE))
```

Or we can do so from R Studio by clicking on the `broom icon`.

.center[<img src="Figures/clearing_envir.png" width=500>]

---

# Objects: So what's the point?

![:space 5]

**Objects** offer a way to **reference different data**. This means that we can play around with _a lot_ of different data types simultaneously. This makes it easier to:

--

+ **manage** and use multiple datasets at the same time
+ **extract** and manipulate single variables
+ **work** with little bits of data at a time to make sure your calculations work.

--

> Note that the only way to <u>hold onto information</u> is to assign it as an object! Otherwise the information is printed but instantly forgotten by `R`.

---
class:newsection

# Data Structures

---

## Data Structures

![:space 10]

There are also many ways data can be **organized** in `R`. The same object can be organized in different ways depending on the needs of the user.
Some commonly used data structures include:

- `vector`
- `matrix`
- `data.frame`
- `list`
- `array`

---

# Vector

![:space 10]

```r
X <- c(1, 2, 4, 5, 44, 6, 10)
X
```

```
## [1]  1  2  4  5 44  6 10
```

```r
class(X)
```

```
## [1] "numeric"
```

```r
length(X)
```

```
## [1] 7
```

---

# Data Frame

![:space 10]

```r
data.frame(X)
```

```
##    X
## 1  1
## 2  2
## 3  4
## 4  5
## 5 44
## 6  6
## 7 10
```

---

# Matrix

![:space 10]

```r
matrix(X)
```

```
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    4
## [4,]    5
## [5,]   44
## [6,]    6
## [7,]   10
```

---

# List

![:space 10]

```r
list(X)
```

```
## [[1]]
## [1]  1  2  4  5 44  6 10
```

---

# Array

![:space 10]

```r
array(X,dim = c(2,2,2))
```

```
## , , 1
## 
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## 
## , , 2
## 
##      [,1] [,2]
## [1,]   44   10
## [2,]    6    1
```

---

# The point...

![:space 5]

- **Many ways to organize the same piece of information in `R`**
    + different data structures afford us different advantages and bring with them different limitations.
- We need to understand the **_structure_** of a data object to understand how to **_access_** the information inside.
- As you become more acquainted with `R`, you'll see and use other types of data structures more often.
- We'll rely mainly on a special type of `data.frame` called a `tibble` data frame. (More on this next time!)

---
class:newsection

# Indexing

---

## Structure Matters

![:space 5]

One must understand the **structure** of a data object in order to systematically access the material contained within it.

![:space 5]

As we saw, `R` allows for **_many different ways of organizing information_**.

![:space 5]

To **_access information_** in a data structure, we'll need to rely on an **index** or **key**.

---

### Index

An **_index_** is a positive integer that denotes the position of a data value in a data structure.

![:space 2]

`R` uses a **_`1`-based_** index system, which means we start counting at 1 as the first position in the data structure.
![:space 2]

We use **_brackets_**, `[ ]`, to index data values. The brackets operator goes alongside the object name, `x[]`.

![:space 2]

We then **_insert an integer (or a vector of integers)_** denoting the position of the value(s) we want into the bracket.

---

### Index

Consider the following `vector` which contains five character data values. Under the hood, there lies the _`1`-based index that references the position of each data point_.

```
vec <- c("A","B","C","D","E")
          ^   ^   ^   ^   ^
          1   2   3   4   5
```

To access any **_individual_** value, we simply need to reference its position.

```r
vec[3]
```

```
## [1] "C"
```

---

### Index

Consider the following `vector` which contains five character data values. Under the hood, there lies the _`1`-based index that references the position of each data point_.

```
vec <- c("A","B","C","D","E")
          ^   ^   ^   ^   ^
          1   2   3   4   5
```

To access **_multiple_** values, we need to supply a vector of index positions.

```r
vec[c(1,3,5)]
```

```
## [1] "A" "C" "E"
```

---

### Index

Consider the following `vector` which contains five character data values. Under the hood, there lies the _`1`-based index that references the position of each data point_.

```
vec <- c("A","B","C","D","E")
          ^   ^   ^   ^   ^
          1   2   3   4   5
```

If we reference a position that **_exceeds the bounds_**, `R` returns a missing value (`NA`).

```r
vec[6]
```

```
## [1] NA
```

---

### Index

Consider the following `vector` which contains five character data values. Under the hood, there lies the _`1`-based index that references the position of each data point_.

```
vec <- c("A","B","C","D","E")
          ^   ^   ^   ^   ^
          1   2   3   4   5
```

A **_negative index_** tells `R` to **_exclude_** that data value while returning the rest.

```r
vec[-4]
```

```
## [1] "A" "B" "C" "E"
```

---

### Indexing in two-dimensions

A `vector` is a **_one-dimensional_** data structure, so there is only 1 index to keep track of.
When accessing data points in a `data.frame` or `matrix`, we need to keep track of **_2 dimensions_** ``` 1 2 ^ ^ var_1 var_2 1 < "a" 2.3 2 < "b" 1.2 3 < "c" 3.4 ``` We need to keep track of two indices: one for the rows, and one for the columns .center[ `data[`<font color = 'red'>`row`</font>`,`<font color = 'blue'>`column`</font>`]` ] --- <img src="lecture-week-02_intro-to-R_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> --- <img src="lecture-week-02_intro-to-R_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> --- <img src="lecture-week-02_intro-to-R_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" /> --- <img src="lecture-week-02_intro-to-R_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" /> --- <img src="lecture-week-02_intro-to-R_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" /> --- <img src="lecture-week-02_intro-to-R_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" /> --- <img src="lecture-week-02_intro-to-R_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" /> --- <img src="lecture-week-02_intro-to-R_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" /> <img src="lecture-week-02_intro-to-R_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" /> --- ### Indexing in two-dimensions Let's use a dataset inherent to `R` called `cars`. There are a number of datasets that are built into `R`. These are for demonstration purposes. Note that these data will not appear in the environment **_until we assign them to an object_**. ```r data <- cars class(data) ``` ``` ## [1] "data.frame" ``` --- ### Indexing in two-dimensions We can look at the **structure** of a data object by using the `str()` function. ```r str(data) ``` ``` ## 'data.frame': 50 obs. 
of 2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
```

--

The function `dim()` can tell us about the dimensions of a data object.

```r
dim(data)
```

```
## [1] 50  2
```

---

### Indexing in two-dimensions

We can look at the **structure** of a data object by using the `str()` function.

```r
str(data)
```

```
## 'data.frame': 50 obs. of 2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
```

Or query the number of columns or rows directly with `ncol()`/`nrow()`.

```r
ncol(data)
```

```
## [1] 2
```

```r
nrow(data)
```

```
## [1] 50
```

---

### Indexing in two-dimensions

We can look at the **structure** of a data object by using the `str()` function.

```r
str(data)
```

```
## 'data.frame': 50 obs. of 2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
```

![:space 5]

**The Point**: _one needs to know the dimensions of a relational data structure to look up values_.

---

## Keys

Some data structures (`data.frame`s and named `list`s) have **_keys_** that allow us to look up data values.

**_Keys_** are unique identifiers (usually character values) that we can use to look up data values.

--

For a `data.frame` these keys take the form of **_variable names_** that provide a unique identifier for each column.

We can look up these variable names using the `colnames()` function.

```r
colnames(data)
```

```
## [1] "speed" "dist"
```

---

## Looking up data values with keys

We can access a data object's keys using the `$` operator.

`$` acts as a **_handle_** by which we can look up all available keys and extract a specific data feature.

If we hit **Tab** after specifying the `$` after our data object, R Studio will offer a list of all available variables.
.center[<img src="Figures/sign_in.png" align="middle">] --- ## Looking up data values with keys ![:space 10] Here we call the `speed` variable from our dataset using the `$` and the variable name (key). ```r data$speed ``` ``` ## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17 17 17 18 18 ## [34] 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25 ``` --- ## Looking up data values with keys ![:space 10] We can also reference the key directly using the `[]` brackets operator and the key name. ```r data[ , "speed"] ``` ``` ## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17 17 17 18 18 ## [34] 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25 ``` --- ## Looking up data values with keys We'll come across specialized data types, like the output from a **_linear model_**. ```r m <- lm(dist ~ speed, data = data) str(m) ``` ``` ## List of 12 ## $ coefficients : Named num [1:2] -17.58 3.93 ## ..- attr(*, "names")= chr [1:2] "(Intercept)" "speed" ## $ residuals : Named num [1:50] 3.85 11.85 -5.95 12.05 2.12 ... ## ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ... ## $ effects : Named num [1:50] -303.914 145.552 -8.115 9.885 0.194 ... ## ..- attr(*, "names")= chr [1:50] "(Intercept)" "speed" "" "" ... ## $ rank : int 2 ## $ fitted.values: Named num [1:50] -1.85 -1.85 9.95 9.95 13.88 ... ## ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ... ## $ assign : int [1:2] 0 1 ## $ qr :List of 5 ## ..$ qr : num [1:50, 1:2] -7.071 0.141 0.141 0.141 0.141 ... ## .. ..- attr(*, "dimnames")=List of 2 ## .. .. ..$ : chr [1:50] "1" "2" "3" "4" ... ## .. .. ..$ : chr [1:2] "(Intercept)" "speed" ## .. 
..- attr(*, "assign")= int [1:2] 0 1 ## ..$ qraux: num [1:2] 1.14 1.27 ## ..$ pivot: int [1:2] 1 2 ## ..$ tol : num 1e-07 ## ..$ rank : int 2 ## ..- attr(*, "class")= chr "qr" ## $ df.residual : int 48 ## $ xlevels : Named list() ## $ call : language lm(formula = dist ~ speed, data = data) ## $ terms :Classes 'terms', 'formula' language dist ~ speed ## .. ..- attr(*, "variables")= language list(dist, speed) ## .. ..- attr(*, "factors")= int [1:2, 1] 0 1 ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:2] "dist" "speed" ## .. .. .. ..$ : chr "speed" ## .. ..- attr(*, "term.labels")= chr "speed" ## .. ..- attr(*, "order")= int 1 ## .. ..- attr(*, "intercept")= int 1 ## .. ..- attr(*, "response")= int 1 ## .. ..- attr(*, ".Environment")=<environment: 0x7fec4e6d43b0> ## .. ..- attr(*, "predvars")= language list(dist, speed) ## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric" ## .. .. ..- attr(*, "names")= chr [1:2] "dist" "speed" ## $ model :'data.frame': 50 obs. of 2 variables: ## ..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ... ## ..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ... ## ..- attr(*, "terms")=Classes 'terms', 'formula' language dist ~ speed ## .. .. ..- attr(*, "variables")= language list(dist, speed) ## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1 ## .. .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. .. ..$ : chr [1:2] "dist" "speed" ## .. .. .. .. ..$ : chr "speed" ## .. .. ..- attr(*, "term.labels")= chr "speed" ## .. .. ..- attr(*, "order")= int 1 ## .. .. ..- attr(*, "intercept")= int 1 ## .. .. ..- attr(*, "response")= int 1 ## .. .. ..- attr(*, ".Environment")=<environment: 0x7fec4e6d43b0> ## .. .. ..- attr(*, "predvars")= language list(dist, speed) ## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric" ## .. .. .. 
..- attr(*, "names")= chr [1:2] "dist" "speed"
##  - attr(*, "class")= chr "lm"
```

---

## Looking up data values with keys

We'll come across specialized data types, like the output from a **_linear model_**. But don't worry, these are just **_named lists_** that use keys as indices.

```r
names(m)
```

```
## [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"
## [7] "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"
```

--

We can use those names to look up specific output from what looks like a complex object. It's that easy.

```r
m$coefficients
```

```
## (Intercept)       speed 
##  -17.579095    3.932409
```

---
class:newsection

# Operators

---

## Mathematical Operators

Broadly speaking, `R` functions as a general calculator that can process a variety of data types. As we can see, most operators in `R` are the usual suspects, but some forms are particular to `R`.

.center[
| Operation | Calc | Out |
|-----------------|-----------------|-----------------|
|Addition | `3 + 4` | `7`|
|Subtraction | `3 - 4` | `-1`|
|Multiplication | `3 * 4` | `12`|
|Division | `3 / 4` | `0.75`|
|Exponentiation | `3 ^ 4` | `81`|
|Modulo | `4%%3` | `1`|
]

In the example, we'll walk through a few more operators.

---

# Mathematical Functions

![:space 10]

There are a range of functions designed to ease mathematical calculations.

Some of these functions calculate specific values, such as the **natural log** or powers of **Euler's number** ($e^a$).

```r
log(4)
```

```
## [1] 1.386294
```

```r
exp(5)
```

```
## [1] 148.4132
```

---

![:space 10]

There are a range of functions designed to ease mathematical calculations.

Others can be used to find the **sum** of a numerical vector, the **mean**, or the **median**.

```r
x <- c(1,3,7,100)
sum(x)
```

```
## [1] 111
```

```r
mean(x)
```

```
## [1] 27.75
```

```r
median(x)
```

```
## [1] 5
```

---

# Logical Operators

Boolean statements (i.e. true/false statements) are central to any computer programming environment.
Boolean statements allow us to make quick conditional evaluations, which are key to **subsetting** data.

--

The following outlines the various types of boolean statements available.

```r
x == y  # equal to
x != y  # does not equal
x >= y  # greater than or equal to
x <= y  # less than or equal to
x > y   # greater than
x < y   # less than
```

--

Statements can be combined using the **and** (`&`) and **or** (`|`) operators to make more specific queries.

```r
x==1 & y==5 # "and" conditional statements
x==1 | y==5 # "or" conditional statements
```

---

![:space 10]

Boolean statements can be fed directly into data objects via the brackets method `[]`. This offers a powerful and simple way to subset data.

```r
x <- c(1,33,100,.6,5,77)
x
```

```
## [1]   1.0  33.0 100.0   0.6   5.0  77.0
```

```r
x[x > 30]
```

```
## [1]  33 100  77
```

---

![:space 10]

There are also a number of base functions that provide useful boolean evaluations. Here are just a few examples...

```r
is.character("hello") # for class
```

```
## [1] TRUE
```

```r
all(c(T,F,F)) # are all entries True?
```

```
## [1] FALSE
```

```r
identical(1+1,2) # are these entries the same?
```

```
## [1] TRUE
```

---

![:space 10]

Finally, boolean statements have a nice property in `R`. If we convert a boolean statement to a **numeric class**, `TRUE` values convert to `1` and `FALSE` values convert to `0`. This offers us a quick way of generating **dichotomous** values.

```r
x <- 1:10
x >= 5
```

```
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
```

```r
as.numeric(x >= 5)
```

```
##  [1] 0 0 0 0 1 1 1 1 1 1
```

---

## Subsetting with logical operators

We can combine what we know about logical operators and accessing the columns and rows in a relational `data.frame` to powerful effect.
--

```r
d <- data.frame(x = c(100,200,300,400),
                y = c("a","b","b","a"))
d
```

```
##     x y
## 1 100 a
## 2 200 b
## 3 300 b
## 4 400 a
```

--

```r
d[ d$x > 2, ]
```

```
##     x y
## 1 100 a
## 2 200 b
## 3 300 b
## 4 400 a
```

---
class: newsection

# Functions

---

# What are functions?

![:space 10]

A **function** is a type of object in `R` that can perform a specific task. Unlike objects that hold data, functions take **arguments** as input and output some manipulated form of the input data.

--

A function is called by writing the object name followed by parentheses. For example, the function `log()` calculates the natural log of any number placed inside the parentheses.

```r
log(4)
```

```
## [1] 1.386294
```

---

# Where are functions exactly?

![:space 10]

Functions operate in the **background**. There are a number of functions in `R`, known as **base functions**, that are always available when you start `R`.

When we need to do things that are <u>**not**</u> a part of the base functionality, we can import new functions by installing **packages**. But more on this later.

---

# Some common functions

![:space 5]

We've already come across a few functions, and we'll learn a lot more moving forward. Just keep in mind that whenever a name is followed by parentheses `()`, it's a function. Here are examples of a few common base functions that we'll see.

.center[
| Function | Description |
| :---: | :---: |
| `c()` | links entries together as a vector |
| `as.character()` | coerces the input to be a character class |
| `length()` | reports how "long" a vector or data frame is |
| `dim()` | reports the dimensions of a data frame |
| `class()` | reports the class of an object |
]

---

## Figuring out what a function does...

All functions in `R` contain rich documentation regarding how a function works, the inputs it requires, and example code. We can access this documentation by using `?` in front of the function.
```r
?c()
```

<img src="Figures/function_help.png" align="middle">

---
class:newsection

# Packages

---

## R Packages

![:space 10]

There are a number of `packages` that are supplied with the R distribution. These are known as "[base packages](https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html)" and they are available in the background the second one starts a session in R.

--

- A **`package`** is a set of functions and programs that perform specific tasks.
- By installing packages, **we introduce new forms of functionality to the R environment**.

---

## R Packages

![:space 10]

To use the content in a package, one first needs to **install it**. One can do this by utilizing the following function: `install.packages()`.

By inserting the name of a specific package, we can connect to an R "mirror" and download the binary of the package.

```r
install.packages("tidyverse")
```

The version of that package is then saved on your computer and can be called at any time (on or offline).

---

## R Packages

![:space 10]

Once installed, it's on the system for good. You can then reference or load the package any time you wish to use a function from it.

There are two functions we can use to load a package: `library()` and `require()`.

```r
library(tidyverse)
# or
require(tidyverse)
```

> You must <u>load</u> the package before you can use any function in it.

---

`R Studio` also offers us a way to install packages through the interface. If we click on the `Packages` tab and then click `Install`, we can download a package by typing its name.

<img src="Figures/install_packages.png">

---

We then can **load** the package from R Studio by clicking the check box beside the package's name.

<img src="Figures/load_package.png">

---

Sometimes one has _a lot_ of packages running simultaneously. No problem: we can see what packages are up and running by typing `sessionInfo()` into the console.
This will tell us everything about the version of R and the packages we are using to run our analysis. ```r sessionInfo() ``` ``` ## R version 3.6.2 (2019-12-12) ## Platform: x86_64-apple-darwin15.6.0 (64-bit) ## Running under: macOS Catalina 10.15.5 ## ## Matrix products: default ## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib ## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib ## ## Random number generation: ## RNG: Mersenne-Twister ## Normal: Inversion ## Sample: Rounding ## ## locale: ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] shiny_1.4.0 bindrcpp_0.2.2 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4 ## [7] readr_1.3.1 tidyr_1.1.0 tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_1.0.5 lubridate_1.7.4 assertthat_0.2.1 digest_0.6.25 packrat_0.5.0 ## [6] mime_0.9 R6_2.4.1 cellranger_1.1.0 backports_1.1.7 reprex_0.3.0 ## [11] evaluate_0.14 httr_1.4.1 xaringan_0.14 pillar_1.4.4 rlang_0.4.7 ## [16] readxl_1.3.1 rstudioapi_0.11 rmarkdown_2.3 servr_0.15 munsell_0.5.0 ## [21] broom_0.7.0 compiler_3.6.2 httpuv_1.5.2 modelr_0.1.6 xfun_0.12 ## [26] pkgconfig_2.0.3 htmltools_0.4.0 tidyselect_1.1.0 codetools_0.2-16 fansi_0.4.1 ## [31] crayon_1.3.4 dbplyr_1.4.2 withr_2.2.0 later_1.1.0.1 grid_3.6.2 ## [36] jsonlite_1.7.0 xtable_1.8-4 gtable_0.3.0 lifecycle_0.2.0 DBI_1.1.0 ## [41] magrittr_1.5 scales_1.1.0 cli_2.0.2 stringi_1.4.6 farver_2.0.3 ## [46] fs_1.3.2 promises_1.1.0 xml2_1.2.2 ellipsis_0.3.1 generics_0.0.2 ## [51] vctrs_0.3.1 tools_3.6.2 glue_1.4.1 hms_0.5.3 rsconnect_0.8.16 ## [56] fastmap_1.0.1 yaml_2.2.1 colorspace_1.4-1 rvest_0.3.5 knitr_1.28 ## [61] bindr_0.1.1 haven_2.3.1 ``` --- ## Remember to Load Your Package! 
If you ever try to run a function and you get the following prompt...

`Error: could not find function "qplot"`

It's likely you forgot to <font size=10 color="blue" style="bold"> <u>load the package</u> </font>.

```r
require(ggplot2) # First load the package
qplot()          # Then run the function
# Voilà!
```

---
class:newsection

# Importing & Exporting Data

---

`R` allows you to import a large variety of datasets into the environment. However, `R`'s base packages <u>only support a few data types</u>.

--

No fear: there is almost always an **external package** that can do the job! We are going to focus on **three packages** to import different data types:

- `readr` --- an expansive array of functions to read different data types
- `readxl` --- for Excel spreadsheets
- `haven` --- for SPSS, SAS, and Stata (`.dta`)

---

First, we need to **install** these packages onto our computer.

```r
install.packages("readr")
install.packages("readxl")
install.packages("haven")
```

And then **load** them into our current `R` Session.

```r
require(readr)
require(readxl)
require(haven)
```

---

# Importing data

Here we will review how to import five separate data types:

- `.dta` --- STATA file format
- `.csv` --- comma-separated file format
- `.sav` --- SPSS file format
- `.xlsx` --- standard Excel file format
- `.Rdata` --- R's file format

---

# .dta

![:space 10]

For all versions of STATA

```r
require(haven)
data <- read_dta(file = "data.dta")
```

<br>

Other packages:

- `readstata13`
- `foreign`

---

# .csv

`read.csv()` and `read.table()` are both **base functions** in `R`.

```r
data <- read.csv(file = "data.csv",
                 stringsAsFactors = F)
# Or
data <- read.table(file = "data.csv",
                   header = T, sep=",",
                   stringsAsFactors = F)
```

These functions have specific **arguments** that we are referencing:

- `stringsAsFactors = F` means that we don't want all `character` vectors in the `data.frame` to be converted to `factors`.
- `header = T` means the first row of the data contains the column names.
- `sep` means that entries are separated by commas.

---

# .csv

![:space 10]

The `readr` package provides a much simpler approach.

```r
require(readr)
data <- read_csv("data.csv")
```

- `characters` aren't converted to `factors`.
- More efficient as `\(N\)` increases

---

# .sav

![:space 10]

For `SPSS` and `SAS` file formats, the `haven` package offers a simple way of reading in data.

```r
require(haven)
data <- read_sav(file = "data.sav") # SPSS
```

---

# .xlsx

![:space 10]

```r
require(readxl)
data <- read_excel("data.xlsx")
```

We can even select specific sheets.

```r
excel_sheets("data.xlsx") # list avail. sheets
```

[1] Sheet1, Sheet2

```r
data <- read_excel("data.xlsx", sheet = 'Sheet1')
```

---

# .Rdata

![:space 10]

`.Rdata` is `R`'s native data format. It saves and loads `objects`.

```r
load(file='data.Rdata')
```

---

# Importing Data Using R Studio

There is also a point-and-click option for importing data in R Studio: go into the `Environment` tab and then click `Import Dataset`.

<img src="Figures/importing.png">

---

# Exporting data

Exporting data is the same process in reverse. Instead of **reading** the data, we want to **write** a new version of it.

There are a series of functions (each provided by their respective packages) that allow us to do just that.

Each requires that you input the **data** that you're looking to export and specify the **file name** and **path** to tell the computer where the file is going.

---

# Exporting data

![:space 10]

```r
# The second argument is the output file (named `path` or `file`, depending on the package version)
write_dta(data, "data.dta")
write_csv(data, "data.csv")
write_sav(data, "data.sav")
write_sas(data, "data.sas")
write_tsv(data, "data.tab")
# etc.
```

---

# .Rdata

![:space 10]

`.Rdata` offers two options to save data. We can either save a single data object, or save the entire workspace.

```r
# Save just an object
save(data, file="data.Rdata")

# Save the entire workspace
save.image(file="workspace.Rdata")
```

---
class: newsection

# But where is my data exactly?
---

# But where is my data exactly?

![:space 10]

`R` doesn't intuitively know where your data is. If the data is in a special folder entitled "`my_data`", we have to tell `R` how to get there.

We can do this three ways:

--

1. Set the **working directory** to that folder
2. Set the directory via a point-and-click option in `R Studio`
3. Establish the **path** directly to that folder

---

# Setting the Working Directory

![:space 10]

Every time `R` boots up, it does so in the same place, unless we tell it to go somewhere else.

We can find out which directory we are in by using the `getwd()` function.

```r
getwd() # Get the current working directory
```

/Users/edunford/

---

# Setting the Working Directory

![:space 10]

Every time `R` boots up, it does so in the same place, unless we tell it to go somewhere else.

We can then set a new working directory by passing the path to the folder we want to work in as a **string** to the function `setwd()`.

```r
setwd("/Users/edunford/Desktop/my_data")
getwd()
```

/Users/edunford/Desktop/my_data/

---

# Setting the WD via R Studio

![:space 10]

R Studio also makes setting the working directory really easy.

Click: `Session` → `Set Working Directory` → `Choose Directory...`

This will allow you to set the working directory quickly. The downside is that you have to do it **manually every time you return to this project**.

By writing a script for everything you do, it is easier to replicate (and for others to replicate) your work.

---

# Establishing the Path

![:space 10]

Finally, we can also just point directly to the data by outlining the specific path.

Here we are assigning a string containing our **path** to the object `path`.

```r
path <- "~/Desktop/my_data/data.csv"
```

We then load the data using that path.

```r
read.csv(path)
```

---

# Beyond Working Directories

![:space 10]

Working directories are limiting:

- If files are **moved** or **renamed**, <font color = "red"> a script won't run </font>.
<br>
<br>

- Analyses cannot be easily transported across computers or users.

---

# Beyond Working Directories

The solutions:

1. **R Projects**

.center[<img src="Figures/rproj-activate.png" width=400px>]

.center[<img src="Figures/rproj-specify.png" width=400px>]

---

# Beyond Working Directories

The solutions:

1. **R Projects**
2. **R Projects** + the [`here`](https://github.com/jennybc/here_here) package
    - To easily move around the project's subfolders and files
    - `here()` works like `file.path()`, but where the path root is implicitly set to “the path to the top-level of my current project”.
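---

# Beyond Working Directories

![:space 10]

To make the comparison with `file.path()` concrete, here is a minimal sketch. The folder and file names (`data`, `raw`, `survey.csv`) are hypothetical, and the `here::here()` call only runs if the `here` package happens to be installed.

```r
# file.path() joins path pieces with a forward slash
rel_path <- file.path("data", "raw", "survey.csv")  # hypothetical folders and file
rel_path

# here::here() takes the same pieces but anchors them at the project root
if (requireNamespace("here", quietly = TRUE)) {
  here::here("data", "raw", "survey.csv")
}
```

Because the root is resolved when the code runs, a path built this way should keep working when the whole project folder is moved to another location or another computer.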