The recipes package
In this walkthrough, we’ll explore how to pre-process data using the recipes package. Data preprocessing is the process of preparing our data for modeling. As we’ve seen, data comes in all shapes and sizes. Some data is dirty and needs to be cleaned before building a model with it: variables may sit on very different scales, categorical variables may need to be recoded, and values may be missing altogether.
Pre-processing is a necessary step used to resolve these issues. Some common pre-processing tasks (previewed in a short sketch after the package-loading code below) are:
Scaling and transforming continuous values
Converting categorical variables to dummy variables
Detecting and imputing missing values
Here are the packages we’ll use in this walkthrough.
require(tidyverse) # For data wrangling and plotting
require(recipes) # For data pre-processing
require(rsample) # For generating train-test splits
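As a preview of where we’re headed, the three tasks above map directly onto recipes steps. The sketch below is only illustrative: it uses the train_dat training data and status outcome that we create further down in this walkthrough, and step_impute_mean() goes by step_meanimpute() in older versions of recipes.
rec <- recipe(status ~ ., data = train_dat) %>%
  step_impute_mean(all_numeric(), -all_outcomes()) %>% # fill in missing numeric values with the training mean
  step_normalize(all_numeric(), -all_outcomes()) %>%   # center and scale the continuous predictors
  step_dummy(all_nominal(), -all_outcomes())           # convert categorical predictors to dummy variables
rec %>% prep() %>% juice() # estimate the steps on the training data and return the processed data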
The provided data captures whether or not a person will pay back a bank loan. The outcome variable is status, a qualitative outcome that takes one of two values: good if the individual has good credit, otherwise bad. There are 13 other variables tracking features about the debtor and the loan.
We’ll encounter this data again later on when we learn more about modeling. Right now, we’ll use it to walk through pre-processing concepts.
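The call that reads the data in isn’t shown here. Below is a minimal sketch of what it likely looks like, assuming the credit data sit in a CSV file (the file name is a placeholder), with the column names lower-cased and the character columns converted to factors afterward, which matches the output we see below.
credit_data <- read_csv("credit_data.csv") %>% # placeholder file name
  rename_all(tolower) %>%                      # lower-case the column names (Status -> status, etc.)
  mutate_if(is.character, as.factor)           # store the categorical variables as factors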
Parsed with column specification:
cols(
Status = col_character(),
Seniority = col_double(),
Home = col_character(),
Time = col_double(),
Age = col_double(),
Marital = col_character(),
Records = col_character(),
Job = col_character(),
Expenses = col_double(),
Income = col_double(),
Assets = col_double(),
Debt = col_double(),
Amount = col_double(),
Price = col_double()
)
Before doing anything with our data, we want to split it into a training and a test dataset. We don’t want to learn anything about our test data, and that even includes its summary statistics. So, before we look at and explore our data (to figure out how to preprocess it), we make the split.
We can easily break our data up into a training and test dataset using the rsample package. First, we generate a split object using the initial_split() function. Note that the prop = argument dictates the proportion of the data we want to keep as training data.
set.seed(123) # We set a seed so we can reproduce the random split
splits <- initial_split(credit_data, prop = .75) # hold out 25% of the data for testing
splits
<Analysis/Assess/Total>
<3341/1113/4454>
Then we break the split into training and test datasets.
train_dat <- training(splits)
test_dat <- testing(splits)
If we look at the number of observations, we see that each dataset contains almost exactly the proportion of the data we requested.
nrow(train_dat)/nrow(credit_data)
[1] 0.7501123
nrow(test_dat)/nrow(credit_data)
[1] 0.2498877
Note: never look at the test data. Maintaining this rule is how we can do good social science using machine learning methods. More on this later.
Data Types: Note that we have variables of different classes/types. How we preprocess these data will differ depending on each variable’s type.
train_dat %>%
summarize_all(class) %>%
glimpse()
Rows: 1
Columns: 14
$ status <chr> "factor"
$ seniority <chr> "numeric"
$ home <chr> "factor"
$ time <chr> "numeric"
$ age <chr> "numeric"
$ marital <chr> "factor"
$ records <chr> "factor"
$ job <chr> "factor"
$ expenses <chr> "numeric"
$ income <chr> "numeric"
$ assets <chr> "numeric"
$ debt <chr> "numeric"
$ amount <chr> "numeric"
$ price <chr> "numeric"
Among the numeric variables, we clearly have variables on different scales… with some missing values!
train_dat %>%
summarize_if(is.numeric, function(x) mean(x)) %>%
glimpse()
Rows: 1
Columns: 9
$ seniority <dbl> 7.973661
$ time <dbl> 46.57947
$ age <dbl> 37.06435
$ expenses <dbl> 55.48399
$ income <dbl> NA
$ assets <dbl> NA
$ debt <dbl> NA
$ amount <dbl> 1044.601
$ price <dbl> 1470.798
We can drop missing values when calculating a summary statistic with the na.rm = T argument.
train_dat %>%
summarize_if(is.numeric, function(x) mean(x,na.rm = T)) %>%
glimpse()
Rows: 1
Columns: 9
$ seniority <dbl> 7.973661
$ time <dbl> 46.57947
$ age <dbl> 37.06435
$ expenses <dbl> 55.48399
$ income <dbl> 140.7199
$ assets <dbl> 5285.176
$ debt <dbl> 320.4593
$ amount <dbl> 1044.601
$ price <dbl> 1470.798
Among the categorical variables, we also have missing values.
train_dat %>%
select_if(is.factor) %>%
glimpse()
Rows: 3,341
Columns: 5
$ status <fct> good, bad, good, good, good, good, bad, good, good, good, good, bad, good…
$ home <fct> rent, owner, rent, owner, owner, parents, parents, owner, owner, owner, p…
$ marital <fct> married, married, single, married, married, single, married, married, mar…
$ records <fct> no, yes, no, no, no, no, no, no, no, no, no, yes, no, no, no, no, no, no,…
$ job <fct> freelance, freelance, fixed, fixed, fixed, fixed, partime, freelance, fix…
train_dat %>%
select_if(is.factor) %>%
summarize_all(function(x) sum(is.na(x))) # count the missing values in each categorical variable
How is the continuous data distributed? And what does this tell us?
train_dat %>%
# only select the numeric variables
select_if(is.numeric) %>%
# Pivot to longer data format (for faceting)
pivot_longer(cols = everything()) %>%
# Plot histograms for each variable
ggplot(aes(value)) +
geom_histogram() +
facet_wrap(~name,scales="free",ncol=3)
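The resulting histograms aren’t reproduced here, but their shapes are what tell us which transformations to consider: predictors on wildly different scales point toward centering and scaling, while heavily skewed predictors are candidates for a transformation such as a log. As a hedged illustration only (the choice of variables is hypothetical, and the offset guards against zero values), a log transform in recipes would look like this:
recipe(status ~ ., data = train_dat) %>%
  step_log(income, assets, offset = 1) # hypothetical variable choice; offset = 1 avoids log(0)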