The recipes package
In this walkthrough, we’ll explore how to pre-process data using the recipes package. Data preprocessing is the process of preparing our data for modeling. As we’ve seen, data comes in all shapes and sizes. Some data is dirty and needs to be cleaned before building a model with it. Specifically, data can be on different scales, stored as qualitative categories rather than numbers, or missing values altogether.
Pre-processing is a necessary step used to resolve these issues. Some common pre-processing tasks are:
Scaling and transforming continuous values
Converting categorical variables to dummy variables
Detecting and imputing missing values
Here are the packages we’ll use in this walkthrough.
require(tidyverse) # For data wrangling and plotting
require(recipes) # For data pre-processing
require(rsample) # For generating train-test splits
The provided data captures whether or not a person will pay back a bank loan. The outcome variable is status, which is a qualitative outcome that takes one of two values: good if the individual has good credit, otherwise bad. There are 13 other variables tracking features about the debtor and the loan.
We’ll encounter this data later on when we learn more about modeling. Right now, we’ll use it to walk through pre-processing concepts.
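The code that reads the data in isn’t shown here. A minimal sketch of how it might be loaded (the file name is an assumption, and the clean-up steps are only inferred from the column classes shown later) looks like this; read_csv() prints the column specification message below.
# Sketch only: the file name and clean-up steps are assumptions, not part of the original walkthrough
credit_data <- read_csv("credit_data.csv") %>%
  rename_all(tolower) %>%              # lower-case the column names
  mutate_if(is.character, as.factor)   # store the qualitative variables as factors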
Parsed with column specification:
cols(
Status = col_character(),
Seniority = col_double(),
Home = col_character(),
Time = col_double(),
Age = col_double(),
Marital = col_character(),
Records = col_character(),
Job = col_character(),
Expenses = col_double(),
Income = col_double(),
Assets = col_double(),
Debt = col_double(),
Amount = col_double(),
Price = col_double()
)
Before doing anything with our data, we want to split it into a training and test dataset. We don’t want to learn anything about our test data, and that even includes summary statistics. Thus, before we look at or explore our data (to figure out how to preprocess it), we split it into training and test sets.
We can easily break our data up into a training and test dataset using the rsample package. First, we want to generate a split object using the initial_split() function. Note that the prop = argument dictates the proportion of the data we want to keep as training data.
set.seed(123) # We set a seed so we can reproduce the random split
splits <- initial_split(credit_data,prop = .75)
splits
<Analysis/Assess/Total>
<3341/1113/4454>
Then we extract the training and test datasets from the split.
train_dat <- training(splits)
test_dat <- testing(splits)
If we look at the number of observations, we see that we get very nearly the proportions requested for each dataset.
nrow(train_dat)/nrow(credit_data)
[1] 0.7501123
nrow(test_dat)/nrow(credit_data)
[1] 0.2498877
Note: never look at the test data. Maintaining this rule is how we can do good social science using machine learning methods. More on this later.
Data Types: Note we have variables of different classes/types. How we preprocess these data will differ depending on their type.
train_dat %>%
summarize_all(class) %>%
glimpse()
Rows: 1
Columns: 14
$ status <chr> "factor"
$ seniority <chr> "numeric"
$ home <chr> "factor"
$ time <chr> "numeric"
$ age <chr> "numeric"
$ marital <chr> "factor"
$ records <chr> "factor"
$ job <chr> "factor"
$ expenses <chr> "numeric"
$ income <chr> "numeric"
$ assets <chr> "numeric"
$ debt <chr> "numeric"
$ amount <chr> "numeric"
$ price <chr> "numeric"
Among the numeric variables, we clearly have variables on different scales… with some missing values!
train_dat %>%
summarize_if(is.numeric, function(x) mean(x)) %>%
glimpse()
Rows: 1
Columns: 9
$ seniority <dbl> 7.973661
$ time <dbl> 46.57947
$ age <dbl> 37.06435
$ expenses <dbl> 55.48399
$ income <dbl> NA
$ assets <dbl> NA
$ debt <dbl> NA
$ amount <dbl> 1044.601
$ price <dbl> 1470.798
We can drop missing values when calculating a summary statistic with the na.rm = T argument.
train_dat %>%
summarize_if(is.numeric, function(x) mean(x,na.rm = T)) %>%
glimpse()
Rows: 1
Columns: 9
$ seniority <dbl> 7.973661
$ time <dbl> 46.57947
$ age <dbl> 37.06435
$ expenses <dbl> 55.48399
$ income <dbl> 140.7199
$ assets <dbl> 5285.176
$ debt <dbl> 320.4593
$ amount <dbl> 1044.601
$ price <dbl> 1470.798
Among the categorical variables, we also have missing values.
train_dat %>%
select_if(is.factor) %>%
glimpse()
Rows: 3,341
Columns: 5
$ status <fct> good, bad, good, good, good, good, bad, good, good, good, good, bad, good…
$ home <fct> rent, owner, rent, owner, owner, parents, parents, owner, owner, owner, p…
$ marital <fct> married, married, single, married, married, single, married, married, mar…
$ records <fct> no, yes, no, no, no, no, no, no, no, no, no, yes, no, no, no, no, no, no,…
$ job <fct> freelance, freelance, fixed, fixed, fixed, fixed, partime, freelance, fix…
train_dat %>%
select_if(is.factor) %>%
summarize_all(function(x) sum(is.na(x)))
How is the continuous data distributed? And what does this tell us?
train_dat %>%
# only select the numeric variables
select_if(is.numeric) %>%
# Pivot to longer data format (for faceting)
pivot_longer(cols = everything()) %>%
# Plot histograms for each variable
ggplot(aes(value)) +
geom_histogram() +
facet_wrap(~name,scales="free",ncol=3)
The recipes package operates by laying out a series of steps, which are then executed in order. Once we have all our steps in place, we bake the recipe (i.e. execute the steps and transform the data all at once).
Here are the step_s we need to perform:
Impute any missing values
Log transform amount, assets, debt, income, seniority, expenses, and price
Scale the continuous variables
Convert the categorical variables to dummy variables
First, let’s initialize the recipe object.
our_recipe <- recipe(status ~ ., data = train_dat)
our_recipe
Data Recipe
Inputs:
recipes offers many different forms of imputation.
Imputation Methods:
step_bagimpute
step_knnimpute
step_lowerimpute
step_meanimpute
step_medianimpute
step_modeimpute
step_rollimpute
See the package documentation for how any one of these methods works (e.g. step_knnimpute).
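For example, a sketch (mine, not part of the walkthrough) of swapping in k-nearest-neighbors imputation for the numeric variables might look like this; the neighbors value shown is just the package default.
# Hypothetical alternative: fill each missing numeric value using the
# 5 most similar rows rather than the column mean
alt_recipe <- recipe(status ~ ., data = train_dat) %>%
  step_knnimpute(all_numeric(), neighbors = 5)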
Below I’m using mean imputation for all numeric variables, which fills in any missing values with the variable’s average, and mode imputation for the categorical variables, which fills in any missing values with the most common category.
Note the all_numeric() and all_nominal() selector functions. These tell the recipe to perform a particular step on all variables of a given type. Alternatively, we can reference specific variables directly by name.
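For instance, a hypothetical variant that mean-imputes only the three numeric variables we saw missing values in (income, assets, and debt) could name them directly:
# Hypothetical: impute only the named columns rather than all numeric variables
named_recipe <- recipe(status ~ ., data = train_dat) %>%
  step_meanimpute(income, assets, debt)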
our_recipe <-
our_recipe %>%
step_meanimpute(all_numeric()) %>%
step_modeimpute(all_nominal())
our_recipe
Data Recipe
Inputs:
Operations:
Mean Imputation for all_numeric()
Mode Imputation for all_nominal()
Recall there was a right skew in the following variables: amount, assets, debt, income, seniority, expenses, and price. We can log transform these features so that the extreme values in their tails exert less influence.
The offset = 1 argument adds one to each of the variables before logging. Why do you think this is?
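As a hint, a quick check in the console (my own snippet, not from the walkthrough) shows what happens to the zeros that appear in variables like assets, debt, and seniority:
log(0)     # -Inf: a plain log blows up on zero values
log(0 + 1) # 0: an offset of 1 keeps zero values finite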
our_recipe <-
our_recipe %>%
step_log(amount,assets,debt,income,
seniority,expenses,price,offset = 1)
our_recipe
Data Recipe
Inputs:
Operations:
Mean Imputation for all_numeric()
Mode Imputation for all_nominal()
Log transformation on amount, assets, debt, income, seniority, expenses, price
Next, we normalize (center and scale) all of the continuous variables.
our_recipe <-
our_recipe %>%
step_normalize(all_numeric()) # Center mean around 0 and Set variance to 1
our_recipe
Data Recipe
Inputs:
Operations:
Mean Imputation for all_numeric()
Mode Imputation for all_nominal()
Log transformation on amount, assets, debt, income, seniority, expenses, price
Centering and scaling for all_numeric()
Last, we convert the categorical variables to dummy variables.
our_recipe <-
our_recipe %>%
step_dummy(all_nominal())
our_recipe
Data Recipe
Inputs:
Operations:
Mean Imputation for all_numeric()
Mode Imputation for all_nominal()
Log transformation on amount, assets, debt, income, seniority, expenses, price
Centering and scaling for all_numeric()
Dummy variables from all_nominal()
prep() calculates all the necessary statistics and values for the transformations. It then stores this information for use later on. When might we use this data later?
prepared_recipe <- our_recipe %>% prep()
prepared_recipe
Data Recipe
Inputs:
Training data contained 3341 data points and 312 incomplete rows.
Operations:
Mean Imputation for seniority, time, age, expenses, income, assets, ... [trained]
Mode Imputation for home, marital, records, job, status [trained]
Log transformation on amount, assets, debt, income, seniority, expenses, price [trained]
Centering and scaling for seniority, time, age, expenses, income, assets, ... [trained]
Dummy variables from home, marital, records, job, status [trained]
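As an aside, if we want to peek at what prep() stored, recipes provides a broom-style tidy() method. A quick sketch (the number refers to the first step we added, the mean imputation):
# Returns one row per imputed variable along with the training-set mean
# that will be used to fill in its missing values
tidy(prepared_recipe, number = 1)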
Now that the recipe is prepared, we can apply our preprocessing steps to our data, which transforms the data as requested.
Before
glimpse(train_dat)
Rows: 3,341
Columns: 14
$ status <fct> good, bad, good, good, good, good, bad, good, good, good, good, bad, go…
$ seniority <dbl> 9, 10, 0, 1, 29, 9, 0, 6, 7, 8, 19, 0, 0, 15, 0, 1, 2, 5, 1, 27, 26, 12…
$ home <fct> rent, owner, rent, owner, owner, parents, parents, owner, owner, owner,…
$ time <dbl> 60, 36, 36, 60, 60, 12, 48, 48, 36, 60, 36, 18, 24, 24, 48, 60, 60, 60,…
$ age <dbl> 30, 46, 26, 36, 44, 27, 41, 34, 29, 30, 37, 21, 68, 52, 36, 31, 25, 22,…
$ marital <fct> married, married, single, married, married, single, married, married, m…
$ records <fct> no, yes, no, no, no, no, no, no, no, no, no, yes, no, no, no, no, no, n…
$ job <fct> freelance, freelance, fixed, fixed, fixed, fixed, partime, freelance, f…
$ expenses <dbl> 73, 90, 46, 75, 75, 35, 90, 60, 60, 75, 75, 35, 75, 35, 45, 35, 46, 45,…
$ income <dbl> 129, 200, 107, 214, 125, 80, 80, 125, 121, 199, 170, 50, 131, 330, 130,…
$ assets <dbl> 0, 3000, 0, 3500, 10000, 0, 0, 4000, 3000, 5000, 3500, 0, 4162, 16500, …
$ debt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0…
$ amount <dbl> 800, 2000, 310, 650, 1600, 200, 1200, 1150, 650, 1500, 600, 400, 900, 1…
$ price <dbl> 846, 2985, 910, 1645, 1800, 1093, 1468, 1577, 915, 1650, 940, 500, 1186…
After
dat_processed <- bake(prepared_recipe,new_data = train_dat)
glimpse(dat_processed)
Rows: 3,341
Columns: 23
$ seniority <dbl> 0.55268042, 0.64683872, -1.72207650, -1.03730643, 1.63801530, 0…
$ time <dbl> 0.91886117, -0.72434246, -0.72434246, 0.91886117, 0.91886117, -…
$ age <dbl> -0.643976061, 0.814560686, -1.008610248, -0.097024781, 0.632243…
$ expenses <dbl> 0.9909869, 1.6198766, -0.3894374, 1.0720886, 1.0720886, -1.2002…
$ income <dbl> 0.059610992, 0.971859888, -0.328515103, 1.112816127, -0.0058135…
$ assets <dbl> -1.3069752, 0.5963635, -1.3069752, 0.6329966, 0.8825144, -1.306…
$ debt <dbl> -0.4426112, -0.4426112, -0.4426112, -0.4426112, -0.4426112, -0.…
$ amount <dbl> -0.31003590, 1.51548962, -2.19642943, -0.72347991, 1.07080608, …
$ price <dbl> -1.131668339, 1.897728092, -0.956533589, 0.465757253, 0.6821300…
$ home_other <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ home_owner <dbl> 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, …
$ home_parents <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ home_priv <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ home_rent <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, …
$ marital_married <dbl> 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, …
$ marital_separated <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ marital_single <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, …
$ marital_widow <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ records_yes <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ job_freelance <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
$ job_others <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ job_partime <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, …
$ status_good <dbl> 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, …
For example, compare the distribution of age before and after preprocessing:
train_dat %>%
ggplot(aes(age)) +
geom_density(fill="pink",
alpha=.5)
dat_processed %>%
ggplot(aes(age)) +
geom_density(fill="pink",
alpha=.5)
In practice, we can build up our data preprocessing method in one go.
our_recipe <-
recipe(status ~ ., data = train_dat) %>%
step_meanimpute(all_numeric()) %>%
step_modeimpute(all_nominal()) %>%
step_log(amount,assets,debt,income,
seniority,expenses,price,offset = 1) %>%
step_normalize(all_numeric()) %>%
step_dummy(all_nominal()) %>%
prep()
Then we just apply it to the training data prior to running any model.
train_dat2 <- bake(our_recipe,new_data = train_dat)
And to the test data before checking performance.
test_dat2 <- bake(our_recipe,new_data = test_dat)
Again, why is this so important?
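One way to see the answer (a sketch of mine, not output from the walkthrough): the centering and scaling constants were learned from the training data only, so the baked training data is centered at zero by construction, while the baked test data is only approximately centered.
# Essentially zero: the mean was estimated from these very rows
mean(train_dat2$age)
# Close to zero, but not exactly: the training-set mean and sd were reused
mean(test_dat2$age)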