Overview

In this walkthrough, we’ll explore how to pre-process data using the recipes package. Data pre-processing is the process of preparing our data for modeling. As we’ve seen, data comes in all shapes and sizes. Some data is dirty and needs to be cleaned before we can build a model with it. Specifically, data can be:

  • On very different scales and/or skewed
  • Missing
  • Non-numeric (and thus not something a machine can process) – e.g. categorical or character data

Pre-processing is a necessary step for resolving these issues. Some common pre-processing tasks (each sketched in code after this list) are:

  • Scaling and transforming continuous values

  • Converting categorical variables to dummy variables

  • Detecting and imputing missing values
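
In the recipes package (loaded in the next section), each of these tasks corresponds to a step_*() function. Here is a rough sketch of where we’re headed: the particular steps, their order, and the selectors are illustrative (and assume a recent version of recipes), and train_dat refers to the training split we create below.

rec_sketch <- recipe(status ~ ., data = train_dat) %>% 
  step_impute_mean(all_numeric()) %>%          # impute missing numeric values with the mean
  step_normalize(all_numeric()) %>%            # center and scale continuous values
  step_dummy(all_nominal(), -all_outcomes())   # convert categorical predictors to dummy variables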

Dependencies

Here are the packages we’ll use in this walkthrough.

require(tidyverse) # For data wrangling and visualization
require(recipes) # For data pre-processing
require(rsample) # For generating train-test splits

Data

The provided data captures whether or not a person pays back a bank loan. The outcome variable is status, a qualitative outcome that takes one of two values: good if the individual has good credit, bad otherwise. There are 13 other variables tracking features of the borrower and the loan.

We’ll encounter this data again later on when we learn more about modeling. Right now, we’ll use it to walk through pre-processing concepts.
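
The column specification below was printed when the data was read in with readr; the import step itself isn’t shown. A plausible reconstruction is the following (the file name is hypothetical, and the clean-up steps reflect that the parsed columns are capitalized character columns while later output shows lowercase factor columns):

credit_data <- read_csv("credit_data.csv") %>% # hypothetical file name
  rename_all(tolower) %>%                      # later output uses lowercase names
  mutate_if(is.character, as.factor)           # later output shows factor columns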

Parsed with column specification:
cols(
  Status = col_character(),
  Seniority = col_double(),
  Home = col_character(),
  Time = col_double(),
  Age = col_double(),
  Marital = col_character(),
  Records = col_character(),
  Job = col_character(),
  Expenses = col_double(),
  Income = col_double(),
  Assets = col_double(),
  Debt = col_double(),
  Amount = col_double(),
  Price = col_double()
)

Train-Test Split

Before doing anything with our data, we want to split it into a training and a test dataset. We don’t want to learn anything about our test data, and that even includes summary statistics. Thus the split must happen before we look at or explore the data (to figure out how to pre-process it).

We can easily break our data up into a training and test dataset using the rsample package. First, we want to generate a split object using the initial_split() function. Note that the prop= argument dictates the proportion of the data we want to keep as training data.

set.seed(123) # We set a seed so we can reproduce the random split
splits <- initial_split(credit_data, prop = .75)
splits
<Analysis/Assess/Total>
<3341/1113/4454>

Then we extract the training and test datasets with training() and testing().

train_dat <- training(splits)
test_dat <- testing(splits)

If we look at the number of observations, we see that each dataset contains almost exactly the proportion of the data we requested.

nrow(train_dat)/nrow(credit_data)
[1] 0.7501123
nrow(test_dat)/nrow(credit_data)
[1] 0.2498877
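
As an aside, initial_split() also accepts a strata = argument, which samples within the levels of a given variable. Stratifying on the outcome (a sketch, not run here) keeps the share of good and bad credit roughly equal across the two datasets:

# Stratified version of the split above (sketch)
splits_strat <- initial_split(credit_data, prop = .75, strata = status)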

Note: never look at the test data. Maintaining this rule is how we can do good social science using machine learning methods. More on this later.

Data Exploration

Data Types: Note that we have variables of different classes/types. How we pre-process these data will differ depending on their type.

train_dat %>% 
  summarize_all(class) %>% 
  glimpse()
Rows: 1
Columns: 14
$ status    <chr> "factor"
$ seniority <chr> "numeric"
$ home      <chr> "factor"
$ time      <chr> "numeric"
$ age       <chr> "numeric"
$ marital   <chr> "factor"
$ records   <chr> "factor"
$ job       <chr> "factor"
$ expenses  <chr> "numeric"
$ income    <chr> "numeric"
$ assets    <chr> "numeric"
$ debt      <chr> "numeric"
$ amount    <chr> "numeric"
$ price     <chr> "numeric"

Among the numeric variables, we clearly have values on very different scales… with some missing values!

train_dat %>% 
  summarize_if(is.numeric, function(x) mean(x)) %>% 
  glimpse()
Rows: 1
Columns: 9
$ seniority <dbl> 7.973661
$ time      <dbl> 46.57947
$ age       <dbl> 37.06435
$ expenses  <dbl> 55.48399
$ income    <dbl> NA
$ assets    <dbl> NA
$ debt      <dbl> NA
$ amount    <dbl> 1044.601
$ price     <dbl> 1470.798

We can drop missing values when calculating a summary statistic with the na.rm = T argument.

train_dat %>% 
  summarize_if(is.numeric, function(x) mean(x,na.rm = T)) %>% 
  glimpse()
Rows: 1
Columns: 9
$ seniority <dbl> 7.973661
$ time      <dbl> 46.57947
$ age       <dbl> 37.06435
$ expenses  <dbl> 55.48399
$ income    <dbl> 140.7199
$ assets    <dbl> 5285.176
$ debt      <dbl> 320.4593
$ amount    <dbl> 1044.601
$ price     <dbl> 1470.798
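
To pin down which numeric variables are missing values (and how many), we can count the NAs directly; by analogy with the mean calculation above, a quick check looks like this:

train_dat %>% 
  summarize_if(is.numeric, function(x) sum(is.na(x))) %>% # count NAs per numeric column
  glimpse()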

Among the categorical variables, we also have missing values.

train_dat %>% 
  select_if(is.factor) %>% 
  glimpse()
Rows: 3,341
Columns: 5
$ status  <fct> good, bad, good, good, good, good, bad, good, good, good, good, bad, good…
$ home    <fct> rent, owner, rent, owner, owner, parents, parents, owner, owner, owner, p…
$ marital <fct> married, married, single, married, married, single, married, married, mar…
$ records <fct> no, yes, no, no, no, no, no, no, no, no, no, yes, no, no, no, no, no, no,…
$ job     <fct> freelance, freelance, fixed, fixed, fixed, fixed, partime, freelance, fix…
train_dat %>% 
  select_if(is.factor) %>% 
  summarize_all(function(x) sum(is.na(x)))
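
When we get to building recipes later in the walkthrough, categorical missingness has dedicated steps. For example (a sketch, assuming a recent version of recipes), step_unknown() recodes NA as an explicit "unknown" factor level, while step_impute_mode() fills in the most common level:

rec_cat <- recipe(status ~ ., data = train_dat) %>% 
  step_unknown(all_nominal(), -all_outcomes()) # recode NA as an explicit "unknown" level
# alternatively, step_impute_mode() replaces NA with the most frequent level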

Distributions

How is the continuous data distributed? And what does this tell us?

train_dat %>% 
  
  # only select the numeric variables
  select_if(is.numeric) %>% 
  
  # Pivot to longer data format (for faceting)
  pivot_longer(cols = everything()) %>% 
  
  # Plot histograms for each variable
  ggplot(aes(value)) +
  
  geom_histogram() +
  
  facet_wrap(~name, scales = "free", ncol = 3)
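
If the histograms reveal heavily right-skewed variables (plausible for income, assets, and debt), a log transformation is a common remedy. In a recipe this might look like the following sketch; the offset = 1 guards against zeros, since log(0) is undefined:

rec_log <- recipe(status ~ ., data = train_dat) %>% 
  step_log(income, assets, debt, offset = 1) # log(x + 1) to tolerate zero values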