Overview

In this notebook, we’ll use the data and the best-performing model from last week’s walkthrough (Week 11). The aim is to extract insights from the machine learning model to better understand who has health coverage and who does not.

require(tidyverse)
require(caret) # for machine learning
require(recipes) # For preprocessing your data
require(vip) # For variable importance
require(pdp) # pdp and ice plots

# For parallelization (to run the models across many cores) -- speeds up computation!
# install.packages("doMC")
doMC::registerDoMC()
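
By default, registerDoMC() picks a number of workers for you; you can also set it explicitly and confirm what the foreach backend will use. A minimal sketch (doMC relies on forking, so it is intended for Unix-like systems):

# Register an explicit number of workers (here, all but one available core)
n_cores <- parallel::detectCores() - 1
doMC::registerDoMC(cores = n_cores)

# Confirm how many workers the foreach backend will use
foreach::getDoParWorkers()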

Data

The following data contain information on whether or not someone has health coverage. The outcome of interest is whether an individual has healthcare coverage. The available predictive features capture socio-economic and demographic factors.

set.seed(1988)

# Again use a small random sample of the data so that it runs quicker
dat = suppressMessages(read_csv("health-coverage.csv")) %>%
  
  # Convert education into a numeric ordinal scale (0 = lowest, 3 = highest).
  mutate(
    
    educ = case_when(
      educ == 'Less than HS' ~ 0,
      educ == 'HS Degree' ~ 1,
      educ == 'Undergraduate Degree' ~ 2,
      educ == 'Graduate Degree' ~ 3),
    
    race = case_when(
      race == 'White' ~ 'white',
      race == 'Black' ~ 'black',
      race == 'Asian' ~ 'asian',
      T ~ 'other'),
    
    race = factor(race,level=c("white",'black','asian','other')),
    
    mar = factor(mar,levels= c('Never Married',
                               'Divorced','Married',
                               'Separated','Widowed'))
    
    ) %>% 
  
  # Convert all remaining character variables into factors
  mutate_if(is.character,as.factor) %>% 
  
  # Make sure that the category you want to predict is the first category in
  # your factor variable
  mutate(coverage = factor(coverage,
                           levels = c("Coverage","No_Coverage"))) %>% 
  
  sample_n(5000)

head(dat) # Peek at the data just to make sure everything was read in correctly. 
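
It is also worth checking how balanced the outcome is before modeling. A quick tabulation of the coverage variable (a minimal check, assuming the recoding above ran cleanly):

# Class balance of the outcome
table(dat$coverage)
prop.table(table(dat$coverage))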

Split the Sample: Training and test data

As before, let’s split the sample into a training and a test dataset. We’ll hold off on touching the test data until later, when we’ll use it to check the insights we learn from the training data. Let’s use rsample to split the data.

set.seed(123)
# install.packages("rsample") # If you don't have the pack
split <- rsample::initial_split(dat,prop = .8,)
train_data <- rsample::training(split) 
test_data  <- rsample::testing(split) 


# Breakdown
nrow(train_data)/nrow(dat) 
[1] 0.8002
nrow(test_data)/nrow(dat)
[1] 0.1998
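
It is also worth confirming that the split preserves the outcome distribution; if the classes were badly imbalanced, you could pass strata = "coverage" to initial_split(). A quick check of the current split:

# Compare the outcome distribution across the two splits
prop.table(table(train_data$coverage))
prop.table(table(test_data$coverage))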

Data Pre-processing

In the Week 11 Walkthrough, we explored the data and outlined a preprocessing strategy. Let’s re-implement it here.

rcp <- 
  recipe(coverage ~  .,train_data) %>% 
  step_dummy(all_nominal(), -coverage) %>% # Dummy code the categorical predictors (except the outcome)
  step_log(wage, offset = 1) %>% # Log the skewed wage variable
  step_range(all_numeric()) %>%  # Rescale all numeric variables to the 0-1 range
  prep()


# Apply the recipe to the training and test data
train_data2 <- bake(rcp,train_data)
test_data2 <- bake(rcp,test_data) 
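
As a quick sanity check, the baked data should now contain dummy columns for the categorical predictors, with every numeric variable rescaled to the 0-1 range (a sketch, assuming the recipe above ran without issue):

# The baked predictors should be numeric, dummy-coded, and scaled to [0, 1]
glimpse(train_data2)
summary(train_data2$wage) # logged, then rescaled by step_range()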

Cross-validation

Let’s use k-fold cross-validation with 5 folds to tune the model. We also need to specify that we’re dealing with a classification problem: recall that we can do this by adding summaryFunction = twoClassSummary and classProbs = TRUE to the cross-validation function.

set.seed(1988) # set a seed for replication purposes 

# Partition the data into 5 equal folds
folds <- createFolds(train_data2$coverage, k = 5) 

# Cross validation object
control_conditions <- 
  trainControl(method='cv', # K-fold cross validation
               summaryFunction = twoClassSummary, # Two classes 
               classProbs = TRUE, # Need this b/c it's a classification problem
               index = folds # The indices for our folds
  )
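
Each element of folds is just a vector of row indices; a quick check confirms that each fold holds roughly a fifth of the training observations:

# Each fold should hold roughly 1/5 of the training rows
sapply(folds, length)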

Fit Model

In the Week 11 Walkthrough, we found that the random forest model performed best at predicting the outcome. Let’s re-run that model here.

mod_rf <-
  train(coverage ~ ., # Equation (outcome and everything else)
        data=train_data2, # Training data 
        method = "ranger", # random forest (ranger is much faster than rf)
        metric = "ROC", # area under the curve
        trControl = control_conditions
  )

Plot the model output. When we plot the performance of the model above, we see a fairly high AUC (roughly .83) on the held-out folds. Below, we also compute the AUC and accuracy on the training data.
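
A minimal way to visualize the cross-validated results is to pass the fitted caret object to ggplot(), which plots the ROC metric across the tuning grid:

# Plot cross-validated ROC across the tuning grid
ggplot(mod_rf)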

# AUC on the Training data 
prob_pred <- predict(mod_rf,type="prob")$Coverage
auc_rf <- Metrics::auc(train_data2$coverage=="Coverage",prob_pred)
auc_rf
[1] 0.8838971
# Accuracy on the Training data 
acc_rf <- Metrics::accuracy(train_data2$coverage,predict(mod_rf))
acc_rf
[1] 0.7853037
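
The training-set numbers above are optimistic by construction. Since the test data have already been baked with the same recipe, we can compute the same metrics out of sample (a sketch using test_data2 from above):

# Out-of-sample performance on the held-out test data
prob_pred_test <- predict(mod_rf, newdata = test_data2, type = "prob")$Coverage
Metrics::auc(test_data2$coverage == "Coverage", prob_pred_test)
Metrics::accuracy(test_data2$coverage, predict(mod_rf, newdata = test_data2))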

Model Interpretation

Variable Importance

Let’s now extract some substantive insights from the model using the permutation variable importance approach we learned in class. Which variables does the model rely on most to make its predictions?

vi_plot<- 
  vip(mod_rf, # Machine learning model
      train = train_data2, # Training data 
      method="permute", # permuted importance
      nsim = 10, # number of times to impute
      geom = "boxplot", # Type of plot 
      target = "coverage", # outcome
      reference_class = "Coverage", # what class are you predicting
      metric = "accuracy",
      pred_wrapper = predict)

# Plot VIP
vi_plot
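
The pdp package loaded at the top can take the interpretation a step further: once we know which variables the model leans on, partial dependence plots show how the predicted probability of coverage changes across a variable’s range. A minimal sketch for the wage variable (logged and rescaled to 0-1 by the recipe), assuming mod_rf and train_data2 as above:

# Partial dependence of the predicted probability of coverage on wage
pdp_wage <- pdp::partial(
  mod_rf,
  pred.var = "wage",   # feature to vary
  train = train_data2, # baked training data
  prob = TRUE          # return predictions on the probability scale
)

# The default focus class is the first factor level ("Coverage")
pdp::plotPartial(pdp_wage)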