In this notebook, we’ll leverage the data and the best-fitting model from last week’s walkthrough (Week 11). The aim is to extract insights from the machine learning model to better understand who is covered and who is not.
require(tidyverse)
require(caret) # for machine learning
require(recipes) # For preprocessing your data
require(vip) # For variable importance
require(pdp) # pdp and ice plots
# For parallelization (to run the models across many cores) -- speeds up computation!
# install.packages("doMC")
doMC::registerDoMC()
The following data contains information on health insurance coverage. The outcome of interest is whether or not an individual has healthcare coverage. The available predictive features capture socio-economic and descriptive factors.
set.seed(1988)
# Again use a small random sample of the data so that it runs quicker
dat = suppressMessages(read_csv("health-coverage.csv")) %>%
# Convert education into an ordered numeric scale (0-3).
mutate(
educ = case_when(
educ == 'Less than HS' ~ 0,
educ == 'HS Degree' ~ 1,
educ == 'Undergraduate Degree' ~ 2,
educ == 'Graduate Degree' ~ 3),
race = case_when(
race == 'White' ~ 'white',
race == 'Black' ~ 'black',
race == 'Asian' ~ 'asian',
T ~ 'other'),
race = factor(race, levels = c('white', 'black', 'asian', 'other')),
mar = factor(mar,levels= c('Never Married',
'Divorced','Married',
'Separated','Widowed'))
) %>%
# Convert all remaining character variables into factors
mutate_if(is.character,as.factor) %>%
# Make sure that the category you want to predict is the first category in
# your factor variable
mutate(coverage = factor(coverage,
levels = c("Coverage","No_Coverage"))) %>%
sample_n(5000)
head(dat) # Peek at the data just to make sure everything was read in correctly.
As before, let’s split the sample into training and test datasets. We’ll completely hold off on viewing the test data until we use it to explore the insights we learn from the training data. Let’s use rsample to split the data.
set.seed(123)
# install.packages("rsample") # If you don't have the pack
split <- rsample::initial_split(dat,prop = .8,)
train_data <- rsample::training(split)
test_data <- rsample::testing(split)
# Breakdown
nrow(train_data)/nrow(dat)
[1] 0.8002
nrow(test_data)/nrow(dat)
[1] 0.1998
In the Week 11 Walkthrough, we explored the data and outlined a preprocessing strategy. Let’s re-implement it here.
rcp <-
recipe(coverage ~ .,train_data) %>%
step_dummy(all_nominal(), -coverage) %>% # Convert all nominal predictors into dummy variables
step_log(wage,offset = 1) %>% # Log the skewed wage variable
step_range(all_numeric()) %>% # Normalize scale
prep()
# Apply the recipe to the training and test data
train_data2 <- bake(rcp,train_data)
test_data2 <- bake(rcp,test_data)
Let’s use the k-fold cross-validation method with 5 folds to tune the model. We also need to specify that we’re dealing with a classification problem: recall we can do this by adding summaryFunction = twoClassSummary and classProbs = TRUE to the cross-validation function.
set.seed(1988) # set a seed for replication purposes
# Partition the data into 5 equal folds
folds <- createFolds(train_data2$coverage, k = 5, returnTrain = TRUE) # return training indices, which is what trainControl's index argument expects
# Cross validation object
control_conditions <-
trainControl(method='cv', # K-fold cross validation
summaryFunction = twoClassSummary, # Two classes
classProbs = TRUE, # Need this b/c it's a classification problem
index = folds # The indices for our folds
)
In the Week 11 Walkthrough, we found that the random forest model had the best performance on predicting the outcome. Let’s re-run that model here.
mod_rf <-
train(coverage ~ ., # Equation (outcome and everything else)
data=train_data2, # Training data
method = "ranger", # random forest (ranger is much faster than rf)
metric = "ROC", # area under the curve
trControl = control_conditions
)
Let’s plot the model output. When we plot the cross-validated performance of the above model, we see that we get a fairly high AUC (.83) on the held-out data.
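As a minimal sketch, caret’s built-in plot method and results table can be used to inspect that resampled performance (exact numbers will vary with the seed and tuning grid):
# Cross-validated ROC across the tuning grid
plot(mod_rf)
# Resampled performance for each tuning parameter combination
mod_rf$results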
# AUC on the Training data
prob_pred <- predict(mod_rf,type="prob")$Coverage
auc_rf <- Metrics::auc(train_data2$coverage=="Coverage",prob_pred)
auc_rf
[1] 0.8838971
# Accuracy on the Training data
acc_rf <- Metrics::accuracy(train_data2$coverage,predict(mod_rf))
acc_rf
[1] 0.7853037
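The same metrics can also be computed on the baked test data as a quick check. A minimal sketch, using the test_data2 object created above (test-set values will differ somewhat from the training figures):
# AUC and accuracy on the held-out test data
test_prob <- predict(mod_rf, newdata = test_data2, type = "prob")$Coverage
Metrics::auc(test_data2$coverage == "Coverage", test_prob)
Metrics::accuracy(test_data2$coverage, predict(mod_rf, newdata = test_data2))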
Let’s now extract some substantive insights from the model using the variable importance approach learned in class. Which variables does the model rely on most to make its predictions?
vi_plot<-
vip(mod_rf, # Machine learning model
train = train_data2, # Training data
method="permute", # permuted importance
nsim = 10, # number of times to permute
geom = "boxplot", # Type of plot
target = "coverage", # outcome
reference_class = "Coverage", # what class are you predicting
metric = "accuracy",
pred_wrapper = predict)
# Plot VIP
vi_plot
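Beyond variable importance, the pdp package loaded at the top can show how the model’s predictions change across the range of a single feature. A minimal sketch, assuming the preprocessed wage variable keeps its name after baking and that "Coverage" is the first level of the outcome (as set above):
# Partial dependence of the predicted coverage probability on (preprocessed) wage
pd_wage <- pdp::partial(mod_rf,
                        pred.var = "wage",
                        train = train_data2,
                        prob = TRUE, # return predicted class probabilities
                        which.class = 1L) # first outcome level = "Coverage"
pdp::plotPartial(pd_wage)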