In this notebook, we’ll leverage the data and best fit model that we learned from the walkthrough last week (Week 11). The aim is to extract insights from the machine learning model to better understand who is covered and who is not.
require(tidyverse) require(caret) # for machine learning require(recipes) # For preprocessing your data require(vip) # For variable importance require(pdp) # pdp and ice plots # For parallelization (to run the models across many cores) -- speeds up computation! # install.packages("doMC") doMC::registerDoMC()
The following data contains information regarding whether or not someone has health coverage. The outcome of interest is whether of not an individual has healthcare coverage or not. The available predictive features capture socio-economic and descriptive factors.
set.seed(1988) # Again use a small random sample of the data so that it runs quicker dat = suppressMessages(read_csv("health-coverage.csv")) %>% # Convert education into an ordered category. mutate( educ = case_when( educ == 'Less than HS' ~ 0, educ == 'HS Degree' ~ 1, educ == 'Undergraduate Degree' ~ 2, educ == 'Graduate Degree' ~ 3), race = case_when( race == 'White' ~ 'white', race == 'Black' ~ 'black', race == 'Asian' ~ 'asian', T ~ 'other'), race = factor(race,level=c("white",'black','asian','other')), mar = factor(mar,levels= c('Never Married', 'Divorced','Married', 'Separated','Widowed')) ) %>% # Convert all remaining factor variables into character variables mutate_if(is.character,as.factor) %>% # Make sure that the category you want to predict is the first category in # your factor variable mutate(coverage = factor(coverage, levels = c("Coverage","No_Coverage"))) %>% sample_n(5000) head(dat) # Peek at the data just to make sure everything was read in correctly.
As before, let’s split the sample up into a training and test dataset. We’ll completely hold off on viewing the test data. We’ll leverage the test data to explore some of the insights we learn from the training data. Let’s use
rsample to split the data.
set.seed(123) # install.packages("rsample") # If you don't have the pack split <- rsample::initial_split(dat,prop = .8,) train_data <- rsample::training(split) test_data <- rsample::testing(split) # Breakdown nrow(train_data)/nrow(dat)
In the Week 11 Walkthrough, we explore the data and outlined a preprocessing strategy. Let’s re-implement it here.
rcp <- recipe(coverage ~ .,train_data) %>% step_dummy(all_nominal(),-coverage) %>% # step_log(wage,offset = 1) %>% # Log the skewed wage variable step_range(all_numeric()) %>% # Normalize scale prep() # Apply the recipe to the training and test data train_data2 <- bake(rcp,train_data) test_data2 <- bake(rcp,test_data)
Let’s use the k-fold cross-validation method with 5 folds in the data to tune the model. We also need to specify that we’re dealing with a classification problem: recall we can do this by adding
summaryFunction = twoClassSummary and
classProbs = TRUE to the cross-validation function.
set.seed(1988) # set a seed for replication purposes # Partition the data into 5 equal folds folds <- createFolds(train_data2$coverage, k = 5) # Cross validation object control_conditions <- trainControl(method='cv', # K-fold cross validation summaryFunction = twoClassSummary, # Two classes classProbs = TRUE, # Need this b/c it's a classification problem index = folds # The indices for our folds )
In the Week 11 Walkthrough, we found that the random forest model had the best performance on predicting the outcome. Let’s re-run that model here.
mod_rf <- train(coverage ~ ., # Equation (outcome and everything else) data=train_data2, # Training data method = "ranger", # random forest (ranger is much faster than rf) metric = "ROC", # area under the curve trControl = control_conditions )
Plot the model output. When we plot the performance of the above model, we see that we get a fairly high AUC (.83) on the held out data.
# AUC on the Training data prob_pred <- predict(mod_rf,type="prob")$Coverage auc_rf <- Metrics::auc(train_data2$coverage=="Coverage",prob_pred) auc_rf
# Accuracy on the Training data acc_rf <- Metrics::accuracy(train_data2$coverage,predict(mod_rf)) acc_rf
Let’s now extract some substantive insights from the model using variable importance approach learned in class. Which variables does the model rely on most to make it’s predictions?
vi_plot<- vip(mod_rf, # Machine learning model train = train_data2, # Training data method="permute", # permuted importance nsim = 10, # number of times to impute geom = "boxplot", # Type of plot target = "coverage", # outcome reference_class = "Coverage", # what class are you predicting metric = "accuracy", pred_wrapper = predict) # Plot VIP vi_plot