caret
In this notebook, we’ll apply the machine learning concepts covered in the classification supervised learning lecture. Note that there are many libraries that can perform the methods we reviewed in class, but we’ll focus here on using the caret package to perform these operations.
require(tidyverse)
require(caret) # for machine learning
require(recipes) # For preprocessing your data
require(rsample) # For train/test split
require(yardstick) # For performance metrics
require(rattle) # For nice tree plots
# For parallelization (to run the models across many cores) -- speeds up computation!
# install.packages("doMC")
doMC::registerDoMC()
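If you want to control how many workers are used, registerDoMC() also accepts a cores argument. A small sketch (the core count here is just one sensible choice; note that doMC relies on forking, so it isn’t useful on Windows):
# Explicitly register workers, leaving one core free for the rest of your session
doMC::registerDoMC(cores = parallel::detectCores() - 1)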
The following data contain information on whether or not an individual has health coverage. The outcome of interest is whether an individual has health-care coverage, and the available predictive features capture socio-economic and demographic factors.
set.seed(1988)
dat = suppressMessages(read_csv("health-coverage.csv")) %>%
  # Recode education as an ordinal numeric variable (0 = less than HS, ..., 3 = graduate degree)
  mutate(
    educ = case_when(
      educ == 'Less than HS' ~ 0,
      educ == 'HS Degree' ~ 1,
      educ == 'Undergraduate Degree' ~ 2,
      educ == 'Graduate Degree' ~ 3),
    race = case_when(
      race == 'White' ~ 'white',
      race == 'Black' ~ 'black',
      race == 'Asian' ~ 'asian',
      T ~ 'other'),
    race = factor(race, levels = c("white",'black','asian','other')),
    mar = factor(mar, levels = c('Never Married',
                                 'Divorced','Married',
                                 'Separated','Widowed'))
  ) %>%
  # Convert all remaining character variables into factors
  mutate_if(is.character, as.factor) %>%
  # Make sure that the category you want to predict is the first level of
  # your factor variable
  mutate(coverage = factor(coverage,
                           levels = c("Coverage","No_Coverage"))) %>%
  # Take a random sample of the data so the models run quicker
  sample_n(5000)
head(dat) # Peek at the data just to make sure everything was read in correctly.
Before even looking at the data, let’s split the sample into a training and a test dataset. We’ll completely hold off on viewing the test data, so as not to bias our development of the learning model. Note that the strata = argument ensures that we have a similar proportion of covered and not-covered individuals in the training and test data.
set.seed(123)
splits = initial_split(dat,prop = .8,strata = coverage)
train_data = training(splits) # Use 80% of the data as training data
test_data = testing(splits) # holdout 20% as test data
dim(train_data)
[1] 4002 7
dim(test_data)
[1] 998 7
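As a quick check that the stratification worked, we can compare the class proportions in each split (a small sketch):
# Share of covered vs. not-covered individuals in the training and test sets
prop.table(table(train_data$coverage))
prop.table(table(test_data$coverage))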
NOTE: skimr provides a very nice summary of the data, but the mini-histograms will cause a lot of grief if you’re trying to knit to PDF. Feel free to use skimr interactively, but not if you’re knitting to PDF.
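If you do want a data summary in a knitted document, more recent versions of skimr (2.0 and above, an assumption about your installed version) provide skim_without_charts(), which drops the mini-histograms:
# Same summary as skim(), minus the unicode histograms that break LaTeX
skimr::skim_without_charts(train_data)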
skimr::skim(train_data)
── Data Summary ────────────────────────
Values
Name train_data
Number of rows 4002
Number of columns 7
_______________________
Column type frequency:
factor 4
numeric 3
________________________
Group variables
── Variable type: factor ────────────────────────────────────────────────
skim_variable n_missing complete_rate ordered n_unique
1 coverage 0 1 FALSE 2
2 cit 0 1 FALSE 2
3 mar 0 1 FALSE 5
4 race 0 1 FALSE 4
top_counts
1 Cov: 2017, No_: 1985
2 Cit: 3603, Non: 399
3 Mar: 1736, Nev: 1450, Div: 508, Wid: 210
4 whi: 2451, bla: 1227, oth: 203, asi: 121
── Variable type: numeric ───────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25
1 age 0 1 43.9 17.6 16 29
2 wage 0 1 20119. 41877. 0 0
3 educ 0 1 1.02 0.787 0 1
p50 p75 p100 hist
1 43 57 92 ▇▇▇▃▁
2 3000 25000 419000 ▇▁▁▁▁
3 1 1 3 ▃▇▁▂▁
Visualize the distribution for each variable.
First, let’s look at the categorical variables.
train_data %>%
select_if(is.factor) %>%
gather(var,val) %>%
ggplot(aes(val)) +
geom_bar() +
scale_y_log10() +
facet_wrap(~var,scales="free_y",ncol=1) +
coord_flip() +
theme(text=element_text(size=16))
attributes are not identical across measure variables;
they will be dropped
A few things to note: the two coverage classes are roughly balanced, the large majority of respondents are citizens, and the race and marital-status categories are quite unevenly sized (which is why the y-axis is on a log scale).
Second, let’s look at the distribution of the continuous variables.
train_data %>%
select_if(is.numeric) %>%
gather(var,val) %>%
ggplot(aes(val)) +
geom_histogram(bins = 75) +
facet_wrap(~var,scales="free",ncol=1)
Things to note: wage is heavily right-skewed, with a large spike at zero; age spans roughly 16 to 92; and educ takes on only four ordinal values (0 through 3).
Let’s peek more closely at the distribution of wealth in the sample.
train_data %>%
mutate(wealth_bins = case_when(
wage == 0 ~ "Unemployed",
wage > 0 & wage<= 30000 ~ "Low",
wage > 30000 & wage<= 75000 ~ "Mid",
wage > 75000 & wage<= 200000 ~ "High",
wage > 200000 ~ "Jeez...!",
)) %>%
mutate(wealth_bins = fct_infreq(wealth_bins)) %>%
ggplot(aes(wealth_bins)) +
geom_bar()
There are a lot of individuals that report 0 income (Unemployed?). What is the age distribution for those individuals?
train_data %>%
filter(wage==0) %>%
ggplot(aes(age)) +
geom_density(fill="steelblue",alpha=.5,color="white")
What do we need to do to get these data ready for modeling? Based on the distributions above, we need to dummy-code the nominal predictors, log-transform the heavily skewed wage variable, and rescale the numeric variables onto a comparable range. The recipe below does exactly that.
rcp <-
recipe(coverage ~ .,train_data) %>%
step_dummy(all_nominal(),-coverage) %>% # Why exclude outcome variable?
step_log(wage,offset = 1) %>% # Log the skewed wage variable
step_range(all_numeric()) %>% # Normalize scale
prep()
# Apply the recipe to the training and test data
train_data2 <- bake(rcp,train_data)
test_data2 <- bake(rcp,test_data)
Check that our pre-processing worked.
head(train_data2)
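If you want to see what the trained recipe actually did, the recipes package also provides summary() and tidy() methods for recipe objects (a quick sketch):
summary(rcp) # variables, their types, and their roles in the recipe
tidy(rcp)    # one row per preprocessing step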
train_data2 %>%
select(wage,age) %>%
gather(var,val) %>%
ggplot(aes(val)) +
geom_histogram(bins = 75) +
facet_wrap(~var,scales="free",ncol=1)
When comparing different machine learning models, we want to make sure we’re making a fair and equal comparison. One way that random chance can sneak into our assessments of model fit is through cross-validation. We want to make sure that we’re cross-validating the data on the exact same data partitions.
caret makes this easy to do. Let’s use the k-fold cross-validation method with 5 folds of the data.
set.seed(1988) # set a seed for replication purposes
# Partition the data into 5 equal folds
folds <- createFolds(train_data2$coverage, k = 5)
sapply(folds,length)
Fold1 Fold2 Fold3 Fold4 Fold5
800 801 800 801 800
Now, let’s use the trainControl() function from caret to set up our validation conditions. An important change from last week: we have to tell caret that we are dealing with a classification problem. We can do this by adding summaryFunction = twoClassSummary and classProbs = TRUE to the cross-validation function.
control_conditions <-
trainControl(method='cv', # K-fold cross validation
summaryFunction = twoClassSummary, # Need this b/c it's a classification problem
classProbs = TRUE, # Need this b/c it's a classification problem
index = folds # The indices for our folds (so they are always the same)
)
We’ll now use this same cross-validation object for everything that we do.
Let’s explore the different models that we covered in the lecture using the same package framework. As we saw last time in class, caret facilitates this task nicely.
mod_logit <-
train(coverage ~ .,
data=train_data2, # Training data
method = "glm", # logit function
metric = "ROC", # area under the curve
trControl = control_conditions
)
mod_logit
Generalized Linear Model
4002 samples
11 predictor
2 classes: 'Coverage', 'No_Coverage'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 800, 801, 800, 801, 800
Resampling results:
ROC Sens Spec
0.7800115 0.7149195 0.7083123
Remember: these algorithms are computationally demanding, so they can take a little time to run, depending on the power of your machine.
mod_knn <-
train(coverage ~ ., # Equation (outcome and everything else)
data=train_data2, # Training data
method = "knn", # K-Nearest Neighbors Algorithm
metric = "ROC", # area under the curve
trControl = control_conditions
)
Let’s look at the performance plots. What do we see?
mod_knn
k-Nearest Neighbors
4002 samples
11 predictor
2 classes: 'Coverage', 'No_Coverage'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 800, 801, 800, 801, 800
Resampling results across tuning parameters:
k ROC Sens Spec
5 0.7658549 0.6870269 0.7056675
7 0.7708521 0.6790933 0.7148615
9 0.7738201 0.6697991 0.7243073
ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
caret has default settings that automatically explore different tuning parameters for the model.
plot(mod_knn)
Now, let’s say we wanted to adjust the tuning parameters ourselves. We can explore different model tunings by using the tuneGrid argument.
knn_tune = expand.grid(k = c(1,3,10,50))
knn_tune
mod_knn2 <-
train(coverage ~ ., # Equation (outcome and everything else)
data=train_data2, # Training data
method = "knn", # K-Nearest Neighbors Algorithm
metric = "ROC", # area under the curve
tuneGrid = knn_tune, # add the tuning parameters here
trControl = control_conditions
)
plot(mod_knn2)
mod_cart <-
train(coverage ~ ., # Equation (outcome and everything else)
data=train_data2, # Training data
method = "rpart", # Classification Tree
metric = "ROC", # area under the curve
trControl = control_conditions
)
plot(mod_cart)
Shallower trees converge to a coin flip; the deepest tree appears to be the best.
Let’s visualize the decision tree…
# Plot the final decision tree
fancyRpartPlot(mod_cart$finalModel)
We can also print the final decision tree as text to inspect each split directly.
print(mod_cart$finalModel)
n= 4002
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 4002 1985 Coverage (0.50399800 0.49600200)
2) age>=0.6381579 522 20 Coverage (0.96168582 0.03831418) *
3) age< 0.6381579 3480 1515 No_Coverage (0.43534483 0.56465517)
6) wage>=0.8002206 755 167 Coverage (0.77880795 0.22119205) *
7) wage< 0.8002206 2725 927 No_Coverage (0.34018349 0.65981651) *
mod_rf <-
train(coverage ~ ., # Equation (outcome and everything else)
data=train_data2, # Training data
method = "ranger", # random forest (ranger is much faster than rf)
metric = "ROC", # area under the curve
trControl = control_conditions
)
There are three tuning parameters in this model (caret explores them by default, as shown in the plot below):
mtry: the number of predictors that we’ll randomly select at each split.
splitrule: the way we determine how the nodes should be split:
gini = node purity (see class lecture)
extratrees = rather than searching for the single best cut point, randomly propose split points among the randomly selected variables and accept the best of those (short for “Extremely Randomized Trees”).
min.node.size: the minimum number of observations allowed in a terminal node.
plot(mod_rf)
Let’s explore three versions of the support vector machine algorithm: a linear boundary, a polynomial (kernel) boundary, and a radial boundary. Like logistic regression, SVM will complain when there isn’t sufficient variation within a variable.
mod_svm_linear <-
train(coverage ~ . ,
data=train_data2, # Training data
method = "svmLinear", # SVM with a polynomial Kernel
metric = "ROC", # area under the curve
tuneGrid = expand.grid(C = c(.5,1)), # Try two values of the cost parameter C
trControl = control_conditions
)
mod_svm_linear
Support Vector Machines with Linear Kernel
4002 samples
11 predictor
2 classes: 'Coverage', 'No_Coverage'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 800, 801, 800, 801, 800
Resampling results across tuning parameters:
C ROC Sens Spec
0.5 0.7774877 0.6969392 0.7113350
1.0 0.7767458 0.6887603 0.7193955
ROC was used to select the optimal model using the largest value.
The final value used for the model was C = 0.5.
mod_svm_poly <-
train(coverage ~ .,
data=train_data2, # Training data
method = "svmPoly", # SVM with a polynomial Kernel
metric = "ROC", # area under the curve
trControl = control_conditions
)
plot(mod_svm_poly)
mod_svm_radial <-
train(coverage ~ .,
data=train_data2, # Training data
method = "svmRadial", # SVM with a Radial Kernel
metric = "ROC", # area under the curve
trControl = control_conditions
)
plot(mod_svm_radial)
How did the different methods perform? Which one did the best?
# Organize all of the fitted model objects as a list.
mod_list <-
list(
glm=mod_logit,
knn1 = mod_knn,
knn2 = mod_knn2,
cart = mod_cart,
rf = mod_rf,
svm_linear = mod_svm_linear,
svm_poly = mod_svm_poly,
svm_radial = mod_svm_radial
)
# Generate Plot to compare output.
dotplot(resamples(mod_list),metric = "ROC")
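If you prefer numbers to the dotplot, resamples objects also have a summary() method (a quick sketch):
# Numeric summary of the cross-validated ROC, sensitivity, and specificity for each model
summary(resamples(mod_list))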
Examine the predictive performance of the best performing model.
pred <- predict(mod_rf, newdata = test_data2)
confusionMatrix(table(pred,test_data2$coverage))
Confusion Matrix and Statistics
pred Coverage No_Coverage
Coverage 341 90
No_Coverage 162 405
Accuracy : 0.7475
95% CI : (0.7193, 0.7742)
No Information Rate : 0.504
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4955
Mcnemar's Test P-Value : 7.728e-06
Sensitivity : 0.6779
Specificity : 0.8182
Pos Pred Value : 0.7912
Neg Pred Value : 0.7143
Prevalence : 0.5040
Detection Rate : 0.3417
Detection Prevalence : 0.4319
Balanced Accuracy : 0.7481
'Positive' Class : Coverage
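As a quick sanity check, the sensitivity and specificity reported above can be recomputed by hand from the counts in the confusion matrix:
# Sensitivity: true "Coverage" cases that were predicted as "Coverage"
341 / (341 + 162)  # ~0.678
# Specificity: true "No_Coverage" cases that were predicted as "No_Coverage"
405 / (405 + 90)   # ~0.818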
We can also use the yardstick package to generate specific performance metrics.
# Generate predicted probabilities for both classes ("Coverage" and "No_Coverage")
pred_probability <- predict(mod_rf, newdata = test_data2, type = "prob")
pred <- predict(mod_rf, newdata = test_data2, type = "raw")
# Organize as a data frame
performance <- tibble(truth = test_data2$coverage,
                      prob = pred_probability$Coverage,
                      pred = pred)
# Calculate performance metrics
bind_rows(
  performance %>% roc_auc(truth, prob),
  performance %>% accuracy(truth, pred)
)
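yardstick can also trace out the full ROC curve from the same ingredients (a quick sketch using the performance data frame built above):
# ROC curve for the random forest predictions on the test data
performance %>%
  roc_curve(truth, prob) %>%
  autoplot()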