caret
In this notebook, we’ll apply the machine learning concepts covered in the supervised learning lecture on classification. Many libraries implement the methods we reviewed in class, but here we’ll focus on the caret package.
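Throughout, the main entry point will be caret’s train() function, which provides one interface across many model types. As a preview (this is only a sketch: train_data is constructed later in the notebook, and the "glm" method and 5-fold cross-validation settings here are placeholder choices):
fit = train(coverage ~ .,            # formula interface: predict coverage from all features
            data = train_data,       # built later in this notebook
            method = "glm",          # swap method= to change model types
            trControl = trainControl(method = "cv", number = 5))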
require(tidyverse)
require(caret) # for machine learning
require(recipes) # For preprocessing your data
require(rsample) # For train/test split
require(yardstick) # For performance metrics
require(rattle) # For nice tree plots
# For parallelization (to run the models across many cores) -- speeds up computation!
# install.packages("doMC")
doMC::registerDoMC()
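If you want to control how many cores the models use, registerDoMC() accepts a cores argument (the value 4 below is just an illustration; parallel::detectCores() reports what your machine actually has):
# Register an explicit number of worker cores (4 is illustrative)
doMC::registerDoMC(cores = 4)
# Confirm how many workers the foreach backend will use
foreach::getDoParWorkers()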
The following data contain information on whether or not someone has health coverage. The outcome of interest is whether an individual has healthcare coverage, and the available predictive features capture socio-economic and descriptive factors.
set.seed(1988)
dat = suppressMessages(read_csv("health-coverage.csv")) %>%
# Recode education as an ordered numeric scale (0 = lowest).
mutate(
educ = case_when(
educ == 'Less than HS' ~ 0,
educ == 'HS Degree' ~ 1,
educ == 'Undergraduate Degree' ~ 2,
educ == 'Graduate Degree' ~ 3),
race = case_when(
race == 'White' ~ 'white',
race == 'Black' ~ 'black',
race == 'Asian' ~ 'asian',
T ~ 'other'),
race = factor(race,levels=c('white','black','asian','other')),
mar = factor(mar,levels= c('Never Married',
'Divorced','Married',
'Separated','Widowed'))
) %>%
# Convert all remaining character variables into factors
mutate_if(is.character,as.factor) %>%
# Make sure that the category you want to predict is the first category in
# your factor variable
mutate(coverage = factor(coverage,
levels = c("Coverage","No_Coverage"))) %>%
# Only taking a random sample of the data so the models run quicker
sample_n(5000)
head(dat) # Peek at the data just to make sure everything was read in correctly.
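Because caret and yardstick treat the first factor level as the “positive” class when computing classification metrics, it’s worth verifying the level order explicitly. A quick sanity check:
# The first level should be "Coverage" -- caret/yardstick treat the
# first level of the outcome as the positive class
levels(dat$coverage)
table(dat$coverage)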
Before even looking at the data, let’s split the sample into a training and a test dataset. We’ll completely hold off on viewing the test data, so as not to bias our development of the learning model. Note that the strata= argument ensures that we have a similar proportion of covered and not-covered individuals in the training and test data.
set.seed(123)
splits = initial_split(dat,prop = .8,strata = coverage)
train_data = training(splits) # Use 80% of the data as training data
test_data = testing(splits) # holdout 20% as test data
dim(train_data)
[1] 4002 7
dim(test_data)
[1] 998 7
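To confirm that the stratified split worked, compare the outcome proportions across the two partitions; they should be nearly identical:
# Class balance should be roughly equal across partitions
prop.table(table(train_data$coverage))
prop.table(table(test_data$coverage))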
NOTE: skimr provides a very nice summary of the data, but the mini-histograms will cause a lot of grief if you’re trying to knit to PDF. Feel free to use skimr interactively, but not if you’re knitting to PDF.
skimr::skim(train_data)
── Data Summary ────────────────────────
Values
Name train_data
Number of rows 4002
Number of columns 7
_______________________
Column type frequency:
factor 4
numeric 3
________________________
Group variables
── Variable type: factor ────────────────────────────────────────────────
skim_variable n_missing complete_rate ordered n_unique
1 coverage 0 1 FALSE 2
2 cit 0 1 FALSE 2
3 mar 0 1 FALSE 5
4 race 0 1 FALSE 4
top_counts
1 Cov: 2017, No_: 1985
2 Cit: 3603, Non: 399
3 Mar: 1736, Nev: 1450, Div: 508, Wid: 210
4 whi: 2451, bla: 1227, oth: 203, asi: 121
── Variable type: numeric ───────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25
1 age 0 1 43.9 17.6 16 29
2 wage 0 1 20119. 41877. 0 0
3 educ 0 1 1.02 0.787 0 1
p50 p75 p100 hist
1 43 57 92 ▇▇▇▃▁
2 3000 25000 419000 ▇▁▁▁▁
3 1 1 3 ▃▇▁▂▁
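If you do need a skimr-style summary in a PDF document, recent versions of skimr provide a variant without the spark histograms (the unicode bars are what typically break LaTeX):
# Same summary as skim(), but without the mini-histograms, so it is
# safe to knit to PDF
skimr::skim_without_charts(train_data)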
Visualize the distribution for each variable.
First, let’s look at the categorical variables.
train_data %>%
select_if(is.factor) %>%
gather(var,val) %>%
ggplot(aes(val)) +
geom_bar() +
scale_y_log10() +
facet_wrap(~var,scales="free_y",ncol=1) +
coord_flip() +
theme(text=element_text(size=16))
Warning: attributes are not identical across measure variables; they will be dropped

This warning is expected here: the factor columns have different level sets, so gather() drops the factor attributes when stacking them into a single column.
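The numeric variables can be inspected the same way. A minimal sketch following the same gather-and-facet pattern (this plot isn’t shown above; bins = 30 is an arbitrary choice, and free scales are used since age, wage, and educ sit on very different ranges):
train_data %>%
  select_if(is.numeric) %>%
  gather(var,val) %>%
  ggplot(aes(val)) +
  geom_histogram(bins=30) +
  facet_wrap(~var,scales="free",ncol=1) +
  theme(text=element_text(size=16))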