In this notebook, we’ll apply the machine learning concepts covered in the classification supervised learning lecture. Note that there are many libraries that can perform the methods that we reviewed in class, but we’ll focus here on using the
caret package to perform these operations.
require(tidyverse)
require(caret)     # for machine learning
require(recipes)   # for preprocessing your data
require(rsample)   # for the train/test split
require(yardstick) # for performance metrics
require(rattle)    # for nice tree plots

# For parallelization (to run the models across many cores) -- speeds up computation!
# install.packages("doMC")
doMC::registerDoMC()
The following data contain information on whether or not someone has health coverage. The outcome of interest is whether an individual has healthcare coverage; the available predictive features capture socio-economic and demographic factors.
set.seed(1988)
dat = suppressMessages(read_csv("health-coverage.csv")) %>%
  # Convert education into an ordered numeric scale.
  mutate(
    educ = case_when(
      educ == 'Less than HS' ~ 0,
      educ == 'HS Degree' ~ 1,
      educ == 'Undergraduate Degree' ~ 2,
      educ == 'Graduate Degree' ~ 3),
    race = case_when(
      race == 'White' ~ 'white',
      race == 'Black' ~ 'black',
      race == 'Asian' ~ 'asian',
      T ~ 'other'),
    race = factor(race, levels = c('white', 'black', 'asian', 'other')),
    mar = factor(mar, levels = c('Never Married', 'Divorced', 'Married',
                                 'Separated', 'Widowed'))
  ) %>%
  # Convert all remaining character variables into factor variables
  mutate_if(is.character, as.factor) %>%
  # Make sure that the category you want to predict is the first level of
  # your factor variable
  mutate(coverage = factor(coverage, levels = c("Coverage", "No_Coverage"))) %>%
  # Only take a random sample of the data so the models run quicker
  sample_n(5000)

head(dat) # Peek at the data to make sure everything was read in correctly.
Before even looking at the data, let’s split the sample into a training and a test dataset. We’ll completely hold off on viewing the test data, so as not to bias our development of the learning model. Note that the strata= argument ensures that we have a similar proportion of covered and not-covered individuals in the training and test data.
set.seed(123)
splits = initial_split(dat, prop = .8, strata = coverage)
train_data = training(splits) # Use 80% of the data as training data
test_data = testing(splits)   # Hold out 20% as test data
dim(train_data)
dim(test_data)
## [1] 4002    7
## [1]  998    7
skimr provides a very nice summary of the data, but the mini-histograms will cause a lot of grief if you’re trying to knit to PDF. Feel free to use skimr interactively, but not if you’re knitting to PDF.
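The summary below comes from skimr; a minimal sketch of the call that produces it (assuming skimr is installed and train_data exists) looks like:

```r
library(skimr)

# Summarize every column of the training data; skim() groups its
# output by column type (factor vs. numeric) and reports missingness,
# counts, and distribution summaries for each variable.
skim(train_data)
```

Run interactively, this prints the grouped summary tables shown below.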
── Data Summary ────────────────────────
                           Values    
Name                       train_data
Number of rows             4002      
Number of columns          7         
_______________________              
Column type frequency:               
  factor                   4         
  numeric                  3         
________________________             
Group variables                      

── Variable type: factor ────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique top_counts
1 coverage              0             1 FALSE          2 Cov: 2017, No_: 1985
2 cit                   0             1 FALSE          2 Cit: 3603, Non: 399
3 mar                   0             1 FALSE          5 Mar: 1736, Nev: 1450, Div: 508, Wid: 210
4 race                  0             1 FALSE          4 whi: 2451, bla: 1227, oth: 203, asi: 121

── Variable type: numeric ───────────────────────────────────────────────
  skim_variable n_missing complete_rate     mean      sd p0 p25  p50   p75   p100 hist
1 age                   0             1    43.9    17.6  16  29   43    57     92 ▇▇▇▃▁
2 wage                  0             1 20119.  41877.    0   0 3000 25000 419000 ▇▁▁▁▁
3 educ                  0             1    1.02   0.787   0   1    1     1      3 ▃▇▁▂▁
Visualize the distribution for each variable.
First, let’s look at the categorical variables.
train_data %>%
  select_if(is.factor) %>%
  gather(var, val) %>%
  ggplot(aes(val)) +
  geom_bar() +
  scale_y_log10() +
  facet_wrap(~var, scales = "free_y", ncol = 1) +
  coord_flip() +
  theme(text = element_text(size = 16))
## Warning: attributes are not identical across measure variables; they will be
## dropped
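The numeric variables can be inspected the same way; a sketch that mirrors the categorical plot above, swapping geom_bar() for geom_histogram() (bin count here is an arbitrary choice):

```r
# Plot a histogram for each numeric variable, one panel per variable.
train_data %>%
  select_if(is.numeric) %>%
  gather(var, val) %>%
  ggplot(aes(val)) +
  geom_histogram(bins = 30) +
  facet_wrap(~var, scales = "free", ncol = 1) +
  theme(text = element_text(size = 16))
```

Using scales = "free" lets each panel pick its own x and y ranges, since age, wage, and educ live on very different scales.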