Overview

In this notebook, we’ll apply the machine learning concepts covered in the classification supervised learning lecture. Note that many libraries can perform the methods we reviewed in class, but here we’ll focus on the caret package.

require(tidyverse)
require(caret) # for machine learning
require(recipes) # For preprocessing your data
require(rsample) # For train/test split
require(yardstick) # For performance metrics
require(rattle) # For nice tree plots

# For parallelization (to run the models across many cores) -- speeds up computation!
# install.packages("doMC")
doMC::registerDoMC()
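
# registerDoMC() also accepts a cores= argument if you want to control the
# number of workers explicitly. A hedged sketch (the "leave one core free"
# choice here is a convention, not a requirement):
# doMC::registerDoMC(cores = parallel::detectCores() - 1)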

Data

The following data contains information on whether or not someone has health coverage. The outcome of interest is whether an individual has healthcare coverage. The available predictive features capture socio-economic and descriptive factors.

set.seed(1988)

dat = suppressMessages(read_csv("health-coverage.csv")) %>% 
  
  # Encode education as a numeric ordinal scale (0 to 3).
  mutate(
    
    educ = case_when(
      educ == 'Less than HS' ~ 0,
      educ == 'HS Degree' ~ 1,
      educ == 'Undergraduate Degree' ~ 2,
      educ == 'Graduate Degree' ~ 3),
    
    race = case_when(
      race == 'White' ~ 'white',
      race == 'Black' ~ 'black',
      race == 'Asian' ~ 'asian',
      TRUE ~ 'other'),
    
    race = factor(race, levels = c('white','black','asian','other')),
    
    mar = factor(mar,levels= c('Never Married',
                               'Divorced','Married',
                               'Separated','Widowed'))
    
    ) %>% 
  
  # Convert all remaining character variables into factor variables
  mutate_if(is.character,as.factor) %>% 
  
  # Make sure that the category you want to predict is the first category in
  # your factor variable
  mutate(coverage = factor(coverage,
                           levels = c("Coverage","No_Coverage"))) %>% 
  
  # Only taking a random sample of the data so the models run quicker
  sample_n(5000) 

head(dat) # Peek at the data just to make sure everything was read in correctly. 
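
To double-check the factor ordering set above: caret treats the first factor level as the event of interest, so "Coverage" should come first. A minimal sketch with a toy vector (the values here are illustrative, not drawn from the data):

x = factor(c("No_Coverage", "Coverage", "Coverage"),
           levels = c("Coverage", "No_Coverage"))
levels(x) # "Coverage" "No_Coverage" -- the positive class is first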

Split the Sample: Training and test data

Before even looking at the data, let’s split the sample into a training and a test dataset. We’ll completely hold off on viewing the test data, so as not to bias our development of the learning model. Note that the strata= argument ensures that the training and test data contain a similar proportion of covered and not-covered individuals.

set.seed(123)
splits = initial_split(dat,prop = .8,strata = coverage)
train_data = training(splits) # Use 80% of the data as training data 
test_data = testing(splits) # holdout 20% as test data 

dim(train_data)
[1] 4002    7
dim(test_data) 
[1] 998   7
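
Because we stratified on coverage, the outcome proportions should be nearly identical across the two splits. A quick sanity check, assuming the train_data and test_data objects from above:

train_data %>% count(coverage) %>% mutate(prop = n / sum(n))
test_data %>% count(coverage) %>% mutate(prop = n / sum(n))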

Examine the data

NOTE: skimr provides a very nice summary of the data, but the mini-histograms will cause a lot of grief if you’re trying to knit to PDF. Feel free to use skimr interactively but not if you’re knitting to PDF.

skimr::skim(train_data)
── Data Summary ────────────────────────
                           Values    
Name                       train_data
Number of rows             4002      
Number of columns          7         
_______________________              
Column type frequency:               
  factor                   4         
  numeric                  3         
________________________             
Group variables                      

── Variable type: factor ────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique
1 coverage              0             1 FALSE          2
2 cit                   0             1 FALSE          2
3 mar                   0             1 FALSE          5
4 race                  0             1 FALSE          4
  top_counts                              
1 Cov: 2017, No_: 1985                    
2 Cit: 3603, Non: 399                     
3 Mar: 1736, Nev: 1450, Div: 508, Wid: 210
4 whi: 2451, bla: 1227, oth: 203, asi: 121

── Variable type: numeric ───────────────────────────────────────────────
  skim_variable n_missing complete_rate     mean        sd    p0   p25
1 age                   0             1    43.9     17.6      16    29
2 wage                  0             1 20119.   41877.        0     0
3 educ                  0             1     1.02     0.787     0     1
    p50   p75   p100 hist 
1    43    57     92 ▇▇▇▃▁
2  3000 25000 419000 ▇▁▁▁▁
3     1     1      3 ▃▇▁▂▁

Visualize the distribution for each variable.

First, let’s look at the categorical variables.

train_data %>% 
  select_if(is.factor) %>% 
  gather(var,val) %>% 
  ggplot(aes(val)) +
  geom_bar() +
  scale_y_log10() +
  facet_wrap(~var,scales="free_y",ncol=1) +
  coord_flip() +
  theme(text=element_text(size=16))
attributes are not identical across measure variables;
they will be dropped
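
The warning above is harmless: gather() drops factor attributes when stacking columns whose levels differ. One way to silence it, assuming the same train_data, is to convert the factors to character before reshaping:

train_data %>% 
  select_if(is.factor) %>% 
  mutate_all(as.character) %>% # plain character columns: no attributes to drop
  gather(var,val) %>% 
  ggplot(aes(val)) +
  geom_bar() +
  scale_y_log10() +
  facet_wrap(~var,scales="free_y",ncol=1) +
  coord_flip() +
  theme(text=element_text(size=16))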