PPOL564 | Data Science 1 | Foundations

Interpretable Machine Learning Walkthrough

Overview

The aim is to use a statistical learning model to extract insights into who does and does not have healthcare coverage.

Dependencies
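The original import cell isn't shown here; a plausible set, given the libraries the walkthrough leans on below (pandas/NumPy for wrangling, scikit-learn for modeling, matplotlib for plotting):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# scikit-learn pieces used throughout the walkthrough
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
from sklearn.tree import DecisionTreeRegressor, plot_tree
```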

Data

The following data contains information regarding whether or not someone has health coverage (coverage). The available predictive features capture socio-economic and descriptive factors.

Note: We'll only take a subset of this data so that the models will run faster, but feel free to explore the entire data set.
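A minimal sketch of the load-and-subset step; the file name and the dat variable are placeholders, and the sampling fraction is an assumption:

```python
import pandas as pd

# Hypothetical path -- point this at the actual course data.
dat = pd.read_csv("health-coverage.csv")

# Random subset so the models run faster; sample the full data
# (or skip this step) to explore the entire data set.
dat = dat.sample(frac=.25, random_state=123)
```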

Convert categorical variables to categories.
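Something along these lines, casting every string column to pandas' category dtype:

```python
# Cast all object (string) columns to the category dtype.
for col in dat.select_dtypes(include="object").columns:
    dat[col] = dat[col].astype("category")
```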

Missingness?
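A quick check, as a sketch:

```python
# Missing values per column, as counts and as a share of all rows.
print(dat.isna().sum())
print(dat.isna().mean().round(3))
```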

Split
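A sketch of the split; the 75/25 proportions and the seed are assumptions:

```python
from sklearn.model_selection import train_test_split

# Hold out 25% of the data as a test set.
train, test = train_test_split(dat, test_size=.25, random_state=123)
```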

Explore Training Set

Looks like there is a right skew in wage. Consider log transforming, but notice that another skew appears once we log transform.

To tackle this, let's consider converting wage into an ordinal measure where the base category is "unemployed".
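One way to sketch that conversion, assuming a wage of zero marks the unemployed; the quartile binning is illustrative, not the course's exact coding:

```python
import pandas as pd

# 0 = unemployed (base category); 1-4 = quartiles of positive wages.
employed = train.wage > 0
wage_ord = pd.Series(0, index=train.index)
wage_ord[employed] = pd.qcut(train.wage[employed], q=4, labels=False) + 1
```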

Now, let's look at the categorical predictors.
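For instance, tabulating each one:

```python
# Frequency table for every categorical predictor.
for col in train.select_dtypes(include="category").columns:
    print(train[col].value_counts(), end="\n\n")
```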

Things to note:

Preprocessing

High-Level Preprocessing

There are high-level transformations we want to impose on both the training and test data. These transformations don't require that we use any information learned from the test data (e.g., a mean, a min). Rather, they are formatting changes/transformations that will make our lives easier downstream.

educ

Impose an ordering on the education levels.

Let's convert this category into a numeric variable.

This is the only way we can ensure the correct ordering. Don't trust an encoder (e.g., OrdinalEncoder) to get it right.
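A sketch of the hand-coded mapping; the level names here are assumptions, so match them to the levels that actually appear in the data:

```python
# Spell out the ordering explicitly rather than trusting an encoder.
educ_order = {"less than high school": 0,   # assumed level names
              "high school": 1,
              "some college": 2,
              "college": 3,
              "advanced degree": 4}
dat["educ"] = dat["educ"].map(educ_order)
```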

wage

Next, let's log wage. Note that we add an offset because we have zeros in the data (the log of zero is negative infinity).
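A minimal sketch:

```python
import numpy as np

# +1 offset so zero wages map to log(1) = 0 rather than -infinity.
dat["wage"] = np.log(dat["wage"] + 1)
```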

race

Let's collapse the racial categories. For some of the categories, there is very little representation in the data. By collapsing, we increase these bin sizes.
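A sketch of the collapse; which categories are large enough to keep is an assumption, so base the choice on the training-set value counts:

```python
import numpy as np

# Fold thinly populated categories into a single "other" bin.
keep = ["white", "black", "asian"]   # assumed level names
dat["race"] = np.where(dat["race"].isin(keep), dat["race"], "other")
```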

mar

Convert categories to dummies
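A sketch using pandas; race is included here as well (an assumption) since its collapsed levels are still nominal strings and the scaler downstream needs all-numeric input:

```python
import pandas as pd

# One-hot encode the nominal categoricals, dropping the first level
# of each to avoid perfect collinearity.
dat = pd.get_dummies(dat, columns=["mar", "race"], drop_first=True)
```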

cit

Convert to a dummy variable.
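Assuming cit has two levels, dropping the first leaves a single 0/1 column:

```python
import pandas as pd

# Two levels -> one dummy column.
dat = pd.get_dummies(dat, columns=["cit"], drop_first=True)
```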

coverage

Finally, our outcome coverage needs to be numeric (0/1)
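A sketch; the level name is an assumption, so check dat.coverage.unique() and adjust:

```python
# 1 = has coverage, 0 = does not ("yes" is an assumed level name).
dat["coverage"] = (dat["coverage"] == "yes").astype(int)
```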

Re-split

We learned that we needed to make these changes by examining the training data. Now, let's re-split using the transformed data.
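Reusing the same seed as before puts the same observations back into the training and test sets:

```python
from sklearn.model_selection import train_test_split

train, test = train_test_split(dat, test_size=.25, random_state=123)

# Separate the predictors from the outcome.
y_train, X_train = train["coverage"], train.drop(columns="coverage")
y_test,  X_test  = test["coverage"],  test.drop(columns="coverage")
```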

Train Models

Cross Validation
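A sketch of the fold generator; 5 folds and the seed are assumptions:

```python
from sklearn.model_selection import KFold

# Shuffle so the folds aren't tied to row order.
fold_generator = KFold(n_splits=5, shuffle=True, random_state=111)
```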

Initialize Pipeline

Note that we still want to scale values, but we want to do this inside the pipeline since scaling utilizes information that should only be learned from the training data (e.g., min/max, mean, std).
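A sketch; the None step is a placeholder that the grid search below swaps out for each candidate model:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# The scaler is fit only on the training folds during CV, so no
# information leaks in from the held-out fold.
pipe = Pipeline(steps=[("pre_process", MinMaxScaler()),
                       ("model", None)])
```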

Select Models & Tuning Parameters

As before, the grid search to tune the models is pretty tame here so that the code will run quickly. It's often a good idea to explore more of the tuning-parameter space when running your own code.
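A deliberately small search space, as a sketch; the candidate models and grid values are assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

search_space = [
    {"model": [LogisticRegression(max_iter=1000)]},
    {"model": [KNeighborsClassifier()],
     "model__n_neighbors": [5, 10, 25]},
    {"model": [RandomForestClassifier(random_state=123)],
     "model__max_depth": [2, 3, 4],
     "model__n_estimators": [100, 500]},
]
```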

Run Models

Put it all together in a GridSearch
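Something along these lines, scoring on the area under the ROC curve:

```python
from sklearn.model_selection import GridSearchCV

# Wire the pipeline, candidate models, tuning grids, and CV folds
# into a single search.
search = GridSearchCV(pipe, search_space,
                      cv=fold_generator,
                      scoring="roc_auc",
                      n_jobs=-1)
```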

And Run
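Fitting on the training data:

```python
# Fit every model/parameter combination on the training data.
search.fit(X_train, y_train)
```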

Best ROC AUC.
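Pulled from the fitted search object:

```python
# Best cross-validated ROC AUC and the parameters that achieved it.
print(search.best_score_)
print(search.best_params_)
```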

Performance
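A sketch of the out-of-sample check, scoring the tuned model on the held-out test data:

```python
from sklearn.metrics import roc_auc_score

# Predicted probability of coverage for the test observations.
test_probs = search.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, test_probs))
```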

Model Interpretation

Permutation Importance

Permute the features to determine importance. Note here that I only do this 5 times for the sake of runtime, but you'd want to do this more (e.g., 25 times).
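A sketch using scikit-learn's implementation:

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature 5 times and record the drop in ROC AUC.
vi = permutation_importance(search.best_estimator_, X_train, y_train,
                            scoring="roc_auc",
                            n_repeats=5,
                            random_state=123)
```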

Organize the output as a data frame.
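For example:

```python
import pandas as pd

# One row per feature, most important first.
vi_dat = (pd.DataFrame({"feature": X_train.columns,
                        "importance": vi.importances_mean,
                        "std": vi.importances_std})
          .sort_values("importance", ascending=False))
```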

Visualize
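A simple bar chart of the permutation importances:

```python
import matplotlib.pyplot as plt

vi_dat.plot.barh(x="feature", y="importance", xerr="std", legend=False)
plt.xlabel("Mean drop in ROC AUC when permuted")
plt.tight_layout()
plt.show()
```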

Partial Dependence Plots
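A sketch; the feature names are placeholders, so substitute the top features from the permutation importance:

```python
from sklearn.inspection import PartialDependenceDisplay

# Average predicted probability of coverage as each feature varies.
PartialDependenceDisplay.from_estimator(search.best_estimator_, X_train,
                                        features=["age", "wage", "educ"])
```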

Interaction Partial Dependence Plots (2D)
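Passing a tuple of two features yields a 2D partial dependence surface, useful for spotting interactions (feature names again placeholders):

```python
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(search.best_estimator_, X_train,
                                        features=[("age", "wage")])
```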

ICE Plots
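A sketch; kind="both" overlays the individual conditional expectation curves (one per observation) on the average partial dependence line:

```python
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(search.best_estimator_, X_train,
                                        features=["wage"], kind="both")
```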

Global Surrogate Models

(1) Generate a vector of predictions (specifically, predicted probabilities since this is a classification problem).

(2) Fit the surrogate model on the predictions

(3) Examine the model fit ($R^2$)

(4) Plot the tree and interpret
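Putting the four steps together; a minimal sketch that uses a shallow decision tree as the surrogate (the depth is an assumption; tune it to trade off fidelity against readability):

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# (1) Predicted probabilities from the tuned black-box model.
probs = search.predict_proba(X_train)[:, 1]

# (2) Fit an interpretable tree to mimic those predictions.
surrogate = DecisionTreeRegressor(max_depth=3, random_state=123)
surrogate.fit(X_train, probs)

# (3) R^2: how faithfully the surrogate reproduces the predictions.
print(surrogate.score(X_train, probs))

# (4) Plot the tree and interpret the splits.
plt.figure(figsize=(12, 6))
plot_tree(surrogate, feature_names=list(X_train.columns), filled=True)
plt.show()
```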