class: center, middle, inverse, title-slide

# PPOL564 | Data Science 1 | Foundations <br> Week 13 <br> Interpretable Machine Learning

### Prof. Eric Dunford ◆ Georgetown University ◆ McCourt School of Public Policy ◆ eric.dunford@georgetown.edu
---
layout: true

<div class="slide-footer"><span> PPOL564 | Data Science 1 | Foundations &emsp; Week 13 <!-- Week of the Footer Here --> &emsp; Interpretable Machine Learning <!-- Title of the lecture here --> </span></div>

---

## Model Interpretation

- Knowing a model is predictive is _necessary_ but rarely _sufficient_ to draw **_substantive insights_**.

--

- In the social sciences, we are interested in understanding **_why_** certain features matter in an effort to detect potential **_interventions_**: if we change `\(X\)`, will we get a different outcome?

--

- Interpretability offers insights into the features the model **_relies on to make its prediction_**.

--

- In addition, interpretability is a useful debugging tool for **_detecting bias_** in machine learning models.

--

- The model needs to be a fairly **_good approximation of the data generating process_** (i.e. high predictive accuracy) for interpretation to matter.

---

## Variable Importance

![:space 2]

- Variable/feature importance is concerned with how much a given model **_relies on a set of variables/features to make accurate predictions_**.

- If those variables/features were removed, the model should **_perform worse_** (i.e. diminished predictive capacity).

- Determining variable importance helps with **_variable selection_**.

  - What variables could we drop from the model (because they contribute little information)?

  - What variables should we make sure to always measure and use in the model?

---

## Variable Importance

Consider the output from a simple multivariate OLS regression: which variables seem to matter most?

![:space 3]

```
## # A tibble: 4 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) -0.00776    0.0314    -0.247   0.805
## 2 x1           2.00       0.0318    62.9     0
## 3 x2          -0.00812    0.0312    -0.260   0.795
## 4 x3           0.0515     0.0321     1.60    0.109
```

--

![:space 3]

- `x1` is clearly both substantively and statistically significant. We should keep it in the model.

---

## Variable Importance

![:space 2]

- Some models offer a natural way of determining importance:

  - _Regression_: coefficient and test statistic size
  - _Trees_: split importance

![:space 2]

- But other models are more complicated (e.g. support vector machines, KNN, neural networks). We call these **_black box_** models because it's difficult to "peer inside" the model to understand how it works.

![:space 2]

- We need ways of determining variable importance that are **_model agnostic_** (i.e. that don't depend on the type of model you use).

---

### Permutation Importance

- Permutation importance offers a model-agnostic way to determine variable importance.

- The idea: **_scramble the data_** one variable at a time and see if the predictive performance of the model _decreases_.

--

- How it works:

  + **_Train_** a model
  + **_Permute_** (i.e. scramble the order of) a single variable/feature in the training data
  + Use the model to **_predict_** on the data with the permuted variable
  + See if there is a **_drop in predictive performance_**
  + **_Repeat_**

---

### The logic of permuting

![:space 5]

- Permuting a variable effectively **_breaks the statistical relationship_** between the outcome and the predictor.

- If a **_variable doesn't matter, then permuting it won't change the performance_** (the model doesn't rely on that variable to begin with).

- We must permute each variable **_multiple times_**, as permuting is a random process.

  + We want to ensure a specific importance ordering isn't the result of a single permutation.

- A code sketch of this procedure appears on the next slide.
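---

### Permutation Importance: a code sketch

- Below is a minimal sketch in R of the procedure just described, assuming a fitted model with a `predict()` method, a training data frame, and the name of the outcome column; RMSE is used as the performance metric for illustration. All object names (`fit`, `train`, `y`, `permutation_importance`) are placeholders, not part of the lecture code.

```r
# Permutation importance: average increase in error after scrambling each feature.
# `fit`, `data`, and `outcome` are hypothetical placeholders.
permutation_importance <- function(fit, data, outcome, n_permutations = 10) {
  rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
  baseline <- rmse(data[[outcome]], predict(fit, newdata = data))
  features <- setdiff(names(data), outcome)
  importance <- sapply(features, function(feature) {
    drops <- replicate(n_permutations, {
      permuted <- data
      permuted[[feature]] <- sample(permuted[[feature]])  # scramble one column
      rmse(data[[outcome]], predict(fit, newdata = permuted)) - baseline
    })
    mean(drops)  # average increase in error across permutations
  })
  sort(importance, decreasing = TRUE)
}

# Example usage with a linear model:
# fit <- lm(y ~ ., data = train)
# permutation_importance(fit, train, outcome = "y")
```

- Each feature gets one number: the average increase in error after permuting it. Larger values indicate the model relies on that feature more.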
---

## Partial Dependence Plots (PDP)

![:space 5]

- Variable importance cannot tell us how variables **_relate_** to the outcome.

- Partial dependence plots show the **_marginal effect_** one or two features have on the predicted outcome of the model.

- A partial dependence plot can show whether the **_relationship_** between the target and a feature is linear, monotonic, or more complex.

- The partial dependence plot is a **_global method_**: it considers all instances and gives a statement about the global relationship of a feature with the predicted outcome.

---

## Partial Dependence Plots (PDP)

![:space 5]

- The steps:

  + Train a model
  + Identify the features that matter most (feature importance)
  + Manipulate the values of those features (one at a time) and take the average prediction, holding all other features at their observed values
  + Plot the values and interpret the curve (a code sketch follows at the end of the deck)

---

## Individual Conditional Expectation Plots (ICE)

- Partial dependence offers a plot of the **_average marginal effect_**; however, the average can obscure a heterogeneous relationship created by **_interactions_**.

<img src="interpretable-ml_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Individual Conditional Expectation Plots (ICE)

- Partial dependence offers a plot of the **_average marginal effect_**; however, the average can obscure a heterogeneous relationship created by **_interactions_**.

![:space 3]

- ICE plots show the **_marginal effect for each observation_** in the data.

- We can observe whether there is **_divergence_** or **_convergence_** in the predicted effect across observations.

- The PDP is just the average taken across the different ICE curves (see the code sketches that follow).
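---

## Partial Dependence Plots (PDP): a code sketch

- A minimal sketch in R of the PDP recipe, under the same assumptions as before (a fitted model with a `predict()` method, a data frame, and a feature name). The grid of counterfactual values and the helper name are illustrative, not from the lecture code.

```r
# One-variable partial dependence: average prediction as a single feature is varied.
partial_dependence <- function(fit, data, feature, grid_size = 20) {
  grid <- seq(min(data[[feature]]), max(data[[feature]]), length.out = grid_size)
  avg_pred <- sapply(grid, function(value) {
    counterfactual <- data
    counterfactual[[feature]] <- value            # set the feature to one grid value for every row
    mean(predict(fit, newdata = counterfactual))  # average prediction over all observations
  })
  data.frame(feature_value = grid, avg_prediction = avg_pred)
}

# pdp_x1 <- partial_dependence(fit, train, "x1")
# plot(pdp_x1$feature_value, pdp_x1$avg_prediction, type = "l")
```

---

## Individual Conditional Expectation Plots (ICE): a code sketch

- The ICE version keeps one curve per observation instead of averaging; again, all object names are placeholders.

```r
# ICE curves: one row per observation, one column per grid value of the feature.
ice_curves <- function(fit, data, feature, grid_size = 20) {
  grid <- seq(min(data[[feature]]), max(data[[feature]]), length.out = grid_size)
  curves <- sapply(grid, function(value) {
    counterfactual <- data
    counterfactual[[feature]] <- value
    predict(fit, newdata = counterfactual)  # one prediction per observation
  })
  colnames(curves) <- round(grid, 2)
  curves  # colMeans(curves) recovers the PDP curve
}

# ice <- ice_curves(fit, train, "x1")
# matplot(t(ice), type = "l", lty = 1, col = "grey70")  # one line per observation
```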