class: center, middle, inverse, title-slide

# PPOL670 | Introduction to Data Science for Public Policy
### Week 11
### Applications in Supervised Learning
### Classification
### Prof. Eric Dunford ◆ Georgetown University ◆ McCourt School of Public Policy ◆ eric.dunford@georgetown.edu
---
layout: true

<div class="slide-footer"><span> PPOL670 | Introduction to Data Science for Public Policy           Week 11 <!-- Week of the Footer Here -->              Classification <!-- Title of the lecture here --> </span></div>

---
class: outline

# Outline for Today

<br>

- **Classification Problems**

- **Classification Performance Metrics**

- **Logistic Regression**

- **K Nearest Neighbors**

- **Classification Trees**

- **Support Vector Machines**

---
class: newsection

# Classification

---

<br>

.center[<img src = "Figures/seperability.gif", width = 500>]

---

### Decision Boundary

.center[<img src = "Figures/decision-boundary.png", width = 600>]

---
class: newsection

# Classification Performance Metrics

---

### How did we do?

- Our aim is to model the signal, not the noise. As we've seen, model over-fitting is a real problem, but re-sampling methods can offer us a way out.

- Central to any machine learning task is how we choose to define "good" performance.

--

- When dealing with quantitative outcomes (intervals), we can utilize metrics like MSE to assess performance.

`$$MSE = \frac{\sum^N_{i=1} (y_i - \hat{f}(X_i))^2}{N}$$`

---

### How did we do?

- Our aim is to model the signal, not the noise. As we've seen, model over-fitting is a real problem, but re-sampling methods can offer us a way out.

- Central to any machine learning task is how we choose to define "good" performance.

- When dealing with quantitative outcomes (intervals), we can utilize metrics like MSE to assess performance.

- When dealing with qualitative outcomes (categories), we need to rely on different metrics to assess performance.

<br>

`$$\text{Accuracy} = \frac{\text{Correctly Classified}}{\text{Total Possible}}$$`

`$$\text{Error} = 1 - \text{Accuracy}$$`

---

### The Weather Today

Suppose we were testing the accuracy of two weather persons. Below are their forecasts for the weather in a given week alongside the observed weather pattern. (For now, let's just focus on binary outcomes: sunny day or rainy day.)

.center[

|Weather Person | M | Tu | W | Th | F | St | Su |
|---------------|---|----|---|----|---|----|----|
| `\(WP_1\)` Prediction | Rain | Sun | Rain | Sun | Sun | Rain | Rain |
| `\(WP_2\)` Prediction | Sun | Sun | Sun | Sun | Sun | Sun | Sun |
| Actual | Sun | Sun | Rain | Sun | Sun | Sun | Sun |

]

--

.center[

|Weather Person | Correct | Total | Accuracy | Error |
|---------------|---------|-------|----------|-------|
| `\(WP_1\)` | 4 | 7 | 57.1% | 42.9% |
| `\(WP_2\)` | 6 | 7 | 85.7% | 14.3% |

]

If we calculate the accuracy for each, it looks as if Weather Person 2 is the more accurate. Does that make sense?

---

### The Weather Today

Suppose we were testing the accuracy of two weather persons. Below are their forecasts for the weather in a given week alongside the observed weather pattern. (For now, let's just focus on binary outcomes: sunny day or rainy day.)

.center[

|Weather Person | M | Tu | W | Th | F | St | Su |
|---------------|---|----|---|----|---|----|----|
| `\(WP_1\)` Prediction | Rain | Sun | Rain | Sun | Sun | Rain | Rain |
| `\(WP_2\)` Prediction | Sun | Sun | Sun | Sun | Sun | Sun | Sun |
| Actual | Sun | Sun | Rain | Sun | Sun | Sun | Sun |

]

.center[

|Weather Person | Correct | Total | Accuracy | Error |
|---------------|---------|-------|----------|-------|
| `\(WP_1\)` | 4 | 7 | 57.1% | 42.9% |
| `\(WP_2\)` | 6 | 7 | 85.7% | 14.3% |

]

Rain is **rare**. We can always achieve high accuracy if we just guess sun every day. This generates a problem if what people care about is when to pack an umbrella!
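---

### The Weather Today

To make the accuracy calculation concrete, here is a minimal R sketch. The `actual`, `wp1`, and `wp2` vectors are hypothetical encodings of the table above (1 = Sun, 0 = Rain):

```r
# Encode the week's weather: 1 = Sun, 0 = Rain
actual = c(1, 1, 0, 1, 1, 1, 1)
wp1    = c(0, 1, 0, 1, 1, 0, 0) # Weather Person 1's forecasts
wp2    = c(1, 1, 1, 1, 1, 1, 1) # Weather Person 2 always guesses sun

# Accuracy = proportion of days classified correctly
mean(wp1 == actual) # 0.571
mean(wp2 == actual) # 0.857
```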
---

### Confusion Matrix

<br>

.center[

| | `\(Positive_{~~\text{Actual}}\)` | `\(Negative_{~~\text{Actual}}\)` |
|-----------------------|----------|----------|
| `\(Positive_{~~\text{Predicted}}\)` | True Positive (TP) | False Positive (FP) |
| `\(Negative_{~~\text{Predicted}}\)` | False Negative (FN) | True Negative (TN) |

]

--

<br>

| Metric | Calculation | Description |
|---|-----| -----|
| Accuracy | `\(\frac{TP + TN}{TP+FP+TN+FN}\)` | Overall, how accurate is the model |
| Precision | `\(\frac{TP}{TP+FP}\)` | Of the observations classified as positive, how many are actually positive |
| Specificity | `\(\frac{ TN }{ TN + FP }\)` | Of the actual negatives, how many were correctly classified |
| Recall/Sensitivity | `\(\frac{TP}{ TP + FN}\)` | Of the actual positives, how many were correctly classified |

---

### Weather Person 1

<br>

.center[

| | `\(Positive_{~~\text{Actual}}\)` | `\(Negative_{~~\text{Actual}}\)` |
|-----------------------|----------|----------|
| `\(Positive_{~~\text{Predicted}}\)` | 3 | 0 |
| `\(Negative_{~~\text{Predicted}}\)` | 3 | 1 |

]

(Here "Sun" is treated as the positive class.)

<br>

- Accuracy = 57.1%
- Precision = 100%
- Specificity = 100%
- Recall = 50%

---

### Weather Person 2

<br>

.center[

| | `\(Positive_{~~\text{Actual}}\)` | `\(Negative_{~~\text{Actual}}\)` |
|-----------------------|----------|----------|
| `\(Positive_{~~\text{Predicted}}\)` | 6 | 1 |
| `\(Negative_{~~\text{Predicted}}\)` | 0 | 0 |

]

<br>

- Accuracy = 85.7%
- Precision = 85.7%
- Specificity = 0%
- Recall = 100%

---

### ROC Curves

Consider the following:

- We want to predict whether each day will be rainy (1) or sunny (0).

- Our model outputs the probability of a rainy day, where 0 means no chance and 1 means it is absolutely going to rain.

<br>

```r
# Our estimated probabilities
est_probs
```

```
## [1] 0.4 0.7 0.3 0.5 0.9 0.1 0.7
```

---

### ROC Curves

Consider the following:

- We need to convert these probabilities to predictions. We can do this by setting a **threshold**.

<br><br><br>

```r
threshold = .5
our_preds = as.numeric(est_probs >= threshold)
our_preds
```

```
## [1] 0 1 0 1 1 0 1
```

---

### ROC Curves

Consider the following:

- We can now compare these predictions to the actual values.

```r
table(our_preds, true_values)
```

```
##          true_values
## our_preds 0 1
##         0 2 1
##         1 1 3
```

- Thresholds reflect how sensitive we are to true or false positives.

  + The higher the threshold, the fewer false positives (but also fewer true positives).
  + The lower the threshold, the more false positives but also more true positives.
  + **It's another tradeoff!**

---

### ROC Curves

The receiver operating characteristic (ROC) curve offers a visual representation of model performance across different potential thresholds.

.center[<img src = "Figures/roc-plot.png", width=400>]

---
class: newsection

<br>

# Logistic Regression

---

## The problem with linear regression

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Logistic Regression

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

We need a function `\(F(\cdot)\)` (known as a **_link function_**) that maps our linear combination of independent variables ( `\(X\beta\)` ) onto a probability space (ranging from 0 to 1).

`$$F(X\beta) \mapsto [0,1]$$`

.center[<img src="Figures/lin-to-pr-space.gif", width=400 >]
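---

For the logit link, `\(F(\cdot)\)` is the logistic function `\(F(X\beta) = \frac{1}{1 + e^{-X\beta}}\)`. A minimal sketch of the mapping in R, where `xb` is a hypothetical grid of linear predictor values:

```r
# The logistic function squashes any real-valued Xb into (0, 1)
xb = seq(-5, 5, by = 0.1) # hypothetical linear predictor values
p  = 1 / (1 + exp(-xb))   # equivalently: plogis(xb)
range(p)                  # all values fall strictly between 0 and 1
```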
---

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

```r
head(training_data, 3)
```

```
## # A tibble: 3 x 3
##       y     x1      x2
##   <int>  <dbl>   <dbl>
## 1     0 -0.560  1.07
## 2     1 -0.230 -0.0273
## 3     1  1.56  -0.0333
```

```r
pairs(training_data, col="steelblue")
```

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

### Estimate

```r
# We can easily estimate these models in base R
mod = glm(y ~ x1 + x2, data=training_data,
          family=binomial(link = "logit"))
```

![:space 5]

### Predict

```r
preds = predict(mod, test_data, type = "response")
head(preds)
```

```
##         1         2         3         4         5         6 
## 0.5421829 0.4922359 0.4269535 0.5422376 0.4594173 0.3040995
```

---

### Estimate

```r
# We can easily estimate these models in base R
mod = glm(y ~ x1 + x2, data=training_data,
          family=binomial(link = "logit"))
```

![:space 5]

### Predict

```r
preds = predict(mod, test_data, type = "response")
table(preds > .5, test_data$y)
```

```
##        
##          0  1
##   FALSE 40 19
##   TRUE  13 29
```

---
class: newsection

<br>

# `\(K\)`-Nearest Neighbors

---

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

### KNN

<br>

.center[<img src = "Figures/knn.png", width = 700> ]

---

### KNN

- Non-parametric method that treats inputs as coordinate sets

- Classifies a new entry (test data) by its distance to existing entries (training data).

--

- Distance can be conceptualized in a number of ways. Euclidean distance is common:

`$$distance = \sqrt{\sum^p_{j=1}(x_{ij} - x_{0j})^2}$$`

--

- Classification occurs as a "_majority vote_" among the `\(K\)` nearest training points ( `\(\mathcal{N}_0\)` ):

`$$Pr(Y = j~|~X = x_{0}) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} I(y_{i} = j)$$`

--

- Poor performance in high dimensions

---

### `\(k\)` is a tuning parameter

<br>

.center[<img src = "Figures/low-high-k.png", width = 700> ]

---

### `\(k\)` is a tuning parameter

.center[<img src = "Figures/knn-overfitting.png", width = 700> ]
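---

### KNN in R

A minimal sketch using the `class` package's `knn()` function, reusing the hypothetical `training_data` and `test_data` from the logistic regression example. `k` is the tuning parameter we would select via cross-validation:

```r
library(class) # provides knn()

knn_preds = knn(train = training_data[, c("x1", "x2")],
                test  = test_data[, c("x1", "x2")],
                cl    = training_data$y, # training labels
                k     = 5)               # number of neighbors
table(knn_preds, test_data$y)
```

In practice, predictors should be put on a common scale first, since `knn()` relies on raw Euclidean distance.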
---
class: newsection

# Classification Trees

---

## Refresh on Regression Trees

- The goal is to find boxes that minimize the predictive error in our training data.

- **_Recursive Binary Splitting_**

  - **Top-down**: start with one region and break from there.
  - **Greedy**: the best split is made at each step, given the splits that have already been made.

- **_Tree Depth_**

  - Shallow trees (a few splits) can result in underfitting.
  - Deep trees (many splits) can result in overfitting.

---

### Classification Trees

<br>

- Categorical rather than continuous outcome

- Similar process to a regression tree.

- Predict the most commonly occurring class among the training observations in the region to which an observation belongs.

- Use the **_Gini Index_** as a measure of error:

`$$G = \sum^K_{k=1} \hat{p}_{mk} (1-\hat{p}_{mk})$$`

- The Gini index gets small when all the `\(\hat{p}_{mk}\)` are close to zero or one ("node purity").

---

<br><br><br><br>

.center[ <img src = "Figures/classification-tree-01.png", width = 1000> ]

---

### Reminder: Regression vs. Trees

.center[ <img src = "Figures/reg-v-trees.png", width = 600> ]

---
class: newsection

# Support Vector Machines

---

### Let's Build a Wall

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

### Separating Hyperplane

`$$\beta_0 + x_{1i}\beta_1 + \dots + x_{pi}\beta_p > 0, \text{ if } y_i = 1$$`

`$$\beta_0 + x_{1i}\beta_1 + \dots + x_{pi}\beta_p < 0, \text{ if } y_i = -1$$`

Equivalently, `\(y_i(\beta_0 + x_{1i}\beta_1 + \dots + x_{pi}\beta_p) > 0\)` for every observation.

.center[ <img src = "Figures/hyperplane.png", width = 1000> ]

---

### Maximal Margin Hyperplane

.center[ <img src = "Figures/mm_hyperplane.png", width = 550> ]

---

### Non-separable

.center[ <img src = "Figures/nonseparable.png", width = 550> ]

---

### Support Vector Classifier

![:space 10]

.center[ <img src = "Figures/svc_01.png", width = 1000> ]

---

### Support Vector Classifier

**_Aim_**: maximize the margin that separates most of the training observations, while allowing a few observations to be misclassified.

`$$\max_{\beta, \epsilon}~M~\text{ subject to } \sum_{j=1}^p \beta_j^2 = 1$$`

`$$y_i(\beta_0 + x_{1i}\beta_1 + \dots + x_{pi}\beta_p) \ge M (1-\epsilon_i)$$`

`$$\epsilon_i \ge 0, ~\sum_{i=1}^n \epsilon_i \le C$$`

Where `\(C\)` is a nonnegative tuning parameter.

- `\(C\)` dictates how many individual observations can be on the wrong side of the margin.

- Small `\(C\)`: few violations tolerated, narrow margin (low bias, high variance). Large `\(C\)`: many violations tolerated, wide margin (high bias, low variance). **It's the bias-variance tradeoff again!**

---

### Tuning `\(C\)`

.center[ <img src = "Figures/tune_c.png", width = 530> ]

---

### Dealing with Non-Linear Boundaries

![:space 5]

.center[ <img src = "Figures/nonlinear_boundary.png", width = 1000> ]

---

### Support Vector Machine

Use a (polynomial, radial) _kernel_ to generate a non-linear decision boundary.

<br>

.center[ <img src = "Figures/svm.png", width = 1000> ]
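---

### SVMs in R

A minimal sketch using the `e1071` package (one common `svm()` implementation), again reusing the hypothetical `training_data` and `test_data` from earlier slides. Note that `e1071` parameterizes the tradeoff with a `cost` argument that penalizes margin violations, so it moves inversely to the budget `\(C\)` above:

```r
library(e1071) # provides svm()

svm_fit = svm(factor(y) ~ x1 + x2,  # factor outcome -> classification
              data   = training_data,
              kernel = "radial",    # radial basis kernel
              cost   = 1)           # tune via cross-validation
svm_preds = predict(svm_fit, test_data)
table(svm_preds, test_data$y)
```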