class: center, middle, inverse, title-slide

# PPOL670 | Introduction to Data Science for Public Policy
### Week 11
### Applications in Supervised Learning
### Classification
### Prof. Eric Dunford ◆ Georgetown University ◆ McCourt School of Public Policy ◆ eric.dunford@georgetown.edu
---
layout: true

<div class="slide-footer"><span> PPOL670 | Introduction to Data Science for Public Policy           Week 11 <!-- Week of the Footer Here -->              Classification <!-- Title of the lecture here --> </span></div>

---
class: outline

# Outline for Today

<br>

- **Classification Problems**

- **Classification Performance Metrics**

- **Logistic Regression**

- **K Nearest Neighbors**

- **Classification Trees**

- **Support Vector Machines**

---
class: newsection

# Classification

---

<br>

.center[<img src = "Figures/seperability.gif", width = 500>]

---

### Decision Boundary

.center[<img src = "Figures/decision-boundary.png", width = 600>]

---
class: newsection

# Classification Performance Metrics

---

### How did we do?

- Our aim is to model the signal, not the noise. As we've seen, model over-fitting is a real problem, but re-sampling methods can offer us a way out.

- Central to any machine learning task is how we choose to define "good" performance.

--

- When dealing with quantitative outcomes (intervals), we can utilize metrics like MSE to assess performance.

`$$MSE = \frac{\sum^N_{i=1} (y_i - \hat{f}(X_i))^2}{N}$$`

---

### How did we do?

- Our aim is to model the signal, not the noise. As we've seen, model over-fitting is a real problem, but re-sampling methods can offer us a way out.

- Central to any machine learning task is how we choose to define "good" performance.

- When dealing with quantitative outcomes (intervals), we can utilize metrics like MSE to assess performance.

- When dealing with qualitative outcomes (categories), we need to rely on different metrics to assess performance.

<br>

`$$\text{Accuracy} = \frac{\text{Correctly Classified}}{\text{Total Possible}}$$`

`$$\text{Error} = 1 - \text{Accuracy}$$`

---

### The Weather Today

Suppose we were testing the accuracy of two weather persons. Below are their forecasts for the weather in a given week alongside the observed weather pattern. (For now, let's just focus on binary outcomes: sunny day or rainy day.)

.center[

|Weather Person | M | Tu | W | Th | F | St | Su |
|---------------|---|----|---|----|---|----|----|
| `\(WP_1\)` Prediction | Rain | Sun | Rain | Sun | Sun | Rain | Rain |
| `\(WP_2\)` Prediction | Sun | Sun | Sun | Sun | Sun | Sun | Sun |
| Actual | Sun | Sun | Rain | Sun | Sun | Sun | Sun |

]

--

.center[

|Weather Person | Correct | Total | Accuracy | Error |
|---------------|---------|-------|----------|-------|
| `\(WP_1\)` | 4 | 7 | 57.1% | 42.9% |
| `\(WP_2\)` | 6 | 7 | 85.7% | 14.3% |

]

If we calculate the accuracy for each, it looks as if Weather Person 2 is the more accurate. Does that make sense?

---

### The Weather Today

Suppose we were testing the accuracy of two weather persons. Below are their forecasts for the weather in a given week alongside the observed weather pattern. (For now, let's just focus on binary outcomes: sunny day or rainy day.)

.center[

|Weather Person | M | Tu | W | Th | F | St | Su |
|---------------|---|----|---|----|---|----|----|
| `\(WP_1\)` Prediction | Rain | Sun | Rain | Sun | Sun | Rain | Rain |
| `\(WP_2\)` Prediction | Sun | Sun | Sun | Sun | Sun | Sun | Sun |
| Actual | Sun | Sun | Rain | Sun | Sun | Sun | Sun |

]

.center[

|Weather Person | Correct | Total | Accuracy | Error |
|---------------|---------|-------|----------|-------|
| `\(WP_1\)` | 4 | 7 | 57.1% | 42.9% |
| `\(WP_2\)` | 6 | 7 | 85.7% | 14.3% |

]

Rain is **rare**. We can always achieve high accuracy if we just guess sun every day. This generates a problem if what people care about is when to pack an umbrella!
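---

### The Weather Today

To make the accuracy calculation concrete, here is a minimal R sketch. The `actual`, `wp1`, and `wp2` vectors are hypothetical encodings of the table above (1 = Sun, 0 = Rain):

```r
# Encode the week's weather: 1 = Sun, 0 = Rain
actual = c(1, 1, 0, 1, 1, 1, 1)
wp1    = c(0, 1, 0, 1, 1, 0, 0) # Weather Person 1's forecasts
wp2    = c(1, 1, 1, 1, 1, 1, 1) # Weather Person 2 always guesses sun

# Accuracy = proportion of days classified correctly
mean(wp1 == actual) # 0.571
mean(wp2 == actual) # 0.857
```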
---

### Confusion Matrix

<br>

.center[

| | `\(Positive_{~~\text{Actual}}\)` | `\(Negative_{~~\text{Actual}}\)` |
|-----------------------|----------|----------|
| `\(Positive_{~~\text{Predicted}}\)` | True Positive (TP) | False Positive (FP) |
| `\(Negative_{~~\text{Predicted}}\)` | False Negative (FN) | True Negative (TN) |

]

--

<br>

| Metric | Calculation | Description |
|---|-----| -----|
| Accuracy | `\(\frac{TP + TN}{TP+FP+TN+FN}\)` | Overall, how accurate is the model |
| Precision | `\(\frac{TP}{TP+FP}\)` | Of the observations classified as positive, how many are actually positive |
| Specificity | `\(\frac{ TN }{ TN + FP }\)` | Of the actual negatives, how many were correctly classified |
| Recall/Sensitivity | `\(\frac{TP}{ TP + FN}\)` | Of the actual positives, how many were correctly classified |

---

### Weather Person 1

<br>

.center[

| | `\(Positive_{~~\text{Actual}}\)` | `\(Negative_{~~\text{Actual}}\)` |
|-----------------------|----------|----------|
| `\(Positive_{~~\text{Predicted}}\)` | 3 | 0 |
| `\(Negative_{~~\text{Predicted}}\)` | 3 | 1 |

]

(Here "Sun" is treated as the positive class.)

<br>

- Accuracy = 57.1%
- Precision = 100%
- Specificity = 100%
- Recall = 50%

---

### Weather Person 2

<br>

.center[

| | `\(Positive_{~~\text{Actual}}\)` | `\(Negative_{~~\text{Actual}}\)` |
|-----------------------|----------|----------|
| `\(Positive_{~~\text{Predicted}}\)` | 6 | 1 |
| `\(Negative_{~~\text{Predicted}}\)` | 0 | 0 |

]

<br>

- Accuracy = 85.7%
- Precision = 85.7%
- Specificity = 0%
- Recall = 100%

---

### ROC Curves

Consider the following:

- We want to predict whether each day will be rainy (1) or sunny (0).

- Our model outputs the probability of a rainy day, where 0 means no chance and 1 means it is absolutely going to rain.

<br>

```r
# Our estimated probabilities
est_probs
```

```
## [1] 0.4 0.7 0.3 0.5 0.9 0.1 0.7
```

---

### ROC Curves

Consider the following:

- We need to convert these probabilities to predictions. We can do this by setting a **threshold**.

<br><br><br>

```r
threshold = .5
our_preds = as.numeric(est_probs >= threshold)
our_preds
```

```
## [1] 0 1 0 1 1 0 1
```

---

### ROC Curves

Consider the following:

- We can now compare these predictions to the actual values.

```r
table(our_preds, true_values)
```

```
##          true_values
## our_preds 0 1
##         0 2 1
##         1 1 3
```

- Thresholds reflect how sensitive we are to true or false positives.

  + The higher the threshold, the fewer false positives (but also fewer true positives).
  + The lower the threshold, the more false positives but also more true positives.
  + **It's another tradeoff!**

---

### ROC Curves

The receiver operating characteristic (ROC) curve offers a visual representation of model performance across different potential thresholds.

.center[<img src = "Figures/roc-plot.png", width=400>]

---
class: newsection

<br>

# Logistic Regression

---

## The problem with linear regression

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Logistic Regression

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

We need a function `\(F(\cdot)\)` (known as a **_link function_**) that maps our linear combination of independent variables ( `\(X\beta\)` ) onto a probability space (ranging from 0 to 1).

`$$F(X\beta) \mapsto [0,1]$$`

.center[<img src="Figures/lin-to-pr-space.gif", width=400 >]
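---

For the logit link, `\(F(\cdot)\)` is the logistic function `\(F(X\beta) = \frac{1}{1 + e^{-X\beta}}\)`. A minimal sketch of the mapping in R, where `xb` is a hypothetical grid of linear predictor values:

```r
# The logistic function squashes any real-valued Xb into (0, 1)
xb = seq(-5, 5, by = 0.1) # hypothetical linear predictor values
p  = 1 / (1 + exp(-xb))   # equivalently: plogis(xb)
range(p)                  # all values fall strictly between 0 and 1
```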
---

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

```r
head(training_data, 3)
```

```
## # A tibble: 3 x 3
##       y     x1      x2
##   <int>  <dbl>   <dbl>
## 1     0 -0.560  1.07
## 2     1 -0.230 -0.0273
## 3     1  1.56  -0.0333
```

```r
pairs(training_data, col="steelblue")
```

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

### Estimate

```r
# We can easily estimate these models in base R
mod = glm(y ~ x1 + x2, data=training_data,
          family=binomial(link = "logit"))
```

![:space 5]

### Predict

```r
preds = predict(mod, test_data, type = "response")
head(preds)
```

```
##         1         2         3         4         5         6 
## 0.5421829 0.4922359 0.4269535 0.5422376 0.4594173 0.3040995
```

---

### Estimate

```r
# We can easily estimate these models in base R
mod = glm(y ~ x1 + x2, data=training_data,
          family=binomial(link = "logit"))
```

![:space 5]

### Predict

```r
preds = predict(mod, test_data, type = "response")
table(preds > .5, test_data$y)
```

```
##        
##          0  1
##   FALSE 40 19
##   TRUE  13 29
```

---
class: newsection

<br>

# `\(K\)`-Nearest Neighbors

---

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

### KNN

<br>

.center[<img src = "Figures/knn.png", width = 700> ]

---

### KNN

- Non-parametric method that treats inputs as coordinate sets

- Classifies a new entry (test data) by its distance to existing entries (training data).

--

- Distance can be conceptualized in a number of ways. Euclidean distance is common:

`$$distance = \sqrt{\sum^p_{j=1}(x_{ij} - x_{0j})^2}$$`

--

- Classification occurs as a "_majority vote_" among the `\(K\)` nearest training points ( `\(\mathcal{N}_0\)` ):

`$$Pr(Y = j~|~X = x_{0}) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} I(y_{i} = j)$$`

--

- Poor performance in high dimensions

---

### `\(k\)` is a tuning parameter

<br>

.center[<img src = "Figures/low-high-k.png", width = 700> ]

---

### `\(k\)` is a tuning parameter

.center[<img src = "Figures/knn-overfitting.png", width = 700> ]
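---

### KNN in R

A minimal sketch using the `class` package's `knn()` function, reusing the hypothetical `training_data` and `test_data` from the logistic regression example. `k` is the tuning parameter we would select via cross-validation:

```r
library(class) # provides knn()

knn_preds = knn(train = training_data[, c("x1", "x2")],
                test  = test_data[, c("x1", "x2")],
                cl    = training_data$y, # training labels
                k     = 5)               # number of neighbors
table(knn_preds, test_data$y)
```

In practice, predictors should be put on a common scale first, since `knn()` relies on raw Euclidean distance.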
---
class: newsection

# Classification Trees

---

## Refresh on Regression Trees

- The goal is to find boxes that minimize the predictive error in our training data.

- **_Recursive Binary Splitting_**

  - **Top-down**: start with one region and break from there.
  - **Greedy**: the best split is made at each step, given the splits that have already been made.

- **_Tree Depth_**

  - Shallow trees (a few splits) can result in underfitting.
  - Deep trees (many splits) can result in overfitting.

---

### Classification Trees

<br>

- Categorical rather than continuous outcome

- Similar process to a regression tree.

- Predict the most commonly occurring class among the training observations in the region to which an observation belongs.

- Use the **_Gini Index_** as a measure of error:

`$$G = \sum^K_{k=1} \hat{p}_{mk} (1-\hat{p}_{mk})$$`

- The Gini index gets small when all the `\(\hat{p}_{mk}\)` are close to zero or one ("node purity").

---

<br><br><br><br>

.center[ <img src = "Figures/classification-tree-01.png", width = 1000> ]

---

### Reminder: Regression vs. Trees

.center[ <img src = "Figures/reg-v-trees.png", width = 600> ]

---
class: newsection

# Support Vector Machines

---

### Let's Build a Wall

<img src="lecture-week-11-applications-supervised-ml-classification-ppol670_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

### Separating Hyperplane

`$$\beta_0 + x_{1i}\beta_1 + \dots + x_{pi}\beta_p > 0, \text{ if } y_i = 1$$`

`$$\beta_0 + x_{1i}\beta_1 + \dots + x_{pi}\beta_p < 0, \text{ if } y_i = -1$$`

Equivalently, `\(y_i(\beta_0 + x_{1i}\beta_1 + \dots + x_{pi}\beta_p) > 0\)` for every observation.

.center[ <img src = "Figures/hyperplane.png", width = 1000> ]

---

### Maximal Margin Hyperplane

.center[ <img src = "Figures/mm_hyperplane.png", width = 550> ]

---

### Non-separable

.center[ <img src = "Figures/nonseparable.png", width = 550> ]

---

### Support Vector Classifier

![:space 10]

.center[ <img src = "Figures/svc_01.png", width = 1000> ]

---

### Support Vector Classifier

**_Aim_**: maximize the margin that separates most of the training observations, while allowing a few observations to be misclassified.

`$$\max_{\beta, \epsilon}~M~\text{ subject to } \sum_{j=1}^p \beta_j^2 = 1$$`

`$$y_i(\beta_0 + x_{1i}\beta_1 + \dots + x_{pi}\beta_p) \ge M (1-\epsilon_i)$$`

`$$\epsilon_i \ge 0, ~\sum_{i=1}^n \epsilon_i \le C$$`

Where `\(C\)` is a nonnegative tuning parameter.

- `\(C\)` dictates how many individual observations can be on the wrong side of the margin.

- Small `\(C\)`: few violations tolerated, narrow margin (low bias, high variance). Large `\(C\)`: many violations tolerated, wide margin (high bias, low variance). **It's the bias-variance tradeoff again!**

---

### Tuning `\(C\)`

.center[ <img src = "Figures/tune_c.png", width = 530> ]

---

### Dealing with Non-Linear Boundaries

![:space 5]

.center[ <img src = "Figures/nonlinear_boundary.png", width = 1000> ]

---

### Support Vector Machine

Use a (polynomial, radial) _kernel_ to generate a non-linear decision boundary.

<br>

.center[ <img src = "Figures/svm.png", width = 1000> ]
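---

### SVMs in R

A minimal sketch using the `e1071` package (one common `svm()` implementation), again reusing the hypothetical `training_data` and `test_data` from earlier slides. Note that `e1071` parameterizes the tradeoff with a `cost` argument that penalizes margin violations, so it moves inversely to the budget `\(C\)` above:

```r
library(e1071) # provides svm()

svm_fit = svm(factor(y) ~ x1 + x2,  # factor outcome -> classification
              data   = training_data,
              kernel = "radial",    # radial basis kernel
              cost   = 1)           # tune via cross-validation
svm_preds = predict(svm_fit, test_data)
table(svm_preds, test_data$y)
```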