class: center, middle, inverse, title-slide

# PPOL670 | Introduction to Data Science for Public Policy

## Week 10 <br> Applications in Supervised Learning <br> Regression

### Prof. Eric Dunford ◆ Georgetown University ◆ McCourt School of Public Policy ◆ eric.dunford@georgetown.edu
---
layout: true

<div class="slide-footer"><span> PPOL670 | Introduction to Data Science for Public Policy           Week 10 <!-- Week of the Footer Here -->              Regression <!-- Title of the lecture here --> </span></div>

---
class: outline

# Outline for Today

<br>

- **Linear Models**

<br>

- **Classification and Regression Trees**

<br>

- **Bagging and Random Forests**

---
class: newsection

# Regression Methods

---

### Supervised Learning

The aim of supervised machine learning is to model the relationship between an outcome and some set of features.

`$$y = f(X) + \epsilon$$`

where

- `\(y\)` is the outcome/dependent/response variable
- `\(X\)` is a matrix of predictors/features/independent variables
- `\(f\)` is some fixed but unknown function mapping `\(X\)` to `\(y\)`. The "signal" in the data.
- `\(\epsilon\)` is some random error term. The "noise" in the data.

---

### Regression vs. Classification

- **Quantitative** outcome (Today)

    + a continuous/interval-based outcome: e.g. housing prices, number of bills passed, stock market prices, etc.
    + Regression: linear models, generalized linear models, penalization, generalized additive models (GAMs)
    + Both parametric and non-parametric ways of approximating `\(f(\cdot)\)`

- **Qualitative** outcome (Next Week)

    + a discrete outcome: binary, ordered, categorical
    + Classification: logistic regression, naive Bayes, support vector machines, neural networks

- Some methods can be used on either outcome type

---

### Selecting Models

In machine learning, we'll select different algorithms to approximate the "true" functional form `\(f(X)\)`

<br>

`$$f(X) \approx \hat{f}(X)$$`

`$$\hat{y} = \hat{f}(X)$$`

<br>

We'll evaluate a model's predictive performance on out-of-sample data (using cross-validation). Our aim is to find the most predictive model given the available features.

`$$e_i = y_i - \hat{y}_i$$`

---

## Linear Models

A linear approximation of a continuous outcome:

`$$y_i = \beta_0 + \beta_1x_{1i} + \epsilon_i$$`

--

`\(\beta_0\)` and `\(\beta_1\)` are **_unknown_** constants that represent the _intercept_ and _slope_, which we call the model's **_parameters_**. We learn the parameters from the data.

--

![:space 5]

Once we've estimated the parameters of the model, we can predict future values by plugging in new data.

`$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{1i}^{new}$$`
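---

## Linear Models in R

In R, `lm()` estimates `\(\hat{\beta}_0\)` and `\(\hat{\beta}_1\)` from the data and `predict()` applies them to new inputs. Below is a minimal sketch on simulated data — the data and variable names (`toy`, `x1`) are hypothetical, purely for illustration.

```r
library(tidyverse)
set.seed(123)

# hypothetical data where the true relationship is y = 2 + 3*x1 + noise
toy <- tibble(x1 = runif(100, 0, 10),
              y  = 2 + 3*x1 + rnorm(100, sd = 2))

# learn the parameters (intercept and slope) from the data
mod <- lm(y ~ x1, data = toy)
coef(mod) # estimates of beta_0 and beta_1

# predict y-hat for new values of x1
predict(mod, newdata = tibble(x1 = c(2.5, 7.5)))
```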
---

## Linear Models

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

## Linear Models

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Linear Models

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Linear Models

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

### Minimizing the Residual Sum of Squared Errors

![:space 10]

.center[<img src = "Figures/min_rss.png", width = 800> ]

---

### Assessing Fit

**Error**

`$$RSS = e_1^2 + e_2^2 + \dots + e_n^2 = \sum^N_{i=1} (y_i - \hat{\beta}_0 - \hat{\beta}_1x_{1i})^2$$`

<br>

`$$MSE = \frac{\sum^N_{i=1} (y_i - \hat{y}_i)^2}{N}$$`

**Explained Variation**

`$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$`

---

# Multiple Linear Regression

`$$y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \dots + \beta_px_{pi} + \epsilon_i$$`

.center[<img src = "Figures/multiple_regression.png", width = 500> ]

---

### High Bias, Low Variance

`$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$`

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

### Polynomial Regression: building in _flexibility_

`$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2$$`

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

### Polynomial Regression: building in _flexibility_

`$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \hat{\beta}_3 x^3$$`

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

### Polynomial Regression: building in _flexibility_

`$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \hat{\beta}_3 x^3 + \hat{\beta}_4 z + \hat{\beta}_5 xz$$`

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

### Low Bias, High Variance

`$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 z + \hat{\beta}_3 xz + \sum_{k=2}^7 \hat{\beta}_{k+2} x^k + \sum_{k=2}^7 \hat{\beta}_{k+8} x^kz$$`

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

### Parametric vs. Non-parametric

- Linear regression is an example of a **_parametric_** model, which is to say it assumes a _linear functional form_ for `\(f(X)\)`.

- Parametric methods have their **_advantages_**:

    + Easy to fit
    + Easy to interpret
    + Can use statistical methods (to calculate uncertainty and test significance)

- Parametric methods have their **_disadvantages_** as well:

    + Strong assumptions about the form of `\(f(X)\)`
    + Biased → poor predictive performance
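---

### Polynomial Regression in R

Before turning to non-parametric methods, note that the polynomial fits above are still parametric and can be estimated with `lm()`. A minimal sketch on simulated data (the data here are made up for illustration), along with the in-sample MSE and `\(R^2\)` from the "Assessing Fit" slide:

```r
library(tidyverse)
set.seed(123)

# hypothetical data with a curved relationship between x and y
toy <- tibble(x = runif(200, -3, 3),
              y = 1 + 2*x - 1.5*x^2 + rnorm(200))

linear_fit <- lm(y ~ x, data = toy)                       # high bias, low variance
cubic_fit  <- lm(y ~ poly(x, 3, raw = TRUE), data = toy)  # more flexible

# in-sample fit for each specification
mean(residuals(linear_fit)^2)   # MSE, linear
mean(residuals(cubic_fit)^2)    # MSE, cubic
summary(linear_fit)$r.squared   # R^2, linear
summary(cubic_fit)$r.squared    # R^2, cubic
```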
---

### Parametric vs. Non-parametric

- **_Non-parametric_** methods do not explicitly assume a parametric form for `\(f(X)\)`.

- Provides a more flexible approach for performing regression.

- **_Advantages_**

    - Makes no assumptions about the form of `\(f(X)\)`
    - Low bias → better predictive performance

- **_Disadvantages_**

    - Difficult to interpret
    - High variance → new data, new fit

---

### K-Nearest Neighbors

One of the simplest non-parametric methods is **_K-nearest neighbors regression_**.

![:space 3]

Given a value of `\(K\)` and a prediction point `\(x_0\)`, KNN regression:

1. Identifies the `\(K\)` training observations that are closest to `\(x_0\)`

    - Denoted by the set `\(N_0\)`

2. Estimates `\(f(x_0)\)` using the _average of all the training responses_ in `\(N_0\)`.

`$$\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in N_0} y_i$$`

---

### K-Nearest Neighbors

![:space 10]

.center[<img src = "Figures/knn-regression.png", width = 800> ]

---

### K-Nearest Neighbors

<br>

- `\(K\)`, the number of neighbors we select, is a **tuning parameter**

    - We can't learn it from the data
    - We select it before we run anything
    - It determines how flexible the model is (the bias-variance tradeoff)
        - low `\(K\)` → low bias, high variance
        - high `\(K\)` → high bias, low variance

- We can find a reasonable value for `\(K\)` using **_cross-validation_**

---

### K-Nearest Neighbors

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

### K-Nearest Neighbors

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

### K-Nearest Neighbors

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

### K-Nearest Neighbors

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

### K-Nearest Neighbors

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

### To parametric or not to parametric, that is the question...

> The parametric approach (like linear regression) will outperform the non-parametric approach (like KNN regression) if the parametric form that has been selected is close to the "true form" of `\(f(X)\)`.

.center[<img src = "Figures/lm_v_knn.png", width = 600> ]
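---

### K-Nearest Neighbors in R

Before moving to regression trees, here is a minimal hand-rolled sketch of KNN regression with a single predictor, just to make the averaging in `\(\hat{f}(x_0)\)` concrete. The data, the `knn_predict()` helper, and the values of `\(K\)` are all hypothetical; in practice a package (e.g. `FNN` or `caret`) would handle this.

```r
library(tidyverse)
set.seed(123)

# hypothetical training data with a nonlinear signal
train <- tibble(x = runif(200, 0, 10),
                y = sin(x) + rnorm(200, sd = .3))

# KNN regression by hand: average the K closest training responses
knn_predict <- function(x0, train, K = 5){
  train %>%
    mutate(dist = abs(x - x0)) %>%  # distance to the prediction point
    slice_min(dist, n = K) %>%      # the K nearest neighbors (N_0)
    summarize(pred = mean(y)) %>%   # average their responses
    pull(pred)
}

knn_predict(x0 = 2, train, K = 5)
knn_predict(x0 = 2, train, K = 50) # higher K -> smoother fit (more bias, less variance)
```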
---
class: newsection

# Regression Trees

---

.center[<img src = "Figures/cart-full.png", width = 600> ]

---

### Hitting Averages and Experience on Salary

<br>

.pull-left[
.center[<img src = "Figures/regression-tree-01.png", width = 700> ]
]

.pull-right[
.center[<img src = "Figures/regression-tree-02.png", width = 700> ]
]

---

<br>

The goal is to find boxes ( `\(R_1, \dots, R_J\)` ) that minimize the RSS, given by

`$$\sum^J_{j=1} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$$`

where `\(\hat{y}_{R_j}\)` is the mean response for the training observations within the `\(j\)`th box.

--

We take a top-down, greedy approach known as **_recursive binary splitting_**:

- **Top-down**: start with one region and split from there.

- **Greedy**: the best split is made at each step (the best split given the splits that have already been made).

---

### The essence of recursive binary splitting

- (1) From the predictors `\(X\)`, select a predictor `\(X_j\)`.

--

- (2) Find a cutpoint ( `\(s\)` ) that splits `\(X_j\)` into two regions in a way that yields the greatest possible reduction in RSS:

`$$R_1(j,s) = \{X~|~X_j <s\}~\text{and}~ R_2 (j,s) = \{X~|~X_j \ge s\}$$`

--

- (3) That is, look for the `\(j\)` and `\(s\)` that minimize

`$$\sum_{i: ~x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: ~x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2$$`

where `\(\hat{y}_{R_1}\)` and `\(\hat{y}_{R_2}\)` are the mean responses for the training data in region 1 ( `\(R_1(j,s)\)` ) and region 2 ( `\(R_2(j,s)\)` ).

--

- (4) Repeat the process until a stopping criterion is met.

---

### Tree Pruning

.center[<img src = "Figures/regression-tree-03.png", width = 600> ]

---

### Tree Pruning

- Shallow trees (a few splits) can result in underfitting.

- Deep trees (many splits) can result in overfitting.

--

<br>

We balance the two by penalizing depth using a "**_complexity criterion_**" ( `\(\alpha\)` ):

`$$\sum^T_{m=1}\sum_{i:x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha T$$`

- `\(T\)` indicates the number of terminal nodes,

- `\(R_m\)` is the rectangle (i.e. the subset of predictor space) corresponding to the `\(m\)`th terminal node,

- `\(\hat{y}_{R_m}\)` is the predicted response associated with `\(R_m\)`.

---

### Tree Pruning as a tuning parameter

.pull-left[

The tuning parameter `\(\alpha\)` controls a trade-off between the subtree's complexity and its fit to the training data.

<br>

- `\(\alpha \rightarrow 0\)` means a deeper tree

- a larger `\(\alpha\)` means a shallower tree

<br>

We need to use cross-validation to figure out the right value of `\(\alpha\)`.

]

.pull-right[

<br><br><br>

.center[<img src = "Figures/tree-complexity.png", width = 500, height =300> ]

]

---

### Regression vs. Trees

.center[ <img src = "Figures/reg-v-trees.png", width = 600> ]

---

### Pros and Cons of Trees

Pros:

- Easy to explain/visualize
- Easy to handle qualitative predictors
- Can deal well with data inconsistencies

Cons:

- Lower predictive accuracy
- Suffer from high variance
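---

### Regression Trees in R

The `rpart` package implements recursive binary splitting, and its complexity parameter `cp` plays the role of `\(\alpha\)` above. A minimal sketch — the use of the `Hitters` data from the `ISLR` package and the particular `cp` values are assumptions made purely for illustration:

```r
library(rpart)
library(ISLR)   # Hitters data, used here only as an illustrative example

# grow a deep regression tree ...
tree_fit <- rpart(log(Salary) ~ Years + Hits, data = Hitters,
                  method = "anova",                    # regression tree
                  control = rpart.control(cp = 0.001))

# ... then prune it back; cp acts like the complexity criterion (alpha)
pruned <- prune(tree_fit, cp = 0.05)
pruned

printcp(tree_fit)  # cross-validated error across values of cp
```

Just as cross-validation guides the choice of `\(\alpha\)`, the cross-validated error reported by `printcp()` can guide the choice of `cp`.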
---
class: newsection

## Bagging & Random Forest

---

### Bagging

**Bootstrap aggregation**, or **bagging**, is a general-purpose procedure for reducing the variance of a statistical learning method.

--

The idea:

- take many training sets from the data
- build a separate tree on each training set
- average the predictions

--

.center[ <img src = "Figures/bagging.png", width = 600> ]

---

### What is bootstrapping?

.center[ <img src = "Figures/bootstrap.png", width = 600> ]

---

Say we have two groups (people on a bus) and we want to know whether their average ages are statistically different.

```r
bus1 <- tibble(age=c(5,20,21,14,17,23,26))
bus2 <- tibble(age=c(23,28,37,51,18,30,48))
```

--

We can just run a difference in means test.

```r
mean(bus1$age)-mean(bus2$age)
```

```
## [1] -15.57143
```

```r
t.test(bus1$age,bus2$age)
```

```
## 
##  Welch Two Sample t-test
## 
## data:  bus1$age and bus2$age
## t = -2.9014, df = 9.416, p-value = 0.01677
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -27.630735  -3.512122
## sample estimates:
## mean of x mean of y 
##  18.00000  33.57143
```

---

Say we have two groups (people on a bus) and we want to know whether their average ages are statistically different.

```r
bus1 <- tibble(age=c(5,20,21,14,17,23,26))
bus2 <- tibble(age=c(23,28,37,51,18,30,48))
```

Or we could bootstrap. The key is sampling _with replacement_.

.pull-left[
```r
bus1 %>% sample_n(4,replace = T)
```

```
## # A tibble: 4 x 1
##     age
##   <dbl>
## 1    21
## 2    17
## 3    20
## 4    20
```
]

.pull-right[
```r
bus2 %>% sample_n(4,replace = T)
```

```
## # A tibble: 4 x 1
##     age
##   <dbl>
## 1    18
## 2    18
## 3    37
## 4    37
```
]

---

Do this many times...

```r
boot_samples = rep(0,100)
for(i in 1:100){
  boot_samp1 = bus1 %>% sample_n(4,replace = T)
  boot_samp2 = bus2 %>% sample_n(4,replace = T)
  mu1 <- mean(boot_samp1$age)
  mu2 <- mean(boot_samp2$age)
  boot_samples[i] <- mu1 - mu2
}
tibble(means = boot_samples) %>%
  ggplot(aes(x=means)) +
  geom_density(fill="steelblue",color="white",alpha=.5) +
* geom_vline(xintercept = -15.57143,color="red",size=1) # observed difference in means
```

<img src="lecture-week-10-applications-supervised-ml-regression-ppol670_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---

### Same idea but with trees...

Grow many trees, then average the predictions:

`$$\hat{f}_{avg}(x) = \frac{\sum^B_{b=1} \hat{f}_b(x)}{B}$$`

<br>

--

.center[ <img src = "Figures/bagging-in-practice.png", width = 800> ]

---

### Random Forest

- An issue with bagging is that the trees are **_highly correlated_**.

--

- One way around this is to also take a **_random sample of predictors_** at each split (in addition to bagging). The algorithm is _not allowed_ to consider a majority of the available predictors.

--

- The number of predictors the algorithm is allowed to consider at each split (`mtry`) is a **_tuning parameter_**.

.center[ <img src = "Figures/random-forest.png", width = 300> ]
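---

### Random Forest in R

A minimal sketch of fitting a random forest with the `randomForest` package. The `Hitters` data from `ISLR` and the settings (`ntree`, `mtry = 6`) are illustrative assumptions only; in practice `mtry` would be tuned with cross-validation.

```r
library(randomForest)
library(ISLR)
set.seed(123)

hitters <- na.omit(Hitters)  # randomForest does not allow missing values

# mtry = number of predictors randomly sampled at each split (tuning parameter);
# setting mtry equal to the total number of predictors would be plain bagging
rf_fit <- randomForest(log(Salary) ~ ., data = hitters,
                       ntree = 500, mtry = 6, importance = TRUE)

rf_fit              # out-of-bag (bootstrap) error summary
importance(rf_fit)  # variable importance across the ensemble
```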