# PPOL561 | Accelerated Statistics for Public Policy II Week 4 OLS, Confounders, & Simulation
###  Prof. Eric Dunford  ◆  Georgetown University  ◆  McCourt School of Public Policy  ◆  <a href="mailto:eric.dunford@georgetown.edu" class="email">eric.dunford@georgetown.edu</a>

---

<div class="slide-footer"> 
PPOL561 | Accelerated Statistics for Public Policy II

&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;

Week 4

&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;

OLS, Confounders, & Simulation

</div>

---
class: outline

# Outline for Today

- Using **_simulation_** as a tool to better understand statistical concepts

- **_Precision_** of estimates

- Different types of **_confounding_** and how to deal with them:

  - **_omitted variable bias_**
  
  - **_collider bias_**
  
  - **_measurement error_**
  
  - **_missingness_**

---

# Simulations

---

## Generating random distributions

.center[| Distribution | Function | Arguments |   
| ----- | ------ |  ---------- |
| Normal (Gaussian) | `rnorm()` | `n=`,`mean=`,`sd=`|
| Binomial | `rbinom()` | `n=`, `size=`, `prob=` |
| Uniform | `runif()` | `n=`, `min=`, `max=` |
| Poisson | `rpois()` | `n=`, `lambda=` |
| Negative Binomial | `rnbinom()` | `n=`, `size=`,`prob=`, `mu=`|
| Beta | `rbeta()` | `n=`, `shape1=`, `shape2=` |
| Chi-Squared | `rchisq()` | `n=`, `df=` |
| Exponential | `rexp()` | `n=`, `rate=` |
| Gamma | `rgamma()` | `n=`, `shape=`, `rate=`, `scale=` |
]

And many more...

---

## Getting a feeling for the shape...
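
A quick way to see the shapes is to draw from a few of these generators and plot histograms. This is only a sketch: the parameter values, the seed, and the object name `draws` are arbitrary choices for illustration.

```r
set.seed(561)  # arbitrary seed so the draws are reproducible
n <- 1000

# A handful of the generators from the table, with illustrative parameters
draws <- list(
  "Normal(0, 1)"      = rnorm(n, mean = 0, sd = 1),
  "Binomial(10, 0.3)" = rbinom(n, size = 10, prob = 0.3),
  "Uniform(0, 1)"     = runif(n, min = 0, max = 1),
  "Poisson(3)"        = rpois(n, lambda = 3),
  "Exponential(1)"    = rexp(n, rate = 1),
  "Gamma(2, 1)"       = rgamma(n, shape = 2, rate = 1)
)

# One histogram per distribution in a 2 x 3 grid
op <- par(mfrow = c(2, 3))
for (nm in names(draws)) {
  hist(draws[[nm]], main = nm, xlab = "", col = "grey30", border = "white")
}
par(op)  # restore the previous plotting layout
```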

---

## Why simulate?

![:space 3]

- **We know the answer**: we can specify values for the slope and see if we can recover them.

- **Viable testing ground**: 
  - we can break models on purpose;
  - we can build the symptoms that cause a model to break down and check whether the prescribed corrections actually correct them;
  - simulation offers us a way to make sure we're actually solving the problem.

- **Use as a tool to gain an intuitive understanding of statistical concepts**

---

## The Aim

![:space 10]

The goal is to **mimic the properties of the model** that we're aiming to examine.

![:space 5]

For example, for OLS to be the best linear unbiased estimator, we need:

- `$E[\epsilon] = 0$`

- `$var(\epsilon)$` is constant.

- `$cor(\textbf{X},\epsilon) = 0$`

---

## Simulating error

We can easily simulate an error term that satisfies these assumptions:

```r
error <- rnorm(n = 1000, mean = 0, sd = 1)
hist(error, col = "grey30", border = "white", breaks = 30)  # hist() takes breaks=, not binwidth=
```

---

### "Ideal" error

![:space 10]

```r
mean(error) # expected value approx. 0
```

```
## [1] -0.04546332
```

```r
var(error) # variance approx. 1 (sd = 1), the same for every draw
```

```
## [1] 0.9677243
```

---

## Simulating an independent variable

This synthetic variable could be **normal**

```r
x <- rnorm(n = 1000, mean = 0, sd = 1)
hist(x,col="steelblue",border="white")
```

---

Or **uniform**

```r
x2 <- runif(n = 1000,min = 0,max = 100)
hist(x2,col="forestgreen",border="white")
```

---

Or **binomial**

```r
x3 <- rbinom(n = 1000,size = 1,prob = .3)
hist(x3,col="gold",border="white")
```

OLS makes no distributional assumptions about the independent variables; its assumptions concern the error term, and through it the dependent variable.

---

## Simulating the dependent variable

![:space 5]

Recall that `$y_i$`, our continuous outcome, is thought to be a linear function of our independent variables.

We want to simulate a `$y$` that is a **function** of `$x$`, plus some error.

```r
# True parameter values (we choose these, so we know the answer)
intercept <- 1
slope <- 2

# Simulate y as a function of x + error
y <- intercept + slope * x + error
```

---

![:space 5]

```r
# Plot
plot(x,y,pch = 16, col=scales::alpha('grey30',.5))
```

---

```r
# Estimate a linear model...
model = lm(y ~ x)
alpha = model$coefficients['(Intercept)']
beta = model$coefficients['x']

# Scatter Plot
plot(x,y,pch = 16, col=scales::alpha('grey30',.5))

# Plot the fitted line...
abline(alpha,beta,col="blue",lwd=4) # best linear unbiased estimator
```
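
Because we chose the intercept and slope ourselves, we can check how close the estimates come. The sketch below re-creates the simulation so that it runs on its own; the seed is an arbitrary choice.

```r
set.seed(561)  # arbitrary seed for reproducibility
n <- 1000
x <- rnorm(n, mean = 0, sd = 1)
error <- rnorm(n, mean = 0, sd = 1)

intercept <- 1
slope <- 2
y <- intercept + slope * x + error

model <- lm(y ~ x)

# Estimates next to the values used to generate the data
cbind(truth = c(intercept, slope), estimate = coef(model))
```

The estimates will not match the truth exactly in any single sample, but they should land close to 1 and 2.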

---

## Breakout

Simulate the following model,

`$$y_i = 2.5 - 1.5x_i + 1.3z_i - s_i + \epsilon_i$$`

Then do the following:

1. Run a Monte Carlo simulation. Rerun the simulation 100 times and plot the coefficient estimates as histograms.

2. Try to make the model "wrong".
  
  - Suggestions:
      - What happens if you change the variance of the error?
      - What happens if you change the mean of the error to something other than 0?
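
One possible starting point for step 1, offered as a sketch rather than the official solution. The helper name `simulate_once`, the sample size of 500, and the standard normal regressors and errors are all assumptions made for illustration.

```r
set.seed(561)  # arbitrary seed for reproducibility

# Hypothetical helper: simulate one data set and return the OLS coefficients
simulate_once <- function(n = 500) {
  x <- rnorm(n)
  z <- rnorm(n)
  s <- rnorm(n)
  y <- 2.5 - 1.5 * x + 1.3 * z - s + rnorm(n)
  coef(lm(y ~ x + z + s))
}

# 100 replications; one row per simulation, one column per coefficient
estimates <- t(replicate(100, simulate_once()))
truth <- c(2.5, -1.5, 1.3, -1)

# Histogram of each coefficient with the true value marked in red
op <- par(mfrow = c(2, 2))
for (j in seq_len(ncol(estimates))) {
  hist(estimates[, j], main = colnames(estimates)[j], xlab = "estimate",
       col = "grey30", border = "white")
  abline(v = truth[j], col = "red", lwd = 2)
}
par(op)
```

For step 2, changing the `rnorm(n)` error term inside `simulate_once()` (for example its `sd` or `mean`) is one way to see which coefficients move.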

---

# Precision of Estimates

---

## Variance of Estimates

Variance of a coefficient estimate in a multivariate model:

`$$var(\hat{\beta_j}) = \frac{\hat{\sigma}^2}{N\times var(x_j)(1-R^2_j)}$$`

where `$R^2_j$` is the `$R^2$` from an "auxiliary regression" of `$x_j$` on the other independent variables,

and

`$$\hat{\sigma}^2 = \frac{\sum_{i=1}^N(y_i - \hat{y_i})^2}{N-k}$$`

where `$k$` is the number of parameters in the model.

---

## Auxiliary Regressions

There is a different `$R^2_j$` for each independent variable. If our model is

`$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i$$`

there will be two different `$R^2_j$`s:

- `$R^2_1$` is the `$R^2$` from `$x_{1i} = \gamma_0 + \gamma_1 x_{2i} + \tau_i$`

- `$R^2_2$` is the `$R^2$` from `$x_{2i} = \phi_0 + \phi_1 x_{1i} + \omega_i$`

These `$R^2_j$`s tell us how much the other variables explain `$x_j$`.
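
To see how the pieces of the variance formula fit together, the sketch below computes `$se(\hat{\beta_1})$` by hand and compares it with the value reported by `lm()`. The data-generating values are arbitrary. One detail is an interpretation on my part: `$N \times var(x_j)$` is taken as the sum of squared deviations of `$x_j$` from its mean (variance with an `$N$` divisor), not `N * var(x)` using R's `$N-1$` divisor.

```r
set.seed(561)  # arbitrary seed for reproducibility
n  <- 500
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)            # x2 is correlated with x1
y  <- 1 + 2 * x1 - 1 * x2 + rnorm(n)

model <- lm(y ~ x1 + x2)
k <- length(coef(model))             # number of parameters, incl. the intercept
sigma2_hat <- sum(resid(model)^2) / (n - k)

# Auxiliary regression: how much of x1 do the other regressors explain?
r2_1 <- summary(lm(x1 ~ x2))$r.squared

# N * var(x1) with an N divisor = sum of squared deviations from the mean
n_var_x1 <- sum((x1 - mean(x1))^2)

se_by_hand <- sqrt(sigma2_hat / (n_var_x1 * (1 - r2_1)))
se_from_lm <- summary(model)$coefficients["x1", "Std. Error"]

c(by_hand = se_by_hand, from_lm = se_from_lm)
```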

---

## Multicollinearity

Multicollinearity refers to the **strength of linear relationships among independent variables**

![:space 5]

1. Multicollinearity causes the variance of `$\hat{\beta_1}$` to be higher than if there were no multicollinearity.

2. Multicollinearity does not cause the `$\hat{\beta_1}$` estimates to be biased.

3. The standard `$se(\hat{\beta_1})$` produced by OLS accounts for multicollinearity.
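
A small simulation can illustrate the first two points. This is a sketch under assumed values: the helper `sim_slopes`, the correlations 0 and 0.95, and the sample size are chosen only for illustration.

```r
set.seed(561)  # arbitrary seed for reproducibility

# Hypothetical helper: repeatedly estimate the slope on x1 when
# x1 and x2 have correlation of roughly `rho`
sim_slopes <- function(rho, n = 200, reps = 500) {
  replicate(reps, {
    x1 <- rnorm(n)
    x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
    y  <- 1 + 2 * x1 + 1 * x2 + rnorm(n)  # true slope on x1 is 2
    unname(coef(lm(y ~ x1 + x2))["x1"])
  })
}

low  <- sim_slopes(rho = 0)
high <- sim_slopes(rho = 0.95)

# Mean (bias check) and spread (precision check) of the estimates
rbind(
  low_collinearity  = c(mean = mean(low),  sd = sd(low)),
  high_collinearity = c(mean = mean(high), sd = sd(high))
)
```

Both sets of estimates should center near 2, while the spread under high collinearity should be noticeably larger.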

---

## Consistency

- Connection to the variance of the coefficient equation

- Connection to statistical power
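
As a rough illustration of the first connection, the sketch below shows the spread of the slope estimates shrinking as `$N$` grows, which is what the `$N$` in the denominator of the variance formula implies. The helper `slope_sd` and the sample sizes are assumptions for illustration.

```r
set.seed(561)  # arbitrary seed for reproducibility

# Hypothetical helper: standard deviation of the slope estimate
# across repeated samples of size n
slope_sd <- function(n, reps = 300) {
  sd(replicate(reps, {
    x <- rnorm(n)
    y <- 1 + 2 * x + rnorm(n)
    unname(coef(lm(y ~ x))["x"])
  }))
}

sizes  <- c(50, 500, 5000)
spread <- sapply(sizes, slope_sd)

# The spread of the estimates falls as the sample size grows
data.frame(n = sizes, sd_of_slope = spread)
```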

---

## Confidence Intervals

Interpreting a 95% confidence interval:

1. The lower bound of a 95% confidence interval will be a value of `$\beta_1$` such that there is less than a 2.5% probability of observing a `$\hat{\beta_1}$` as high as the `$\hat{\beta_1}$` actually observed.

2. The upper bound of a 95% confidence interval will be a value of `$\beta_1$` such that there is less than a 2.5% probability of observing a `$\hat{\beta_1}$` as low as the `$\hat{\beta_1}$` actually observed.

.center[
| Confidence Level | Critical Value | Confidence Interval |
| ----- | ------ | -------|
| 90% | 1.64 | `$\hat{\beta_1} \pm 1.64 \times se(\hat{\beta_1})$` |
| 95% | 1.96 | `$\hat{\beta_1} \pm 1.96 \times se(\hat{\beta_1})$` |
| 99% | 2.58 | `$\hat{\beta_1} \pm 2.58 \times se(\hat{\beta_1})$` |
]
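
The critical values in the table come from the standard normal distribution, so they can be reproduced with `qnorm()`. The sketch below also builds a 95% interval by hand on simulated data (arbitrary values) and compares it with `confint()`, which uses the t distribution and so differs slightly in small samples.

```r
set.seed(561)  # arbitrary seed for reproducibility
n <- 1000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
model <- lm(y ~ x)

b  <- unname(coef(model)["x"])
se <- summary(model)$coefficients["x", "Std. Error"]

# Two-sided critical values for 90%, 95% and 99% confidence
crit <- qnorm(c(0.95, 0.975, 0.995))
round(crit, 2)  # 1.64 1.96 2.58

# 95% interval by hand: estimate +/- 1.96 * standard error
ci_by_hand <- b + c(-1, 1) * crit[2] * se

# Interval from confint() (t-based)
ci_confint <- as.numeric(confint(model, "x", level = 0.95))

rbind(by_hand = ci_by_hand, confint = ci_confint)
```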

---

## Precision: summary

`$$var(\hat{\beta_j}) = \frac{\hat{\sigma}^2}{N\times var(x_j)(1-R^2_j)}$$`

**Four factors influence the variance of multivariate `$\hat{\beta_j}$` estimates**:

1. **Model fit**: the better the model fits, the lower the `$\hat{\sigma}^2$` and `$var(\hat{\beta_j})$` will be.
2. **Variation in `$x_j$`**: the more `$x_j$` varies, the lower the `$var(\hat{\beta_j})$` will be.
3. **Sample size**: the more observations, the lower the `$var(\hat{\beta_j})$` will be.
4. **Multicollinearity**: the less the other independent variables explain `$x_j$`, the lower the `$R^2_j$` and `$var(\hat{\beta_j})$` will be.

---

# Confounding

![:space 5]

**![:text_color white](And what to do about it...)**

---

## Omitted Variable Bias

![:space 5]
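
A minimal simulation of omitted variable bias, assuming a single confounder `z` that affects both `x` and `y`. All effect sizes are arbitrary illustrative values.

```r
set.seed(561)  # arbitrary seed for reproducibility
n <- 5000
z <- rnorm(n)                         # confounder
x <- 0.8 * z + rnorm(n)               # z affects x
y <- 1 + 2 * x + 1.5 * z + rnorm(n)   # z also affects y; true effect of x is 2

# Omitting z: the slope on x absorbs part of z's effect
b_short <- unname(coef(lm(y ~ x))["x"])

# Controlling for z: the slope on x is close to 2
b_long <- unname(coef(lm(y ~ x + z))["x"])

c(omitting_z = b_short, controlling_z = b_long)
```

With these values the bias is roughly `1.5 * cov(x, z) / var(x)`, which is about 0.73, so the short regression should give a slope near 2.73.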

---

## Collider Bias

![:space 5]

---

## Collider Bias

Note that **_collider bias_** can occur even when one controls for a **_collider's descendant!_**
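
A sketch of both cases, assuming `x` and `y` are unrelated, both cause a collider, and the collider in turn causes a descendant. The variable names and effect sizes are illustrative.

```r
set.seed(561)  # arbitrary seed for reproducibility
n <- 5000
x <- rnorm(n)
y <- rnorm(n)                       # x has no effect on y
collider   <- x + y + rnorm(n)      # caused by both x and y
descendant <- collider + rnorm(n)   # caused by the collider

# No controls: the slope on x is close to 0, as it should be
b_none <- unname(coef(lm(y ~ x))["x"])

# Controlling for the collider opens a spurious path between x and y
b_coll <- unname(coef(lm(y ~ x + collider))["x"])

# Controlling for the descendant does the same, to a lesser degree
b_desc <- unname(coef(lm(y ~ x + descendant))["x"])

c(no_control = b_none, collider = b_coll, descendant = b_desc)
```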

---

## Backdoor Adjustment

---

## Backdoor Adjustment

---

## Breakout

Simulate the following DAG. Set the effect size for each relationship to 1. Locate the minimal control set that satisfies the backdoor criterion. (_Extra_: Run your simulation 1000 times with 500 observations each and plot the distribution of the estimated effect ( `$\beta$` ) of X on Y.)

---

## Frontdoor Adjustment

![:space 5]
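
A sketch of frontdoor adjustment in a linear setting. The DAG is an assumption for illustration: an unobserved `u` confounds `x` and `y`, and `x` affects `y` only through a mediator `m`. Every effect is set to 1.

```r
set.seed(561)  # arbitrary seed for reproducibility
n <- 5000
u <- rnorm(n)               # unobserved confounder of x and y
x <- u + rnorm(n)
m <- 1 * x + rnorm(n)       # x -> m
y <- 1 * m + u + rnorm(n)   # m -> y; true effect of x on y is 1 * 1 = 1

# Naive regression is biased because u is omitted
naive <- unname(coef(lm(y ~ x))["x"])

# Frontdoor: (effect of x on m) * (effect of m on y, holding x fixed)
a <- unname(coef(lm(m ~ x))["x"])
b <- unname(coef(lm(y ~ m + x))["m"])
frontdoor <- a * b

c(naive = naive, frontdoor = frontdoor)
```

The naive slope should be near 1.5 here, while the frontdoor estimate should be near the true effect of 1, without ever observing `u`.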

---

## Measurement Error

![:space 5]
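
A sketch of one common case, classical measurement error in an independent variable, which is an assumption about what this slide covers. Here the observed `x` is the true `x` plus independent noise, and the slope is pulled toward zero.

```r
set.seed(561)  # arbitrary seed for reproducibility
n <- 5000
x_true <- rnorm(n)                    # the variable we would like to measure
y <- 1 + 2 * x_true + rnorm(n)        # true slope is 2
x_obs <- x_true + rnorm(n, sd = 1)    # what we actually observe: x plus noise

b_true <- unname(coef(lm(y ~ x_true))["x_true"])
b_obs  <- unname(coef(lm(y ~ x_obs))["x_obs"])

# Attenuation factor: var(x_true) / (var(x_true) + var(noise)) = 1 / 2
c(using_true_x = b_true, using_noisy_x = b_obs)
```

With equal variances for the signal and the noise, the slope on the noisy measure should be roughly half the true slope.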

---

## Missing Data

![:space 5]

- **Missing completely at random** (MCAR) - missingness is independent of both the observed and the unobserved data. (Essentially just a form of measurement error.) 
  + Example: the data were randomly split into training and test sets, and the test data fell off the truck.

- **Missing at random** (MAR) - data are systematically missing, but the missingness depends only on data we can observe.

- **Missing not at random** (MNAR) - data are systematically missing, and the missingness depends on something we cannot observe.
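
The three mechanisms can be compared in a simulated survey. This is a sketch with made-up response probabilities: under MAR, response depends on an observed gender indicator; under MNAR, it depends on the outcome itself.

```r
set.seed(561)  # arbitrary seed for reproducibility
n <- 10000
female <- rbinom(n, size = 1, prob = 0.5)   # observed for everyone
y <- 50 + 10 * female + rnorm(n, sd = 5)    # outcome of interest

# TRUE = the respondent's y is observed
r_mcar <- rbinom(n, 1, prob = 0.5) == 1
r_mar  <- rbinom(n, 1, prob = ifelse(female == 1, 0.8, 0.3)) == 1
r_mnar <- rbinom(n, 1, prob = plogis((y - 55) / 5)) == 1

# Under MAR we can adjust: reweight respondent means by the
# gender shares in the full sample
group_means  <- tapply(y[r_mar], female[r_mar], mean)
group_shares <- prop.table(table(female))
mar_adjusted <- sum(as.numeric(group_means) * as.numeric(group_shares))

means <- c(
  full_data    = mean(y),
  mcar         = mean(y[r_mcar]),
  mar_naive    = mean(y[r_mar]),
  mar_adjusted = mar_adjusted,
  mnar         = mean(y[r_mnar])
)
round(means, 2)
```

The MCAR and adjusted MAR means should sit close to the full-data mean, while the naive MAR and MNAR means should be too high. No adjustment on observed variables is available in the MNAR case.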

---

## Missing at random

Example: _Female survey respondents are more likely to complete the survey than their male counterparts. We can observe everyone's gender._

![:space 2]

---

## Missing not at random

Example: _Those more interested in politics are more likely to respond to a political survey than those who are less interested._ We cannot observe someone's latent interest in politics.