class: center, middle, inverse, title-slide

# PPOL561 | Accelerated Statistics for Public Policy II <br> Week 12 <br> Generalized Linear Models

### Prof. Eric Dunford ◆ Georgetown University ◆ McCourt School of Public Policy ◆ eric.dunford@georgetown.edu
---
layout: true

<div class="slide-footer"><span> PPOL561 | Accelerated Statistics for Public Policy II           Week 12 <!-- Week of the Footer Here -->              GLMs <!-- Title of the lecture here --> </span></div>

---
class: outline

# Outline for Today

<br>

- **_Linear Probability Models_** and their discontents

<br>

- **_Maximum Likelihood Estimation_** for binary responses

<br>

- It's all about the **_Substantive Effects_** and generating estimates of **_Uncertainty_** around substantive predictions.

---

# Limited dependent variables

![:space 5]

- **Binary** – vote or not; go to war or not

- **Multinomial** – voted for Clinton, Trump, Johnson, or Stein

- **Ordinal** – strongly oppose war, somewhat oppose war, somewhat support war, strongly support war

- **Counts** – number of bills passed; number of conflicts; number of organizations joined; number of puppies hugged

- **Duration** – time to adopt full suffrage; time to bill passage; time to end of regime

---
class: newsection

# Binary Dependent Variables

---

## The problem

When we have a binary outcome (0/1), we can use OLS to estimate the relationship. This is known as a **linear probability model** (LPM).

--

We start with the usual equation:

`$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$`

Here `\(y \in \{0,1\}\)`

- The probability of "success": `\(pr(y_i = 1) = p_i\)`
- The probability of "failure": `\(pr(y_i = 0) = 1 - p_i\)`

--

`\(y\)` follows a **Bernoulli** distribution

$$ E(y_i) = 0 \times (1-p_i) + 1 \times p_i = p_i$$

$$ E(y_i | x_i) = \beta_0 + \beta_1 x_i = p_i $$

---

## The problem

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

--

.center[
<table>
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 0.690 </td> <td style="text-align:right;"> 0.034 </td> <td style="text-align:right;"> 20.222 </td> <td style="text-align:right;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> x </td> <td style="text-align:right;"> 0.328 </td> <td style="text-align:right;"> 0.037 </td> <td style="text-align:right;"> 8.756 </td> <td style="text-align:right;"> 0 </td> </tr>
</tbody>
</table>
]

---

## The problem

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

.center[
<table>
 <thead>
  <tr> <th style="text-align:left;"> term </th> <th style="text-align:left;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:left;"> 69% </td> <td style="text-align:right;"> 0.034 </td> <td style="text-align:right;"> 20.222 </td> <td style="text-align:right;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> x </td> <td style="text-align:left;"> 32.8% </td> <td style="text-align:right;"> 0.037 </td> <td style="text-align:right;"> 8.756 </td> <td style="text-align:right;"> 0 </td> </tr>
</tbody>
</table>
]

---

## The problem

- OLS may produce **nonsensical predictions** (recall that OLS assumes continuous, interval-level data)

--

  + But we can **recode**

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
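---

## The problem

A minimal sketch (illustrative; not part of the original example) that simulates binary data, fits an LPM with `lm()`, and checks whether the fitted "probabilities" stay inside the unit interval:

```r
set.seed(42)
n <- 500
x <- rnorm(n)
p <- pnorm(-0.5 + 1.5*x)            # true probability of success
y <- rbinom(n, size = 1, prob = p)  # observed binary outcome

lpm <- lm(y ~ x)                    # linear probability model

range(fitted(lpm))                  # note: can fall outside [0, 1]
mean(fitted(lpm) < 0 | fitted(lpm) > 1)  # share of out-of-range predictions
```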
---

## The problem

- OLS may produce **nonsensical predictions** (recall that OLS assumes continuous, interval-level data)

- Disturbances are <u>_not_</u> normally distributed; they follow the Bernoulli distribution.

--

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## The problem

- OLS may produce **nonsensical predictions** (recall that OLS assumes continuous, interval-level data)

- Disturbances are <u>_not_</u> normally distributed; they follow the Bernoulli distribution.

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## The problem

- OLS may produce **nonsensical predictions** (recall that OLS assumes continuous, interval-level data)

- Disturbances are <u>_not_</u> normally distributed; they follow the Bernoulli distribution.

- `\(y_i\)` is distributed with mean `\(p_i\)` and variance `\(p_i(1-p_i)\)`. `\(p_i\)` is a function of `\(x\)`, so we have **heteroscedasticity** with variance that depends on `\(x\)`.

  + But we can use weighted least squares to adjust for this.

--

- The linearity assumption is suspect

- Goodness-of-fit measures are even less helpful

---

## The solution

![:space 5]

**_Use a model that fits the functional form of the outcome variable_**!

![:space 5]

Two ways of thinking about doing this (both point to the same modeling strategy):

![:space 3]

- **_Latent variables_**

- **_Generalized linear models_**

---

### Latent variable way of thinking

- Recall we only observe `\(y = 1\)` (success) and `\(y = 0\)` (failure).

--

- But assume there is some underlying process with a continuous distribution `\(y^*\)`, such that

`$$y_i^* = \beta_0 + \beta_1 x_i + \epsilon_i$$`

--

- where `\(y_i = 1\)` (success) if `\(y_i^* > \tau\)` (some threshold) and `\(y_i = 0\)` otherwise (failure)

`$$y_i = \begin{cases}1~~~\text{if}~y^* > \tau \\ 0~~~\text{if}~y^* \le \tau \end{cases}$$`

- The threshold ( `\(\tau\)` ) is some point on the latent distribution. For simplicity, we can set `\(\tau = 0\)`.

---

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

- `\(y^*\)` is continuous, so we avoid the problems encountered using the LPM. However, `\(y^*\)` is unobserved, so we cannot estimate it using OLS.

- Instead we use Maximum Likelihood Estimation, which requires assumptions about the **distribution of the errors**.

--

- We can write the probability of success `\(p_i\)` as (recall `\(\tau = 0\)`)

`$$p_i = pr(\beta_0 + \beta_1 x_i + \epsilon_i > 0)$$`

`$$p_i = pr(\epsilon_i > -\beta_0 - \beta_1 x_i)$$`

`$$p_i = pr(\epsilon_i < \beta_0 + \beta_1 x_i)~~~\text{(by symmetry of the error distribution)}$$`

`$$p_i = F(\beta_0 + \beta_1 x_i)$$`

--

- Where `\(F(.)\)` is the cumulative distribution function (CDF) of `\(\epsilon\)`. We can evaluate the CDF of `\(\epsilon_i\)` at `\(\beta_0 + \beta_1 x_i\)` to get the probability of success. But we need to pick a form for `\(F(.)\)`.

---

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

The **logistic (logit)** and **Normal (probit)** distributions are used frequently.

<br><br><br><br>

.center[

|  | CDF | PDF |
|---|-----|-----|
| **Normal** | `\(\Phi(\epsilon) = \int^\epsilon_{-\infty} \frac{1}{\sqrt{2 \pi}} e^{-\frac{t^2}{2}} dt~~\)` | `\(\phi(\epsilon) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{\epsilon^2}{2}}\)` |
| **Logistic** | `\(\Lambda(\epsilon) = \frac{e^\epsilon}{1 + e^\epsilon}\)` | `\(\lambda(\epsilon) = \frac{e^\epsilon}{(1 + e^\epsilon)^2}\)` |

]
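---

These forms are exactly R's built-in distribution functions. A quick check (illustrative; not part of the original slides) that the closed-form expressions in the table match `pnorm()`/`dnorm()` and `plogis()`/`dlogis()`:

```r
eps <- 1.5   # an arbitrary evaluation point

# Logistic CDF and PDF vs. the closed-form expressions
c(plogis(eps), exp(eps) / (1 + exp(eps)))
c(dlogis(eps), exp(eps) / (1 + exp(eps))^2)

# Normal CDF and PDF
c(pnorm(eps), integrate(dnorm, -Inf, eps)$value)
c(dnorm(eps), exp(-eps^2 / 2) / sqrt(2 * pi))
```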
---

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

### Generalized Linear Model way of thinking

- Rather than thinking in terms of a latent variable, let's directly model the probability of an event by transforming `\(pr(y_i = 1 | x)\)` into a quantity that ranges from `\(-\infty\)` to `\(\infty\)`.

--

- First, we transform the outcome into the **odds**

`$$\frac{pr(y_i = 1 | x)}{pr(y_i = 0 | x)} = \frac{pr(y_i = 1 | x)}{1 - pr(y_i = 1 | x)}$$`

- **Odds** indicate how often something happens relative to how often it doesn't happen.

--

- Then we take the **log of the odds**, which ranges from `\(-\infty\)` to `\(\infty\)`.

`$$ln\left[\frac{pr(y_i = 1 | x)}{1 - pr(y_i = 1 | x)}\right] = X \beta$$`

---

### Generalized Linear Model way of thinking

- This is equivalent to the logit model we've already discussed.

`$$pr(y_i = 1|X) = \frac{e^{X\beta}}{1+e^{X\beta}} = \frac{1}{1+e^{-X\beta}}$$`

--

- Other probability models can be constructed by choosing functions of `\(X\beta\)` that range from 0 to 1. As we've seen, CDFs have this property.

- The CDF of the standard normal distribution results in the probit model.

`$$pr(y_i = 1|X) = \int^{X\beta}_{-\infty} \frac{1}{\sqrt{2 \pi}} e^{-\frac{t^2}{2}} dt = \Phi(X\beta)$$`

---

In essence, our function `\(F(\cdot)\)` (the inverse of what is known as the **link function**) maps our linear combination of independent variables ( `\(X\beta\)` ) onto a probability space (ranging from 0 to 1).

$$ F(X\beta) \mapsto [0,1]$$

.center[<img src="Figures/lin-to-pr-space.gif" width=400 >]

---

Recall that the outcome `\(y_i\)` follows a **Bernoulli distribution**.

`$$y_i \sim bernoulli(p_i)$$`

`$$y_i \sim p_i^{y_i}(1-p_i)^{1-y_i}$$`

--

<br>

We can plug our linear combination of predictors into this probability distribution. All we need to ensure is that the linear predictor maps to a probability space (this is our "probability model").

$$ \pi_i =\Phi(\beta_0 + \beta_1 x_i) $$

<br>

`$$y_i \sim bernoulli(\pi_i)$$`

`$$y_i \sim \pi_i^{y_i}(1-\pi_i)^{1-y_i}$$`
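---

To make the mapping concrete, here is a small sketch (illustrative; not from the original code) showing how `pnorm()` and `plogis()` squash any value of the linear predictor `\(X\beta\)` into the unit interval:

```r
xb <- seq(-4, 4, by = 2)  # some arbitrary values of the linear predictor

pnorm(xb)   # probit: Phi(xb), always between 0 and 1
plogis(xb)  # logit: exp(xb)/(1 + exp(xb)), always between 0 and 1
```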
---

```r
set.seed(123)
N = 100
x <- rnorm(N) # random variable
y_star <- -1 + .5*x # latent variable as a linear combination

# convert to a probability space
*pr <- pnorm(y_star)

# Drop the probability into a bernoulli dist. (0 or 1)
*y <- rbinom(N, size=1, prob = pr)

# Gather as dataset
dat1 <- tibble(x,y_star,pr,y)
head(dat1)
```

```
## # A tibble: 6 x 4
##         x y_star    pr     y
##     <dbl>  <dbl> <dbl> <int>
## 1 -0.560  -1.28  0.100     0
## 2 -0.230  -1.12  0.132     1
## 3  1.56   -0.221 0.413     1
## 4  0.0705 -0.965 0.167     0
## 5  0.129  -0.935 0.175     0
## 6  1.72   -0.142 0.443     1
```

---

```r
pairs(dat1,col="steelblue")
```

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

--

```r
mod = glm(y~x,data=dat1,family=binomial(link = "probit"))
broom::tidy(mod)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -1.02      0.166     -6.16 7.32e-10
## 2 x              0.532     0.180      2.96 3.05e- 3
```

---
class:newsection

## Maximum Likelihood Estimation

---

### The Steps

- (1) Construct a probability model

$$ \pi = \Phi(X\beta)$$

--

- (2) Find the probability of the data given the parameters by constructing the likelihood function.

`$$pr(y|\beta) = \mathcal{L}(\beta|y) = \prod^n_{i=1}\pi_i^{y_i}(1-\pi_i)^{1-y_i}$$`

--

- (3) Take the log of the likelihood and simplify

`$$\log \mathcal{L}(\beta | y) = \sum_{i = 1}^n y_i \log \pi_i + \sum_{i = 1}^n (1 - y_i)\log(1 - \pi_i)$$`

- (4) Maximize the log likelihood function

---

### Optimize

`$$\log \mathcal{L}(\beta | y) = \sum_{i = 1}^n y_i \log [\Phi(X_i\beta)] + \sum_{i = 1}^n (1 - y_i)\log\left[1 - \Phi(X_i\beta)\right]$$`

Let's construct a log likelihood function for a probit model.

```r
ll.probit <- function(beta, y, X) {

  # Convert to a probability space
  pr <- pnorm(X%*%beta)

  # Calculate the log-likelihood summed over all observations
  loglik <- sum(y*log(pr)) + sum((1 - y)*log(1 - pr))

  # Return the log-likelihood
  return(loglik)
}
```

--

How might we change this to a logit function?
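---

### Optimize

One answer to that question, as a sketch (this `ll.logit()` helper is illustrative and not part of the original deck): swap the Normal CDF `pnorm()` for the logistic CDF `plogis()`.

```r
ll.logit <- function(beta, y, X) {

  # Convert to a probability space using the logistic CDF
  pr <- plogis(X%*%beta)

  # Calculate the log-likelihood summed over all observations
  loglik <- sum(y*log(pr)) + sum((1 - y)*log(1 - pr))

  # Return the log-likelihood
  return(loglik)
}
```

Everything else in the estimation machinery stays the same.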
---

### Optimize

<br>

- Many ways to **optimize an objective function**:

  + Nelder and Mead (default for `optim()`)
  + Newton-Raphson
  + BFGS (quasi-Newton-Raphson)
  + Gradient Descent
  + SANN (simulated annealing)
  + And many more...

<br>

- `R`'s `optim()` function has many common optimization algorithms contained within it.

---

```r
# optimization function to calculate maximum likelihood
probit <- function(y, X) {

  # Starting values
  init.par <- rep(0, ncol(X))

  # optimizer
  est <- optim(par = init.par,  # starting points
               fn = ll.probit,  # function to optimize
               y = y, X = X,    # Data values
               control = list(fnscale = -1), # -1 == maximize
               hessian = TRUE)  # return the hessian

  beta.hat <- est$par         # Estimates for beta
  cov <- solve(-est$hessian)  # inverse of the negative hessian == vcov matrix

  # Return as list
  res <- list(beta.hat = beta.hat, cov = cov)
  return(res)
}
```

---

```r
# Recall our simulated model
mod <- glm(y~x,data=dat1,family=binomial('probit'))
broom::tidy(mod)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -1.02      0.166     -6.16 7.32e-10
## 2 x              0.532     0.180      2.96 3.05e- 3
```

--

```r
y <- dat1$y
X <- cbind(1,dat1$x) # 1s for the constant
*mod2 <- probit(y,X) # Our homemade MLE optimizer

# Estimates
mod2$beta.hat
```

```
## [1] -1.021831  0.531962
```

```r
# Standard Errors
sqrt(diag(mod2$cov))
```

```
## [1] 0.1652268 0.1777183
```

---
class:newsection

### Substantive Effects

---

### What is the effect of `x` on `y`?

- In OLS, calculating the marginal effect is easy.

`$$y = \beta_0 + \beta_1 x$$`

`$$\frac{dy}{dx} = \beta_1$$`

--

- In MLE, less so...

`$$pr(y=1|x) = \Phi(\beta_0 + \beta_1 x)$$`

`$$\frac{d}{dx} pr(y=1|x) = \phi(\beta_0 + \beta_1 x)\beta_1$$`

where `\(\phi(\cdot)\)` (the Normal PDF) denotes the first derivative of the Normal CDF.

--

> Recall the **chain rule**: `\(F(x) = f(g(x)) \to F'(x) = f'(g(x))g'(x)\)`

---

### What is the effect of `x` on `y`?

- The effect depends on `\(\Phi(\cdot)\)`, the coefficient on `\(x\)` (i.e. `\(\beta\)`), and the values of <u>**_all the other variables and their respective coefficients_**</u>.

--

- In other words, the **_model is inherently interactive in all of the variables_**. The effect of a unit change in `\(x\)` is not constant. It depends on where we are on the curve.

--

- The output from the model merely tells us:

  + The **sign** of the effect of `\(x\)` on the probability of success
  + Overall **statistical significance** of the effect

- We need to **transform** the estimates in order to determine **_substantive significance_**.

---

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

### Marginal Effects

We can easily calculate and plot the first derivative to get a sense of the marginal effects.

```r
b = coefficients(mod)
X = model.matrix(mod)
head(X)
```

```
##   (Intercept)           x
## 1           1 -0.56047565
## 2           1 -0.23017749
## 3           1  1.55870831
## 4           1  0.07050839
## 5           1  0.12928774
## 6           1  1.71506499
```

```r
*me = dnorm(X%*%b)*b[2]
```

---

```r
dat1 %>%
  mutate(me=me) %>%
  ggplot(aes(x,me)) +
  geom_line() +
  labs(y="dy/dx")
```

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

--

But marginal effects don't always offer an intuitive interpretation of the results.

---

### Predicted probabilities

- Interest is in the **_substantive significance of the effect_**

  + Coefficients reveal little
  + Probabilities are easier to understand than log-odds
  + The effects depend on where you start on the curve and on the values of all of the other independent variables in the model

--

- Approaches for substantive interpretation using predicted probabilities

  + **_Average Case Approach_**: set all other values to means (modes for dummies)
  + **_Observed Value Approach_**: set all other values to their observed values

---

### Predicted probabilities

- Interest is in the **_substantive significance of the effect_**

  + Coefficients reveal little
  + Probabilities are easier to understand than log-odds
  + The effects depend on where you start on the curve and on the values of all of the other independent variables in the model

<br>

- The **Observed Value Approach** provides a closer connection between the results and the theory and research design (_Hanmer and Kalkan 2013_)

- The goal is to estimate the **_average effect_**, _NOT_ the **_effect for the average case._**
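---

### Predicted probabilities

The distinction in code — a minimal sketch (illustrative; not part of the original deck) using the simulated probit model `mod` from earlier:

```r
b <- coefficients(mod)

# Average Case Approach: plug in the mean of x, then predict
pnorm(b[1] + b[2]*mean(dat1$x))

# Observed Value Approach: predict at every observed x, then average
mean(pnorm(b[1] + b[2]*dat1$x))
```

With real data and more covariates, the two quantities can diverge substantially.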
---

### Example Data

- Post-election survey taken after the 2000 presidential election (roughly 2,188 responses in these data; non-responses on the voting outcome are dropped).

- Question: does how far in advance of the election voter registration closes affect an individual's propensity to vote?

```r
head(elect)
```

```
## # A tibble: 6 x 7
##    vote close homeown edu7cat   age marriage  male
##   <dbl> <dbl>   <dbl>   <dbl> <dbl>    <dbl> <dbl>
## 1     0    14       1       1    44        0     1
## 2     1    25       0       1    64        0     0
## 3     1    29       1       1    68        1     0
## 4     1    30       1       1    65        1     0
## 5     0    30       1       1    46        1     1
## 6     1    29       0       1    59        1     1
```

---

### Example Data

- Post-election survey taken after the 2000 presidential election (roughly 2,188 responses in these data; non-responses on the voting outcome are dropped).

- Question: does how far in advance of the election voter registration closes affect an individual's propensity to vote?

```r
mod2 <- glm(vote~close+male+age,
            data=elect,family=binomial('probit'))
broom::tidy(mod2) %>%
  mutate_if(is.numeric,function(x) round(x,3))
```

```
## # A tibble: 4 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   -0.288     0.106    -2.71    0.007
## 2 close         -0.006     0.003    -2.20    0.028
## 3 male          -0.004     0.057    -0.073   0.942
## 4 age            0.019     0.002    11.2     0
```

---

Who are we trying to predict?

```r
b <- coefficients(mod2)
z <- b[1] +    # intercept
  b[2]*15 +    # close
  b[3]*1 +     # male
  b[4]*35      # age
pnorm(z) # probability of voting
```

```
## (Intercept)
##   0.6103261
```

--

```r
z <- b[1] +    # intercept
  b[2]*0 +     # close
  b[3]*0 +     # male
  b[4]*76      # age
pnorm(z) # probability of voting
```

```
## (Intercept)
##   0.8775209
```

---

### Predicted probabilities

<br>
<br>

**The Recipe**:

1. Isolate a single (discrete or continuous) variable.
2. Manipulate only that value.
3. Calculate the prediction and average across the observations.
4. Compare to some other condition.

---

```r
X = model.matrix(mod2)
head(X)
```

```
##   (Intercept) close male age
## 1           1    14    1  44
## 2           1    25    0  64
## 3           1    29    0  68
## 4           1    30    0  65
## 5           1    30    1  46
## 6           1    29    1  59
```

```r
# Alter a condition (Register at the poll)
X[,2] = 0
```

```r
# Predict
prob_sameday = pnorm(X%*%b)
```

```r
# average
mean(prob_sameday)
```

```
## [1] 0.7102054
```

---

```r
X = model.matrix(mod2)
head(X)
```

```
##   (Intercept) close male age
## 1           1    14    1  44
## 2           1    25    0  64
## 3           1    29    0  68
## 4           1    30    0  65
## 5           1    30    1  46
## 6           1    29    1  59
```

```r
# Alter a condition (Registration closes 30 days out)
X[,2] = 30
```

```r
# Predict
prob_30days = pnorm(X%*%b)
```

```r
# average
mean(prob_30days)
```

```
## [1] 0.6451213
```

---

## Discrete Difference

We can compare two predictions (where only one data value is altered) by differencing the predictions.

--

```r
mean(prob_30days) - mean(prob_sameday)
```

```
## [1] -0.06508412
```

--

<br>

When registration closes 30 days prior to election day, the predicted probability of voting decreases by about 6.5 percentage points!

--

<br>

What's one issue with this conclusion?

--

![:space 5]

.center[**_No estimate of uncertainty!_**]
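---

As a sanity check on the hand-rolled `pnorm(X%*%b)` calculations, base R's `predict()` produces the same observed-value predictions (a sketch; not part of the original code — it should match the averages computed above):

```r
# Same-day registration: set close to 0 for everyone, then average
mean(predict(mod2, newdata = transform(elect, close = 0), type = "response"))

# Registration closes 30 days out
mean(predict(mod2, newdata = transform(elect, close = 30), type = "response"))
```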
---

## Monte Carlo Simulation

- We need to calculate uncertainty around our predictions. _But_ we can't directly calculate our confidence for any one prediction, so we need to simulate it.

--

- The aim is to:

  + **simulate** many plausible configurations of the **coefficients**,
  + **compute** the resulting **predictions** for each simulated beta, and then
  + **calculate** the 95% **confidence interval** using the resulting distribution.

--

- To do this, we can:

  + take repeated random draws from a simulated distribution of the model coefficients
  + use information from the model regarding the bounds of this distribution (central tendency, covariance)

---

```r
# Simulate the coefficients
betas = coefficients(mod2) # coefficient estimates
sigma = vcov(mod2) # variance-covariance matrix
sigma
```

```
##               (Intercept)         close          male           age
## (Intercept)  0.0112984032 -1.831202e-04 -1.801604e-03 -1.262345e-04
## close       -0.0001831202  8.383407e-06  3.785560e-06 -1.576920e-07
## male        -0.0018016042  3.785560e-06  3.213478e-03  4.546818e-06
## age         -0.0001262345 -1.576920e-07  4.546818e-06  2.909608e-06
```

--

Simulate the beta coefficients using a multivariate normal distribution. We can use the `mvrnorm()` function from the `MASS` package to take random draws from this distribution.

```r
set.seed(1234) # Set a seed for reproducibility
*sim_betas = MASS::mvrnorm(n=1000,mu=betas,Sigma = sigma)
```

```r
dim(sim_betas)
```

```
## [1] 1000    4
```

---

```r
betas
```

```
##  (Intercept)        close         male          age
## -0.288148097 -0.006378627 -0.004149490  0.019089884
```

```r
as_tibble(sim_betas) %>%
  gather(param,betas) %>%
  ggplot(aes(betas)) +
  geom_histogram(color="white",alpha=.5,fill="darkred") +
  facet_wrap(~param,ncol=4,scales="free")
```

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" />

---

```r
X = model.matrix(mod2)
head(X,3)
```

```
##   (Intercept) close male age
## 1           1    14    1  44
## 2           1    25    0  64
## 3           1    29    0  68
```

--

```r
X[,2] = 0 # same-day registration
```

--

Calculate the average predicted value for every draw of the simulated coefficients

```r
n_sims = nrow(sim_betas)
probs_sameday <- rep(0,n_sims)
for ( i in 1:n_sims){
  probs_sameday[i] <- mean(pnorm(X%*%sim_betas[i,]))
}
*quantile(probs_sameday,probs = c(.025,.975))
```

```
##      2.5%     97.5%
## 0.6645399 0.7521849
```

---

```r
X = model.matrix(mod2)
head(X,3)
```

```
##   (Intercept) close male age
## 1           1    14    1  44
## 2           1    25    0  64
## 3           1    29    0  68
```

```r
X[,2] = 30 # closes 30 days out
```

Calculate the average predicted value for every draw of the simulated coefficients

```r
n_sims = nrow(sim_betas)
probs_30days <- rep(0,n_sims)
for ( i in 1:n_sims){
  probs_30days[i] <- mean(pnorm(X%*%sim_betas[i,]))
}
*quantile(probs_30days,probs = c(.025,.975))
```

```
##      2.5%     97.5%
## 0.6205824 0.6694931
```

---

```r
bind_rows(
  tibble(pred = probs_30days, cond = "30 Days"),
  tibble(pred = probs_sameday, cond = "Same Day")
) %>%
  ggplot(aes(pred,fill=cond)) +
  geom_density(alpha=.6,color="white")
```

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-48-1.png" style="display: block; margin: auto;" />

---

```r
difference = probs_30days-probs_sameday
```

Is the difference between the two predictions statistically significant?

```r
bounds = quantile(difference,probs = c(.025,.975))
tibble(
  lower = bounds[1],
  prob = mean(difference),
  higher = bounds[2]
)
```

```
## # A tibble: 1 x 3
##    lower    prob   higher
##    <dbl>   <dbl>    <dbl>
## 1 -0.120 -0.0648 -0.00550
```

--

<br><br>

**Just barely!**

---

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-51-1.png" style="display: block; margin: auto;" />
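---

A side note (not in the original deck): the simulation loops above can be written more compactly with `apply()`, looping over the rows of `sim_betas`:

```r
X <- model.matrix(mod2)
X[, 2] <- 0  # same-day registration scenario

# One average predicted probability per simulated coefficient vector
probs_sameday <- apply(sim_betas, 1, function(b) mean(pnorm(X %*% b)))

quantile(probs_sameday, probs = c(.025, .975))
```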
---

### `obsval`

That's pretty involved! Is there a package?

--

Of course...

```r
# Install package from GitHub (makes calculating sim values easy)
devtools::install_github("chrismeserole/obsval")
require(obsval)
```

--

```r
mod4 <- obsval::obsval(vote~close+male+age,
                       data=elect,
                       reg.model = "probit",
                       effect.var = "close",
                       effect.vals = c(0,10,15,30,60,90))
apply(mod4$preds,2,mean)
```

```
##     var_0    var_10    var_15    var_30    var_60    var_90
## 0.7101687 0.6891826 0.6783802 0.6448847 0.5743454 0.5021520
```

---

```r
as_tibble(mod4$preds) %>%
  gather(var,val) %>%
  ggplot(aes(var,val)) +
  geom_boxplot(size=1) +
  labs(y="Probability of Voting",
       x = "Close of Voter Registration") +
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16))
```

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-54-1.png" style="display: block; margin: auto;" />

---

```r
elect %>%
  ggplot(aes(factor(close))) +
  geom_bar() +
  labs(x = "Close of Voter Registration",y="Count")
```

<img src="week-12_ppol561_files/figure-html/unnamed-chunk-55-1.png" style="display: block; margin: auto;" />

---

## Note on the `obsval` output

```r
# stored as a list
class(mod4)
```

```
## [1] "list"
```

```r
# many different features of the prediction
names(mod4)
```

```
##  [1] "model"           "sim.coefs"       "preds"           "means"
##  [5] "low.ci"          "high.ci"         "control.preds"   "control.mean"
##  [9] "control.low.ci"  "control.high.ci" "effect.preds"    "effect.mean"
## [13] "effect.low.ci"   "effect.high.ci"  "effect_sum"      "effect.var"
## [17] "reg.model"
```

```r
# The N simulated probabilities for each manipulation
dim(mod4$preds)
```

```
## [1] 1000    6
```

---

## Note on the `obsval` output

```r
# Holds all the relevant summary values
tibble(cond = colnames(mod4$means),
       est = as.numeric(mod4$means),
       low = as.numeric(mod4$low.ci),
       high = as.numeric(mod4$high.ci))
```

```
## # A tibble: 6 x 4
##   cond     est   low  high
##   <chr>  <dbl> <dbl> <dbl>
## 1 var_0  0.710 0.662 0.755
## 2 var_10 0.689 0.658 0.720
## 3 var_15 0.678 0.654 0.703
## 4 var_30 0.645 0.622 0.669
## 5 var_60 0.574 0.495 0.656
## 6 var_90 0.502 0.364 0.650
```
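---

## Note on the `obsval` output

One optional last step (a sketch; not from the original deck): the summary tibble on the previous slide can be turned into a plot of the predicted probabilities and their 95% intervals.

```r
tibble(cond = colnames(mod4$means),
       est  = as.numeric(mod4$means),
       low  = as.numeric(mod4$low.ci),
       high = as.numeric(mod4$high.ci)) %>%
  ggplot(aes(x = cond, y = est, ymin = low, ymax = high)) +
  geom_pointrange() +  # point estimate with interval whiskers
  labs(x = "Close of Voter Registration",
       y = "Predicted Probability of Voting")
```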