PPOL564 | DS1: Foundations

Lecture 25

Simulation

Today's Topics

  • Simulating data generating processes
  • Agent-Based Models
In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st 
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import statsmodels.discrete.discrete_model as smd
import warnings
import requests
warnings.filterwarnings("ignore")
plt.style.use('ggplot')

Simulation

Simulating a data generating process for a linear model
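
Concretely, the cell below draws two independent standard-normal predictors and a normal error term, then builds the outcome from a linear model with an interaction:

$$y_i = 1 + 0.5\,x_{1i} - 2\,x_{2i} + 3\,x_{1i}x_{2i} + \epsilon_i, \qquad x_{1i},\, x_{2i},\, \epsilon_i \sim N(0,1)$$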

In [2]:
np.random.seed(5)
N = 1000
x1 = np.random.normal(size=N)
x2 = np.random.normal(size=N)
error = np.random.normal(size=N)
y = 1 + .5*x1 + -2*x2 + 3*x1*x2 + error
D = pd.DataFrame(dict(y=y,x1 = x1, x2 = x2))
D.head()
Out[2]:
          y        x1        x2
0  1.226925  0.441227  0.542769
1  1.935195 -0.330870  0.400912
2  3.962403  2.430771  0.719705
3  1.224431 -0.252092 -0.021605
4  1.130109  0.109610  0.024040
In [3]:
sm.ols('y ~ x1 + x2 + x1*x2', data=D).fit().summary()
Out[3]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.928
Model:                            OLS   Adj. R-squared:                  0.928
Method:                 Least Squares   F-statistic:                     4305.
Date:                Wed, 04 Dec 2019   Prob (F-statistic):               0.00
Time:                        07:52:27   Log-Likelihood:                -1425.5
No. Observations:                1000   AIC:                             2859.
Df Residuals:                     996   BIC:                             2879.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.9721      0.032     30.445      0.000       0.909       1.035
x1             0.5047      0.032     15.657      0.000       0.441       0.568
x2            -2.0575      0.032    -64.081      0.000      -2.120      -1.994
x1:x2          3.0042      0.031     97.633      0.000       2.944       3.065
==============================================================================
Omnibus:                        4.346   Durbin-Watson:                   1.960
Prob(Omnibus):                  0.114   Jarque-Bera (JB):                3.615
Skew:                          -0.051   Prob(JB):                        0.164
Kurtosis:                       2.724   Cond. No.                         1.11
==============================================================================


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Monte Carlo Simulation

Let's simulate the data many times, retaining the estimated coefficients from each iteration.

In [4]:
coefs = []
sims = 1000
N = 100
b = [1,.5,-2,3]
for i in range(sims):
    x1 = np.random.normal(size=N)
    x2 = np.random.normal(size=N)
    error = np.random.normal(size=N)
    y = b[0] + b[1]*x1 + b[2]*x2 + b[3]*x1*x2 + error
    D = pd.DataFrame(dict(y=y,x1 = x1, x2 = x2))
    mod = sm.ols('y ~ x1 + x2 + x1*x2', data=D).fit()
    coefs.append(mod.params.tolist())

# Convert to an array of simulated coefficient estimates
coefs = np.array(coefs)
In [5]:
fig, axs = plt.subplots(4,1, sharex=True, sharey=True,figsize=(10,8))
sns.distplot(coefs[:,0],kde=False,ax=axs[0])
sns.distplot(coefs[:,1],kde=False,ax=axs[1])
sns.distplot(coefs[:,2],kde=False,ax=axs[2])
sns.distplot(coefs[:,3],kde=False,ax=axs[3])

# Plot the position of the "true" coefficient values. 
axs[0].axvline(b[0],color="darkblue")
axs[1].axvline(b[1],color="darkblue")
axs[2].axvline(b[2],color="darkblue")
axs[3].axvline(b[3],color="darkblue")

# Plot the titles for each subplot
axs[0].title.set_text('Intercept')
axs[1].title.set_text('x1')
axs[2].title.set_text('x2')
axs[3].title.set_text('x1*x2')
fig.show()
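
As a quick numeric check (a minimal sketch that assumes the `coefs` array and true values `b` from the cells above are still in memory), the mean of each coefficient's sampling distribution should sit near the truth, and its standard deviation approximates the standard error:

# Summarize the simulated sampling distributions (assumes `coefs` and `b` from above)
pd.DataFrame({"true": b,
              "mean_estimate": coefs.mean(axis=0),
              "std_of_estimates": coefs.std(axis=0)},
             index=["Intercept", "x1", "x2", "x1:x2"])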

Monte Carlo Simulation of a Statistical Error: omitted variable bias

Let's now simulate the impact of omitted variable bias. Suppose there is some unobserved variable $z$ that is correlated with $x_1$, $x_2$, and the outcome. How would omitting it bias our estimates?

In the code below everything is the same as before, except that we add a variable $z$ that is correlated with the outcome and both predictors, and then fit the model without it.
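
The textbook omitted variable bias result gives the intuition. If the true model were $y = \beta_0 + \beta_1 x_1 + \gamma z + \epsilon$ but we regressed $y$ on $x_1$ alone, then

$$\hat{\beta}_1 \overset{p}{\longrightarrow} \beta_1 + \gamma\,\frac{\text{Cov}(x_1, z)}{\text{Var}(x_1)}$$

Our simulation has two correlated predictors and an interaction, so the exact expression is messier, but the logic is the same: whenever $z$ affects $y$ ($\gamma \neq 0$) and is correlated with an included predictor, that predictor's coefficient absorbs part of $z$'s effect.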

In [6]:
coefs = []
sims = 1000
N = 100
b = [1,.5,-2,3]
for i in range(sims):
    z = np.random.normal(1,3,size=N) 
    x1 = np.random.normal(size=N) + z
    x2 = np.random.normal(size=N) - z
    error = np.random.normal(size=N)
    y = b[0] + b[1]*x1 + b[2]*x2 + b[3]*x1*x2 + 5*z  + error
    D = pd.DataFrame(dict(y=y,x1 = x1, x2 = x2))
    mod = sm.ols('y ~ x1 + x2 + x1*x2', data=D).fit()
    coefs.append(mod.params.tolist())

# Convert to an array of simulated coefficient estimates
coefs = np.array(coefs)

As the plots below show, $z$ is a confounding variable: by leaving it out of the model, we bias our estimates of $\beta_1$ and $\beta_2$.

In [7]:
fig, axs = plt.subplots(4,1, sharex=True, sharey=True,figsize=(10,8))
sns.distplot(coefs[:,0],kde=False,ax=axs[0])
sns.distplot(coefs[:,1],kde=False,ax=axs[1])
sns.distplot(coefs[:,2],kde=False,ax=axs[2])
sns.distplot(coefs[:,3],kde=False,ax=axs[3])

# Plot the position of the "true" coefficient values. 
axs[0].axvline(b[0],color="darkblue")
axs[1].axvline(b[1],color="darkblue")
axs[2].axvline(b[2],color="darkblue")
axs[3].axvline(b[3],color="darkblue")

# Plot the titles for each subplot
axs[0].title.set_text('Intercept')
axs[1].title.set_text('x1')
axs[2].title.set_text('x2')
axs[3].title.set_text('x1*x2')
fig.show()
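
To put a number on the bias visible in the plots (again assuming `coefs` and `b` from the cells above), we can compare the average estimate to the true value:

# Average bias of each coefficient across the simulations (mean estimate minus truth)
bias = coefs.mean(axis=0) - np.array(b)
print(pd.Series(bias, index=["Intercept", "x1", "x2", "x1:x2"]))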

Simulating Different Types of Outcomes

Binary Outcome
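
The cell below keeps the same recipe but pushes the linear predictor through the standard normal CDF (a probit link) and draws Bernoulli outcomes from the resulting probabilities:

$$\Pr(y_i = 1 \mid x_i) = \Phi(x_i^\top \beta), \qquad y_i \sim \text{Bernoulli}\big(\Phi(x_i^\top \beta)\big)$$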

In [8]:
np.random.seed(12345)
N = 1000 # Number of observations

# Linear combination of predictors
x1 = np.random.normal(size=N)            # One continuous predictor 
x2 = np.random.binomial(1,p=.2,size=N)   # One binary predictor 
X = np.column_stack([np.ones(N),x1,x2])  # Compose as a matrix

B = np.array([.5,.7,-2])                 # True Coefficients
mu  = X.dot(B)                           # Linear predictor (the probability is a function of x1 and x2)

# Convert the linear predictor into a probability via the standard normal CDF (probit link)
pr = st.norm.cdf(mu,0,1) 

# Convert into a binary outcome using the Bernoulli distribution
y = np.random.binomial(n=1,p = pr)

# Visualize
sns.pairplot(pd.DataFrame(dict(y=y,x1=x1,x2=x2)),height=3)
plt.show()
In [9]:
smd.Probit(y,X).fit().summary()
Optimization terminated successfully.
         Current function value: 0.489036
         Iterations 6
Out[9]:
                          Probit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                         Probit   Df Residuals:                      997
Method:                           MLE   Df Model:                            2
Date:                Wed, 04 Dec 2019   Pseudo R-squ.:                  0.2895
Time:                        07:52:58   Log-Likelihood:                -489.04
converged:                       True   LL-Null:                       -688.34
                                        LLR p-value:                 2.782e-87
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.5017      0.051      9.862      0.000       0.402       0.601
x1             0.7307      0.058     12.701      0.000       0.618       0.843
x2            -2.0212      0.150    -13.473      0.000      -2.315      -1.727
==============================================================================

Count Outcome
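
For a count outcome, the linear predictor is exponentiated so the implied rate is strictly positive, and outcomes are drawn from a Poisson distribution:

$$\lambda_i = \exp(x_i^\top \beta), \qquad y_i \sim \text{Poisson}(\lambda_i)$$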

In [10]:
np.random.seed(12345)
N = 1000 # Number of observations

# Linear combination of predictors
x1 = np.random.normal(size=N)            # One continuous predictor 
x2 = np.random.binomial(1,p=.2,size=N)   # One binary predictor 
X = np.column_stack([np.ones(N),x1,x2])  # Compose as a matrix
                     
B = np.array([.5,.7,-2])                 # True Coefficients
mu  = X.dot(B)                           # Linear predictor (the rate is a function of x1 and x2)

# Convert the linear predictor into a rate (i.e. strictly positive) via the exponential link
lamb = np.exp(mu) 

# Convert into a count outcome using the Poisson distribution
y = np.random.poisson(lam=lamb)

# Visualize
sns.pairplot(pd.DataFrame(dict(y=y,x1=x1,x2=x2)),height=3)
plt.show()
In [11]:
smd.Poisson(y,X).fit().summary()
Optimization terminated successfully.
         Current function value: 1.414652
         Iterations 7
Out[11]:
                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                        Poisson   Df Residuals:                      997
Method:                           MLE   Df Model:                            2
Date:                Wed, 04 Dec 2019   Pseudo R-squ.:                  0.2950
Time:                        07:53:00   Log-Likelihood:                -1414.7
converged:                       True   LL-Null:                       -2006.5
                                        LLR p-value:                9.449e-258
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.5047      0.029     17.409      0.000       0.448       0.562
x1             0.6753      0.023     29.288      0.000       0.630       0.721
x2            -1.8784      0.126    -14.852      0.000      -2.126      -1.631
==============================================================================

Agent-Based Models

"An agent-based model (ABM) is a class of computational models for simulating the actions and interactions of autonomous agents (both individual or collective entities such as organizations or groups) with a view to assessing their effects on the system as a whole. It combines elements of game theory, complex systems, emergence, computational sociology, multi-agent systems, and evolutionary programming. Monte Carlo methods are used to introduce randomness." (Wiki)

A walkthrough of how to construct Schelling's model in Python:
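
The walkthrough itself isn't reproduced here, but the following minimal sketch conveys the model's logic. The grid size, share of empty cells, similarity threshold, and number of sweeps are illustrative choices, not values from the walkthrough: agents of two types occupy a grid, any agent with too few like-type neighbors is "unhappy," and unhappy agents relocate to random empty cells until (ideally) everyone is satisfied.

# A minimal sketch of Schelling's segregation model (illustrative parameters)
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)
size = 30          # grid is size x size
empty_frac = 0.1   # fraction of empty cells
threshold = 0.4    # an agent is happy if >= 40% of its occupied neighbors share its type
n_steps = 30       # number of sweeps over unhappy agents

# 0 = empty, 1 = type A, 2 = type B
cells = np.random.choice([0, 1, 2], size=(size, size),
                         p=[empty_frac, (1 - empty_frac) / 2, (1 - empty_frac) / 2])

def is_unhappy(grid, i, j):
    """An occupied cell is unhappy if too few of its occupied neighbors match its type."""
    agent = grid[i, j]
    if agent == 0:
        return False
    block = grid[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2].ravel()
    same = np.sum(block == agent) - 1       # exclude the agent itself
    occupied = np.sum(block != 0) - 1
    if occupied == 0:
        return False
    return same / occupied < threshold

for step in range(n_steps):
    unhappy = [(i, j) for i in range(size) for j in range(size) if is_unhappy(cells, i, j)]
    empties = list(zip(*np.where(cells == 0)))
    np.random.shuffle(unhappy)
    for (i, j) in unhappy:
        if not empties:
            break
        # Move the unhappy agent to a randomly chosen empty cell
        k = np.random.randint(len(empties))
        new_i, new_j = empties.pop(k)
        cells[new_i, new_j] = cells[i, j]
        cells[i, j] = 0
        empties.append((i, j))

plt.imshow(cells, cmap="viridis")
plt.title("Schelling segregation model after {} sweeps".format(n_steps))
plt.axis("off")
plt.show()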