import pandas as pd
import numpy as np
import numpy.linalg as la
import seaborn as sns
import matplotlib.pyplot as plt
import requests
import warnings
warnings.filterwarnings("ignore")
def download_data(git_loc, dest_name):
    '''
    Download data from GitHub and save it to the notebook's working directory.
    '''
    req = requests.get(git_loc)
    with open(dest_name, "w") as file:
        file.write(req.text)
download_data('https://raw.githubusercontent.com/edunford/ppol564/master/lectures/lecture_17/afg-security-survey-2008.csv',
"afg-security-survey-2008.csv")
The Afghanistan Nationwide Quarterly Assessment Research (ANQAR) survey collects information from a subnational random sample of Afghan respondents from Sept. 2008 to May 2015, resulting in a total of 32 survey waves. The survey gathers information on a variety of metrics, including, for example, perceptions of the government, respondents' quality of life, the provision of services, and views on the occupying forces in the country.
The ANQAR survey data randomly samples respondents from various administrative districts and provinces in Afghanistan across the relevant time period (see the ANQAR codebook for specific details on how the data were collected). Respondents are situated in a total of 426 districts, nested within 34 provinces, which are in turn housed in 7 broad regions.
The following data contain aggregated responses to questions regarding security for each district sampled during the survey's initial wave around September 2008. Five survey response questions (hereafter "items") are retained in total.
In addition, event data regarding the occurrence of violence within each Afghan province are retained. The data on event occurrences were generated from reports of activity by the U.S. Army.
data = pd.read_csv('afg-security-survey-2008.csv')
data.head()
cols = data.columns[data.columns.str.contains('Q')]
items = data[cols]
S = items.corr()
print("Correlation Table of Security Items")
S.round(2)
As we saw last time, we can decompose a square matrix into a matrix of eigenvectors and a diagonal matrix of eigenvalues. Here we will use our correlation matrix $\rho$:
$$ \rho = \textbf{V}\Lambda\textbf{V}^{-1} $$
Where $\textbf{V}$ is a matrix of eigenvectors and $\Lambda$ is a diagonal matrix of eigenvalues.
When decomposing a covariance matrix, we aim to find the eigenvectors that maximize the variance of the data. In a sense, the eigenvector with the largest eigenvalue tells most of the story in the data. What we are interested in is finding a single vector that broadly conveys most of the information without having to use all of the data.
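To see why, consider projecting the (standardized) data onto a unit-length weight vector $\textbf{v}$: the variance of the projection is $\textbf{v}^\top \rho \, \textbf{v}$, and maximizing it subject to the unit-length constraint leads directly to the eigenvector equation,

$$ \max_{\textbf{v}}\; \textbf{v}^\top \rho \, \textbf{v} \quad \text{subject to} \quad \textbf{v}^\top\textbf{v} = 1 \quad\Longrightarrow\quad \rho\,\textbf{v} = \lambda\,\textbf{v} $$

so the variance captured along $\textbf{v}$ equals its eigenvalue $\lambda$, and the eigenvector paired with the largest eigenvalue points in the direction of greatest variation.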
evals,evecs = la.eig(S)
evals
evecs
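Note that numpy's eig() does not promise any particular ordering of the eigenvalues, so it is worth confirming which column of $\textbf{V}$ belongs to which eigenvalue. A minimal sketch (the reconstruction check and the sorted copies below are illustrative additions, not part of the original analysis):

# Sanity check: V diag(lambda) V^{-1} should reconstruct the correlation matrix
print(np.allclose(evecs @ np.diag(evals) @ la.inv(evecs), S.values))  # expect True

# eig() returns eigenvalues in no guaranteed order; for a real symmetric matrix
# they are real, so sort the pairs from largest to smallest eigenvalue
order = np.argsort(evals)[::-1]
evals_sorted = evals[order]
evecs_sorted = evecs[:, order]   # reorder the matching eigenvector columns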
variance_explained = evals/sum(abs(evals))
# Print out variance figures
print('Proportion of Variation Explained')
for i, val in enumerate(variance_explained):
    print(f'''
    Eigenvalue {i+1} accounts for {round(val*100,2)}% of the variance
    ''')
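It is also common to look at the running total. A quick supplementary sketch using numpy's cumulative sum:

# Cumulative share of variation captured as eigenvalues are added, largest first
cumulative = np.cumsum(variance_explained)
for i, val in enumerate(cumulative):
    print(f'First {i+1} eigenvalue(s) account for {round(val*100,2)}% of the variance')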
# Plot figure
plt.figure(figsize=(12,5))
plt.plot(variance_explained,marker='o')
plt.xticks([0,1,2,3,4],[1,2,3,4,5])
plt.xlabel("Eigenvalues")
plt.ylabel("Proportion of Explained Variation")
plt.show()
We can now pick the eigenvectors that correspond to the largest eigenvalues and project our data onto them to generate a new scale. Again, every matrix multiplication is a transformation. Here we are using the eigenvectors as weights, projecting our observed data onto the eigenvectors that explain most of the variation in the data.
$$ \textbf{X}_{n\times p}\textbf{v}_{p\times 1}^* = \textbf{s}_{n\times 1}$$
where $\textbf{X}$ is our original ${n\times p}$ data matrix, $\textbf{v}^*$ is the eigenvector that corresponds to our largest eigenvalue, and $\textbf{s}$ is the resulting column vector with our projected scores for that eigenvector.
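Because the eigenvectors sit in the columns of $\textbf{V}$, the scores for the top two eigenvectors can also be produced in a single multiplication. A small equivalent sketch, assuming (as the code below does) that the first two columns of evecs pair with the two largest eigenvalues; the name scores is introduced here purely for illustration:

# (n x 5) items matrix times the (5 x 2) block of leading eigenvectors -> (n x 2) scores
scores = items.values @ evecs[:, :2]
# scores[:, 0] and scores[:, 1] match the 'threat' and 'security' columns created below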
data['threat'] = items.dot(evecs[:,0])
data['security'] = items.dot(evecs[:,1])
plt.figure(figsize=(12,5))
plt.hist(data['threat'],color="steelblue",
bins=25,alpha=.5)
plt.hist(data['security'],color="forestgreen",
bins=25,alpha=.5)
plt.xlabel('Scale')
plt.show()
def standardize(x):
    '''
    Set mean equal to 0 and variance equal to 1
    '''
    return (x - np.mean(x)) / np.std(x)
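A quick sanity check on a throwaway vector (not part of the survey data) confirms the function behaves as advertised:

# Standardized values should have mean 0 and standard deviation 1
z = standardize(np.array([2.0, 4.0, 6.0, 8.0]))
print(round(z.mean(), 2), round(z.std(), 2))   # expect 0.0 and 1.0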
# Standardize Scales
data['threat'] = standardize(data['threat'])
data['security'] = standardize(data['security'])
plt.figure(figsize=(12,5))
plt.hist(data['threat'],color="steelblue",
bins=25,alpha=.25)
plt.hist(data['security'],color="forestgreen",
bins=25,alpha=.25)
plt.xlabel('Scale')
plt.show()
plt.figure(figsize=(12,5))
g = sns.scatterplot(x='threat',y='security',data=data)
Any scale of security/threat should be correlated with the actual occurrence of violence within a region. Let's explore our metrics to see whether this relationship holds, using the event_count and occurrence variables (drawn from the SIGACTS data).
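Before plotting, a quick numeric version of the same check, correlating each scale with the raw event counts (a supplementary sketch using columns already in the data):

# Pairwise correlations between the generated scales and observed violent events
data[['threat', 'security', 'event_count']].corr().round(2)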
f, axes = plt.subplots(1, 2,figsize=(12,5))
g = sns.boxplot(x='occurrence',y='threat',data=data,ax=axes[0])
g = sns.boxplot(x='occurrence',y='security',data=data,ax=axes[1])
f, axes = plt.subplots(1, 2,figsize=(12,5))
g = sns.regplot(x='event_count',y='threat',data=data,x_jitter=.1,ax=axes[0])
g = sns.regplot(x='event_count',y='security',data=data,x_jitter=.1,ax=axes[1])