import pandas as pd
import numpy as np
import numpy.linalg as la
import seaborn as sns
import matplotlib.pyplot as plt
import requests
import warnings
warnings.filterwarnings("ignore")
def download_data(git_loc, dest_name):
    '''
    Download data from Github and save to the notebook's working directory.
    '''
    req = requests.get(git_loc)
    with open(dest_name, "w") as file:
        file.write(req.text)
download_data('https://raw.githubusercontent.com/edunford/ppol564/master/lectures/lecture_17/icews-repressive-actions-count.csv',
"icews-repressive-actions-count.csv")
The following data are drawn from the ICEWS event data set. ICEWS is a machine-coding pipeline that scans an international database of news articles and codes international and domestic events in real time. The project seeks to classify a broad range of political activity for the purpose of predicting political instability. For more on ICEWS and the CAMEO coding scheme in which events are catalogued and binned, see here and the codebook at Harvard's Dataverse.
ICEWS seeks to track everything political. For today's purposes, I've filtered and cleaned these data to include only indicators of "repressive" behavior by states toward their domestic populations for five countries from 1995 to 2014: China, Nigeria, United Kingdom, United States, and Zimbabwe. The data have been aggregated to the country-year.
In a way, we can think of these data as the observable record of a state's repressive activities toward its domestic population. We want to use these data to build a yearly measure of state repressive behavior.
dat = pd.read_csv('icews-repressive-actions-count.csv')
print(f'''
There are {dat.shape[0]} observations in the data and {dat.shape[1]} variables
''')
dat.head()
As you'll quickly note, there are a large number of variables in these data. Some of the variables correspond with the unit of analysis (e.g. country and year), whereas others correspond with important covariates (e.g. polity2 and regime_type). The remaining variables, however, are counts of specific repressive event types as coded by ICEWS' automated CAMEO coding scheme.
These variables are represented as counts, where the reported integer corresponds to the number of times that event type occurred within that respective country-year.
Let's list the full set of variables contained in the dataset. As reported above, there are 73 variables that capture state repressive behavior.
# Let's subset our data to only include the repressive indicators
D_sub = dat.drop(['country','year','polity2','regime_type'],axis=1)
D_sub.columns
It should become immediately clear that we have a problem here. There are too many variables seeking to capture different dimensions of state repression. This is even more of a problem given the limited number of observations that we have.
If we were to use every variable, we'd run into estimation problems, but if we were to only use one or two repression dimensions, then we'd essentially be throwing away important information about the concept we care about.
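To make the imbalance concrete, here is a quick comparison of observations to indicators (a minimal sketch using the objects already defined above):
# Compare the number of observations to the number of repressive indicators
print(f"{D_sub.shape[0]} observations vs. {D_sub.shape[1]} repressive indicators")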
Below we'll explore dimension reduction using nothing more than the linear algebra concepts that we've covered thus far in class. Specifically, the eigendecomposition!
corr = D_sub.corr()
corr.round(2)
plt.figure(figsize=(15,15))
# Heat map of the correlations...
g = sns.heatmap(corr, cmap='coolwarm')
#Apply ticks
g = plt.xticks(range(len(corr.columns)), corr.columns)
g = plt.yticks(range(len(corr.columns)), corr.columns)
What can we say from the correlation of these variables?
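One way to put a number on what the heat map shows (a minimal sketch, using the corr object from above): the average absolute off-diagonal correlation summarizes how strongly these indicators move together.
# Average absolute pairwise correlation, ignoring the diagonal of 1s
off_diag = corr.values[~np.eye(len(corr), dtype=bool)]
print(f"Mean |correlation| across indicator pairs: {np.abs(off_diag).mean():.2f}")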
As we saw last time, we can decompose a square matrix into a matrix of eigenvectors and eigenvalues. Here we will use our covariance matrix $\Sigma$
$$ \Sigma = \textbf{V}\Lambda\textbf{V}^{-1} $$
Where $\textbf{V}$ is a matrix of eigenvectors and $\Lambda$ is a diagonal matrix of eigenvalues.
When decomposing a covariance matrix, we are aiming to find the eigenvectors that maximize the variance of the data. In a sense, the eigenvector with the largest eigenvalue tells most of the story in the data. What we are interested in is finding a single vector that broadly conveys most of the information without having to use all of the data.
# Decompose the covariance matrix
sigma = D_sub.cov()
evals,evecs = la.eig(sigma)
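As a quick sanity check (a minimal sketch using the objects just created), we can confirm that the eigenvectors and eigenvalues reproduce the covariance matrix, i.e. that $\Sigma = \textbf{V}\Lambda\textbf{V}^{-1}$ holds numerically:
# Rebuild sigma as V diag(lambda) V^{-1} and compare to the original
reconstructed = evecs.dot(np.diag(evals)).dot(la.inv(evecs))
print(np.allclose(reconstructed, sigma.values))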
Let's look at our eigenvalues. What do these tell us?
evals.round(3)
Let's think back to what an eigenvalue is: it's the degree to which our eigenvector is being scaled up or down. Given that a covariance matrix is positive semi-definite (why is this?), the eigenvalues here are all non-negative, so the eigenvectors are only scaled, never reflected.
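To see the non-negativity claim in the numbers, here is a minimal sketch (recall that for any weight vector $w$, $w^{T}\Sigma w = \text{Var}(\textbf{X}w) \geq 0$, which is why the covariance matrix is positive semi-definite):
# All eigenvalues of a covariance matrix should be >= 0 (up to floating-point error)
print(np.all(evals.real >= -1e-8))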
What can we say about the eigenvectors with extremely small eigenvalues?
Let's change how we evaluate the eigenvalues. Rather than thinking of them as scalars, let's think of them as weights. These weights determine the importance of a particular eigenvector. We want to choose the eigenvector that explains the most variation (i.e. tells most of our data's story!).
# Let's convert our eigenvalues into proportions
variance_explained = evals/sum(abs(evals))
plt.figure(figsize=(10,5))
plt.plot(variance_explained,marker='o')
plt.xlabel("Components")
plt.ylabel("Proportion of Explained Variation")
plt.ylim(0,1)
plt.show()
print('Proportion of Variation Explained')
for i, val in enumerate(variance_explained):
    print(f'''
    Eigenvalue {i+1} accounts for {round(val*100,2)}% of the variance
    ''')
Let's change our language a bit here. Rather than referring constantly to the eigenvalue/eigenvector, let's call the combination a "principal component".
The above plot and printout show that most of the variance (97%) can be explained by the first three eigenvectors. The rest of the eigenvectors account for very little of the variation; that is, their eigenvalues barely scale the eigenvectors.
We can read this as saying that there are really only three underlying dimensions that explain the variation in the data matrix.
This is useful because we can reduce our 73 indicators down to three principal components that retain and account for the underlying variation in the data.
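As a quick check on the figures quoted above, a minimal sketch of the cumulative share of variance (note: np.linalg.eig does not guarantee that the eigenvalues come back sorted, so we sort in descending order first):
# Cumulative proportion of variance explained by the largest eigenvalues
sorted_vals = np.sort(evals.real)[::-1]
cum_share = np.cumsum(sorted_vals) / sorted_vals.sum()
print(cum_share[:3].round(3))  # after 1, 2, and 3 components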
We can now project our data down into three dimensions using our eigenvectors/eigenvalues. Let's subset our matrix of eigenvectors that correspond with the first three eigenvalues that explain the most variation in the data.
The eigenvectors can be thought of as weights that describe how each indicator loads onto the underlying dimension.
$$ \textbf{X}_{n \times p} \textbf{V}^*_{p \times s} = \textbf{S}_{n \times s}$$
where $\textbf{X}$ is our original $n \times p$ data matrix, $\textbf{V}^*$ is the $p \times s$ matrix containing the $s$ columns of $\textbf{V}$ associated with the largest eigenvalues (with $s < p$), and $\textbf{S}$ is the resulting $n \times s$ matrix of projected scores.
# Extract the first three eigenvectors
weights = evecs[:,[0,1,2]]
weights.shape
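To get a sense of what the first component is picking up, we can inspect its loadings, i.e. the weight each indicator receives. A minimal sketch (it assumes the first column of evecs corresponds to the largest eigenvalue, which, as noted above, np.linalg.eig does not strictly guarantee):
# Indicators with the largest absolute loadings on the first component
loadings = pd.Series(weights[:, 0].real, index=D_sub.columns)
loadings.abs().sort_values(ascending=False).head(10)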
We can now project our data onto these weights to create three new variables, one for each of the retained dimensions.
X = D_sub.values # Convert data to the numerical matrix
reduced_data = X.dot(weights) # Project data onto the eigenvectors
reduced_data.shape
# Let's save these components to our data frame
def standardize(x):
    '''
    Set the mean equal to 0 and the variance equal to 1
    '''
    return (x - np.mean(x)) / np.std(x)
dat['comp_1'] = standardize(reduced_data[:,0])
dat['comp_2'] = standardize(reduced_data[:,1])
dat['comp_3'] = standardize(reduced_data[:,2])
dat2 = dat[['country','year','polity2','regime_type','comp_1','comp_2','comp_3']]
dat2.head()
g = sns.pairplot(dat2[['comp_1','comp_2','comp_3','regime_type']],hue="regime_type",height=4)
f, axes = plt.subplots(1, 3,figsize=(15,5))
g = sns.boxplot(y="comp_1", x="regime_type", data=dat2,ax=axes[0])
g = sns.boxplot(y="comp_2", x="regime_type", data=dat2,ax=axes[1])
g = sns.boxplot(y="comp_3", x="regime_type", data=dat2,ax=axes[2])
The above poses an interesting question: is the reduced data still useful even if we can't associate the collapsed dimensions with a concept?