PPOL564 - Data Science I: Foundations

Lecture 17

Eigen Decompositions as Data Reduction

In [1]:
import pandas as pd
import numpy as np
import numpy.linalg as la
import seaborn as sns
import matplotlib.pyplot as plt
import requests
import warnings
warnings.filterwarnings("ignore")

def download_data(git_loc,dest_name):
    '''
    Download data from GitHub and save to the notebook's working directory.
    '''
    req = requests.get(git_loc)
    with open(dest_name,"w") as file:
        file.write(req.text)
            
download_data('https://raw.githubusercontent.com/edunford/ppol564/master/lectures/lecture_17/icews-repressive-actions-count.csv',
              "icews-repressive-actions-count.csv")

Data

The following data are drawn from the ICEWS event data set. ICEWS is a machine-coding project that scans an international database of news articles and codes international and domestic events in real time. The project seeks to classify a broad range of political activity for the purpose of predicting political instability. For more on ICEWS and the CAMEO coding scheme in which events are cataloged and binned, see here and the codebook at Harvard's Dataverse.

ICEWS seeks to track everything political. For today's purposes, I've filtered and cleaned these data down to indicators of "repressive" behavior by states toward their domestic populations for 5 countries from 1995 to 2014: China, Nigeria, United Kingdom, United States, and Zimbabwe. The data have been aggregated to the country-year.

In a way, we can think of these data as the observable record of each state's repressive activities toward its domestic population. We want to use these data to build a measure of state repressive behavior by year.

In [2]:
dat = pd.read_csv('icews-repressive-actions-count.csv')
print(f'''
There are {dat.shape[0]} observations in the data and {dat.shape[1]} variables
''')
dat.head()
There are 100 observations in the data and 77 variables

Out[2]:
country year polity2 regime_type abduct,_hijack,_or_take_hostage accuse accuse_of_aggression accuse_of_crime,_corruption accuse_of_espionage,_treason accuse_of_human_rights_abuses ... threaten_to_reduce_or_break_relations threaten_to_reduce_or_stop_aid threaten_with_administrative_sanctions torture use_as_human_shield use_conventional_military_force use_tactics_of_violent_repression use_unconventional_violence veto violate_ceasefire
0 China 1995.0 -7 Non-democracy 0 12 0 0 1 1 ... 1 0 0 1 0 8 6 1 0 0
1 China 1996.0 -7 Non-democracy 0 29 0 0 0 3 ... 0 0 0 0 0 29 5 0 0 0
2 China 1997.0 -7 Non-democracy 2 28 0 2 0 1 ... 0 0 0 2 0 22 7 1 0 0
3 China 1998.0 -7 Non-democracy 1 22 0 0 0 2 ... 0 0 0 0 0 27 18 1 0 0
4 China 1999.0 -7 Non-democracy 0 34 0 0 1 3 ... 0 0 0 2 0 74 13 0 0 0

5 rows × 77 columns

As you'll quickly note, there are a large number of variables in these data. Some of the variables correspond with the unit of analysis (e.g. country and year) whereas others correspond with important covariates (e.g. polity2 and regime_type). The remaining variables, however, are counts of specific repressive event types as coded by ICEWS' automated CAMEO coding scheme.

These variables are represented as counts, where the reported integer corresponds to the number of times that event type occurred within that respective country-year.
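As a quick illustration of how to work with these counts, the sketch below (which uses nothing beyond the dat frame loaded above) sums the event counts across all repressive event types and years to get the total volume of recorded events per country.

# Total recorded repressive events per country, summed over all event
# types and all years, just to get a feel for the raw volumes.
event_cols = dat.columns.drop(['country','year','polity2','regime_type'])
dat.groupby('country')[event_cols].sum().sum(axis=1)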

Let's list the full set of variables contained in the dataset. Of the 77 variables reported above, 73 report state repressive behavior once we drop the unit-of-analysis and covariate columns.

In [3]:
# Let's subset our data to only include the repressive indicators
D_sub = dat.drop(['country','year','polity2','regime_type'],axis=1)
In [4]:
D_sub.columns
Out[4]:
Index(['abduct,_hijack,_or_take_hostage', 'accuse', 'accuse_of_aggression',
       'accuse_of_crime,_corruption', 'accuse_of_espionage,_treason',
       'accuse_of_human_rights_abuses', 'accuse_of_war_crimes',
       'arrest,_detain,_or_charge_with_legal_action', 'assassinate',
       'attempt_to_assassinate', 'ban_political_parties_or_politicians',
       'bring_lawsuit_against', 'carry_out_suicide_bombing', 'coerce',
       'complain_officially', 'conduct_hunger_strike',
       'conduct_strike_or_boycott',
       'conduct_suicide,_car,_or_other_non-military_bombing',
       'confiscate_property', 'criticize_or_denounce', 'defy_norms,_law',
       'demonstrate_military_or_police_power', 'demonstrate_or_rally',
       'destroy_property', 'detonate_nuclear_weapons', 'employ_aerial_weapons',
       'engage_in_mass_killings', 'expel_or_deport_individuals',
       'expel_or_withdraw', 'expel_or_withdraw_peacekeepers', 'give_ultimatum',
       'halt_mediation', 'halt_negotiations',
       'impose_administrative_sanctions', 'impose_blockade,_restrict_movement',
       'impose_curfew', 'impose_embargo,_boycott,_or_sanctions',
       'impose_state_of_emergency_or_martial_law',
       'increase_military_alert_status', 'increase_police_alert_status',
       'kill_by_physical_assault', 'mobilize_or_increase_armed_forces',
       'mobilize_or_increase_police_power', 'obstruct_passage,_block',
       'occupy_territory', 'physically_assault', 'protest_violently,_riot',
       'rally_opposition_against', 'reduce_or_break_diplomatic_relations',
       'reduce_or_stop_economic_assistance',
       'reduce_or_stop_humanitarian_assistance',
       'reduce_or_stop_military_assistance', 'reduce_relations', 'reject',
       'reject_mediation', 'reject_proposal_to_meet,_discuss,_or_negotiate',
       'seize_or_damage_property', 'sexually_assault', 'threaten',
       'threaten_non-force',
       'threaten_to_ban_political_parties_or_politicians',
       'threaten_to_halt_negotiations',
       'threaten_to_impose_state_of_emergency_or_martial_law',
       'threaten_to_reduce_or_break_relations',
       'threaten_to_reduce_or_stop_aid',
       'threaten_with_administrative_sanctions', 'torture',
       'use_as_human_shield', 'use_conventional_military_force',
       'use_tactics_of_violent_repression', 'use_unconventional_violence',
       'veto', 'violate_ceasefire'],
      dtype='object')

It should become immediately clear that we have a problem here. There are too many variables seeking to capture different dimensions of state repression. This is even more of a problem given the limited number of observations that we have.

If we were to use every variable, we'd run into estimation problems, but if we were to only use one or two repression dimensions, then we'd essentially be throwing away important information about the concept we care about.

Below we'll explore dimension reduction using nothing more than the linear algebra concepts that we've covered thus far in class. Specifically, the eigendecomposition!

Exploring the Data: How do our repressive indicators relate to one another?

In [5]:
corr = D_sub.corr()
corr.round(2)
Out[5]:
abduct,_hijack,_or_take_hostage accuse accuse_of_aggression accuse_of_crime,_corruption accuse_of_espionage,_treason accuse_of_human_rights_abuses accuse_of_war_crimes arrest,_detain,_or_charge_with_legal_action assassinate attempt_to_assassinate ... threaten_to_reduce_or_break_relations threaten_to_reduce_or_stop_aid threaten_with_administrative_sanctions torture use_as_human_shield use_conventional_military_force use_tactics_of_violent_repression use_unconventional_violence veto violate_ceasefire
abduct,_hijack,_or_take_hostage 1.00 0.63 0.13 0.35 0.25 0.51 0.25 0.32 0.26 0.38 ... -0.01 0.22 0.29 0.75 0.28 0.71 -0.06 0.53 0.23 0.44
accuse 0.63 1.00 0.37 0.26 0.17 0.47 0.50 0.41 0.41 0.36 ... 0.28 0.49 0.26 0.45 0.29 0.79 -0.10 0.70 0.29 0.12
accuse_of_aggression 0.13 0.37 1.00 -0.10 0.03 -0.03 0.81 -0.02 0.14 -0.05 ... 0.17 0.20 0.08 0.16 -0.02 0.34 -0.17 0.16 -0.01 0.09
accuse_of_crime,_corruption 0.35 0.26 -0.10 1.00 -0.01 -0.00 -0.06 0.37 0.12 0.33 ... 0.00 0.04 0.02 0.10 0.21 0.24 -0.00 0.21 -0.11 0.02
accuse_of_espionage,_treason 0.25 0.17 0.03 -0.01 1.00 0.19 0.18 -0.03 0.09 -0.02 ... 0.09 0.11 0.17 0.21 0.07 0.03 0.17 0.00 0.27 -0.07
accuse_of_human_rights_abuses 0.51 0.47 -0.03 -0.00 0.19 1.00 0.05 0.00 0.38 0.32 ... 0.13 0.42 0.04 0.35 0.39 0.40 -0.12 0.36 0.13 0.03
accuse_of_war_crimes 0.25 0.50 0.81 -0.06 0.18 0.05 1.00 0.07 0.20 -0.05 ... 0.14 0.17 0.14 0.20 0.09 0.38 -0.15 0.19 0.25 0.04
arrest,_detain,_or_charge_with_legal_action 0.32 0.41 -0.02 0.37 -0.03 0.00 0.07 1.00 0.11 0.24 ... 0.16 -0.03 0.31 0.21 0.07 0.34 0.45 0.22 0.12 0.02
assassinate 0.26 0.41 0.14 0.12 0.09 0.38 0.20 0.11 1.00 0.10 ... 0.19 0.57 0.34 0.14 0.15 0.43 -0.15 0.18 0.23 -0.02
attempt_to_assassinate 0.38 0.36 -0.05 0.33 -0.02 0.32 -0.05 0.24 0.10 1.00 ... 0.09 0.22 -0.02 0.21 -0.03 0.17 0.01 0.15 -0.03 -0.03
ban_political_parties_or_politicians 0.20 0.16 -0.04 0.70 -0.02 -0.16 -0.02 0.31 0.01 0.23 ... -0.07 -0.12 -0.09 -0.04 -0.05 0.05 -0.04 0.00 -0.05 0.08
bring_lawsuit_against 0.54 0.57 0.20 0.44 -0.02 0.29 0.18 0.32 0.20 0.25 ... 0.18 0.29 0.23 0.34 0.22 0.52 -0.20 0.41 0.06 0.18
carry_out_suicide_bombing 0.33 0.32 0.35 0.02 0.08 0.22 0.39 0.06 0.12 -0.04 ... 0.00 0.04 -0.02 0.17 0.01 0.29 -0.10 0.42 -0.04 0.09
coerce 0.47 0.43 0.04 0.34 -0.01 0.13 0.12 0.69 0.07 0.35 ... 0.08 0.03 0.23 0.32 0.05 0.43 0.31 0.30 0.11 0.13
complain_officially 0.36 0.38 0.03 0.10 0.21 0.22 0.17 0.37 0.21 0.02 ... 0.08 0.22 0.42 0.36 0.14 0.32 0.04 0.19 0.47 0.09
conduct_hunger_strike 0.33 0.33 -0.02 0.13 0.03 0.35 0.08 0.18 0.13 0.01 ... 0.02 0.04 -0.07 0.14 0.77 0.30 -0.07 0.39 -0.02 -0.01
conduct_strike_or_boycott 0.19 0.30 0.13 0.05 0.08 0.09 0.08 0.10 0.03 0.09 ... 0.19 0.18 0.02 0.07 0.01 0.19 -0.18 0.11 0.02 -0.01
conduct_suicide,_car,_or_other_non-military_bombing 0.20 0.48 0.27 -0.00 0.04 0.23 0.37 0.25 0.23 0.02 ... 0.02 0.15 0.28 0.11 0.13 0.33 -0.17 0.35 0.47 -0.08
confiscate_property 0.17 0.32 -0.04 -0.01 0.19 0.06 0.03 0.56 -0.03 0.07 ... 0.10 -0.07 0.33 0.24 0.05 0.32 0.41 0.39 0.11 0.11
criticize_or_denounce 0.58 0.90 0.46 0.15 0.16 0.47 0.56 0.29 0.45 0.21 ... 0.37 0.58 0.24 0.43 0.33 0.77 -0.17 0.61 0.24 0.15
defy_norms,_law 0.38 0.56 0.21 -0.05 0.18 0.50 0.29 0.19 0.37 0.39 ... 0.36 0.51 0.22 0.31 0.05 0.34 -0.07 0.25 0.16 -0.01
demonstrate_military_or_police_power 0.46 0.61 0.15 0.02 0.26 0.26 0.39 0.26 0.41 0.13 ... 0.12 0.38 0.55 0.40 0.00 0.42 -0.09 0.25 0.70 0.02
demonstrate_or_rally 0.21 0.36 0.13 0.07 -0.01 0.25 0.04 0.12 0.18 0.11 ... 0.20 0.25 0.15 0.12 0.09 0.31 -0.07 0.44 0.02 0.05
destroy_property 0.42 0.25 0.06 0.35 -0.08 0.03 0.07 0.28 -0.01 0.13 ... -0.01 -0.02 0.16 0.32 0.01 0.46 0.08 0.36 0.07 0.36
detonate_nuclear_weapons 0.04 0.39 0.30 0.04 0.01 -0.01 0.33 -0.02 0.14 0.04 ... 0.52 0.32 -0.05 0.03 0.06 0.19 -0.15 0.09 -0.03 -0.02
employ_aerial_weapons 0.57 0.59 0.32 0.01 0.02 0.30 0.37 0.09 0.21 0.08 ... 0.12 0.28 0.24 0.59 0.11 0.81 -0.14 0.62 0.16 0.46
engage_in_mass_killings 0.21 0.28 0.03 0.02 0.08 0.29 0.09 -0.04 0.10 0.22 ... 0.12 0.34 -0.01 0.16 -0.03 0.22 -0.16 0.14 0.21 0.06
expel_or_deport_individuals 0.48 0.56 0.05 0.16 0.10 0.38 0.09 0.26 0.20 0.30 ... 0.22 0.32 0.14 0.40 0.21 0.53 -0.01 0.60 0.10 0.16
expel_or_withdraw -0.01 -0.04 -0.10 -0.14 0.21 0.03 -0.07 0.10 -0.07 -0.07 ... -0.02 -0.15 0.01 0.01 0.10 -0.08 0.35 -0.01 -0.06 -0.05
expel_or_withdraw_peacekeepers 0.03 0.05 -0.06 -0.11 0.12 0.05 -0.07 -0.10 -0.03 0.09 ... -0.02 -0.03 -0.04 0.02 -0.05 0.03 0.00 0.05 0.00 0.05
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
obstruct_passage,_block -0.01 0.05 -0.02 0.04 -0.06 -0.06 -0.03 0.01 -0.04 -0.02 ... -0.06 -0.05 -0.04 -0.02 -0.02 0.00 -0.02 0.34 -0.02 -0.02
occupy_territory 0.24 0.55 0.14 0.09 -0.04 0.22 0.18 0.13 0.06 0.02 ... -0.04 0.26 0.07 0.25 0.29 0.66 0.01 0.81 -0.01 0.17
physically_assault 0.41 0.25 -0.07 0.16 0.30 0.17 -0.04 0.50 -0.04 0.17 ... -0.03 -0.09 0.37 0.52 0.10 0.31 0.52 0.29 0.16 0.19
protest_violently,_riot 0.45 0.39 0.16 0.25 0.09 0.22 0.27 0.24 0.15 0.03 ... 0.07 -0.02 0.17 0.35 0.55 0.45 -0.05 0.55 0.11 0.17
rally_opposition_against 0.39 0.74 0.44 0.07 0.12 0.52 0.44 0.12 0.49 0.35 ... 0.41 0.65 0.08 0.23 0.24 0.51 -0.24 0.41 0.06 -0.04
reduce_or_break_diplomatic_relations 0.52 0.56 0.10 0.03 0.26 0.61 0.23 0.09 0.34 0.30 ... 0.16 0.60 0.10 0.30 0.17 0.40 -0.18 0.31 0.14 -0.02
reduce_or_stop_economic_assistance 0.43 0.56 0.43 -0.00 0.05 0.43 0.40 -0.03 0.45 0.11 ... 0.19 0.72 0.19 0.43 0.09 0.67 -0.30 0.45 0.13 0.32
reduce_or_stop_humanitarian_assistance 0.11 0.28 0.29 -0.05 0.22 0.12 0.22 0.07 0.40 -0.03 ... 0.18 0.41 0.20 0.02 0.06 0.22 -0.06 0.04 0.04 -0.06
reduce_or_stop_military_assistance 0.21 0.50 0.13 -0.04 0.18 0.17 0.18 -0.02 0.20 0.08 ... 0.30 0.69 0.03 0.14 0.02 0.37 -0.22 0.33 0.11 0.03
reduce_relations 0.40 0.54 0.10 0.36 0.04 0.21 0.18 0.63 0.16 0.31 ... 0.26 0.20 0.19 0.26 0.16 0.52 0.12 0.45 0.04 0.09
reject 0.61 0.90 0.32 0.31 0.12 0.44 0.38 0.42 0.38 0.28 ... 0.29 0.44 0.29 0.42 0.31 0.80 -0.11 0.70 0.21 0.18
reject_mediation 0.13 0.18 0.22 0.02 -0.06 0.06 0.08 0.11 -0.00 0.13 ... -0.08 0.25 0.13 0.10 -0.04 0.09 -0.11 0.02 0.05 -0.04
reject_proposal_to_meet,_discuss,_or_negotiate 0.53 0.77 0.46 -0.00 0.20 0.57 0.57 0.12 0.35 0.32 ... 0.27 0.56 0.18 0.37 0.22 0.55 -0.27 0.38 0.26 0.03
seize_or_damage_property 0.23 0.22 0.10 0.10 -0.02 -0.02 0.09 0.55 0.11 0.05 ... 0.12 0.09 0.33 0.20 0.02 0.32 0.30 0.14 0.03 0.26
sexually_assault 0.29 0.44 0.09 0.02 -0.03 0.30 0.15 0.12 0.26 0.27 ... 0.10 0.17 0.05 0.18 0.24 0.34 -0.16 0.21 0.04 -0.00
threaten 0.50 0.87 0.33 0.11 0.18 0.48 0.41 0.18 0.38 0.26 ... 0.30 0.67 0.16 0.37 0.32 0.72 -0.21 0.65 0.21 0.09
threaten_non-force 0.01 0.09 -0.09 0.24 -0.06 -0.10 -0.01 0.34 -0.01 -0.06 ... -0.08 -0.11 0.05 0.01 -0.01 0.03 0.06 -0.02 -0.06 -0.05
threaten_to_ban_political_parties_or_politicians 0.21 0.20 -0.04 0.23 0.08 0.01 0.19 0.18 0.13 -0.03 ... 0.02 0.02 0.20 0.11 -0.02 0.10 -0.09 0.04 0.35 -0.02
threaten_to_halt_negotiations -0.06 0.13 0.14 -0.09 -0.02 0.04 0.12 0.15 -0.01 -0.03 ... 0.20 0.05 -0.05 -0.05 0.01 0.09 -0.11 0.12 -0.03 -0.02
threaten_to_impose_state_of_emergency_or_martial_law 0.02 0.00 -0.05 0.26 -0.04 -0.10 -0.04 0.09 0.00 0.18 ... -0.02 -0.07 0.01 -0.02 -0.02 0.02 0.02 -0.03 -0.03 -0.02
threaten_to_reduce_or_break_relations -0.01 0.28 0.17 0.00 0.09 0.13 0.14 0.16 0.19 0.09 ... 1.00 0.26 0.06 -0.06 -0.07 0.13 -0.03 -0.01 -0.04 -0.09
threaten_to_reduce_or_stop_aid 0.22 0.49 0.20 0.04 0.11 0.42 0.17 -0.03 0.57 0.22 ... 0.26 1.00 0.16 0.18 0.01 0.43 -0.26 0.29 0.14 0.02
threaten_with_administrative_sanctions 0.29 0.26 0.08 0.02 0.17 0.04 0.14 0.31 0.34 -0.02 ... 0.06 0.16 1.00 0.43 -0.02 0.34 0.15 0.13 0.51 0.14
torture 0.75 0.45 0.16 0.10 0.21 0.35 0.20 0.21 0.14 0.21 ... -0.06 0.18 0.43 1.00 0.11 0.65 0.07 0.48 0.25 0.46
use_as_human_shield 0.28 0.29 -0.02 0.21 0.07 0.39 0.09 0.07 0.15 -0.03 ... -0.07 0.01 -0.02 0.11 1.00 0.33 -0.08 0.43 -0.02 -0.02
use_conventional_military_force 0.71 0.79 0.34 0.24 0.03 0.40 0.38 0.34 0.43 0.17 ... 0.13 0.43 0.34 0.65 0.33 1.00 -0.06 0.82 0.16 0.45
use_tactics_of_violent_repression -0.06 -0.10 -0.17 -0.00 0.17 -0.12 -0.15 0.45 -0.15 0.01 ... -0.03 -0.26 0.15 0.07 -0.08 -0.06 1.00 -0.02 -0.13 0.04
use_unconventional_violence 0.53 0.70 0.16 0.21 0.00 0.36 0.19 0.22 0.18 0.15 ... -0.01 0.29 0.13 0.48 0.43 0.82 -0.02 1.00 0.05 0.35
veto 0.23 0.29 -0.01 -0.11 0.27 0.13 0.25 0.12 0.23 -0.03 ... -0.04 0.14 0.51 0.25 -0.02 0.16 -0.13 0.05 1.00 -0.02
violate_ceasefire 0.44 0.12 0.09 0.02 -0.07 0.03 0.04 0.02 -0.02 -0.03 ... -0.09 0.02 0.14 0.46 -0.02 0.45 0.04 0.35 -0.02 1.00

73 rows × 73 columns

In [6]:
plt.figure(figsize=(15,15))
# Heat map of the correlations...
g = sns.heatmap(corr, cmap='coolwarm')
# Apply ticks (offset by 0.5 so labels sit at the center of each heatmap cell)
g = plt.xticks(np.arange(len(corr.columns)) + 0.5, corr.columns, rotation=90)
g = plt.yticks(np.arange(len(corr.columns)) + 0.5, corr.columns, rotation=0)

What can we say from the correlation of these variables?
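One way to start answering that question is to pull out the indicator pairs with the strongest correlations. The sketch below uses only the corr frame computed above: it keeps the upper triangle of the correlation matrix (dropping self-correlations and duplicate pairs), stacks it into a long format, and lists the ten most strongly correlated pairs.

# Keep the upper triangle (k=1 also drops the diagonal of self-correlations),
# stack into an (indicator_1, indicator_2) -> r series, and sort.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
corr_pairs = corr.where(mask).stack().sort_values(ascending=False)
print(corr_pairs.head(10))   # the ten most strongly correlated indicator pairs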

Eigen Decomposition

As we saw last time, we can decompose a square matrix into a matrix of eigenvectors and eigenvalues. Here we will use our covariance matrix $\Sigma$



$$ \Sigma = \textbf{V}\Lambda\textbf{V}^{-1} $$



Where $\textbf{V}$ is a matrix of eigenvectors and $\Lambda$ is a diagonal matrix of eigenvalues.



When decomposing a covariance matrix, we are aiming to find the eigenvectors that maximize the variance of the data. In a sense, the eigenvector with the largest eigenvalue tells most of the story in the data. What we are interested in is finding a single vector that broadly conveys most of the information without having to use all of the data.

In [7]:
# Decompose the covariance matrix
sigma = D_sub.cov()
evals,evecs = la.eig(sigma)
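As a quick sanity check (purely illustrative, nothing downstream depends on it), we can confirm that multiplying the pieces back together reconstructs $\Sigma$:

# Rebuild sigma from its eigenvectors and eigenvalues: V Lambda V^{-1}
Lambda = np.diag(evals)
reconstructed = evecs @ Lambda @ la.inv(evecs)
print(np.allclose(reconstructed, sigma.values))   # True, up to floating-point error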

Let's look at our eigenvalues. What do these tell us?

In [8]:
evals.round(3)
Out[8]:
array([1.6705711e+05, 5.1008294e+04, 1.6071909e+04, 2.6249700e+03,
       1.4014640e+03, 9.0077300e+02, 7.0484600e+02, 5.6308500e+02,
       3.3004100e+02, 2.2765100e+02, 2.3428700e+02, 1.6815800e+02,
       1.3379300e+02, 1.0015100e+02, 8.4977000e+01, 5.8784000e+01,
       5.2017000e+01, 4.7654000e+01, 4.0190000e+01, 3.6322000e+01,
       3.3430000e+01, 2.7946000e+01, 2.4945000e+01, 1.9384000e+01,
       1.8430000e+01, 1.7319000e+01, 1.4263000e+01, 1.3373000e+01,
       1.0286000e+01, 9.1440000e+00, 7.3380000e+00, 6.1890000e+00,
       5.8080000e+00, 5.0660000e+00, 4.6990000e+00, 4.3700000e+00,
       3.7230000e+00, 3.1460000e+00, 2.7640000e+00, 2.5170000e+00,
       2.0970000e+00, 1.9050000e+00, 1.6620000e+00, 1.3570000e+00,
       1.1990000e+00, 1.2810000e+00, 1.2870000e+00, 9.6800000e-01,
       6.9100000e-01, 5.8900000e-01, 5.2500000e-01, 4.2800000e-01,
       3.2200000e-01, 2.8600000e-01, 2.1100000e-01, 1.5900000e-01,
       1.5500000e-01, 1.1500000e-01, 1.0700000e-01, 9.1000000e-02,
       8.3000000e-02, 7.8000000e-02, 5.8000000e-02, 4.7000000e-02,
       3.7000000e-02, 3.0000000e-02, 2.6000000e-02, 1.7000000e-02,
       1.3000000e-02, 8.0000000e-03, 2.0000000e-03, 5.0000000e-03,
       4.0000000e-03])
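One caveat before going further: np.linalg.eig makes no promise about the order in which it returns eigenvalues, and a few of the values above are in fact slightly out of order. In these data the first three happen to be the largest, so the steps below are unaffected, but here is a small sketch of how to impose a descending order explicitly (la.eigh, which is designed for symmetric matrices, is another option):

# Sort the eigenvalue/eigenvector pairs from largest to smallest eigenvalue,
# since eig() does not guarantee any particular order.
order = np.argsort(evals)[::-1]    # indices sorting evals in descending order
evals_sorted = evals[order]
evecs_sorted = evecs[:, order]     # reorder the eigenvector columns to match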

Let's think back to what an eigenvalue is. It's the degree to which our eigenvector is being scaled up or down. Given that a covariance matrix is positive semi-definite (why is this?), the eigenvalues here are all non-negative: the eigenvectors are only ever scaled, never flipped.

What can we say about the eigenvectors with extremely small eigenvalues?

Explained Variance

Let's change how we evaluate the eigenvalues. Rather than think of them as scalars, let's think of them as weights. These weights determine the importance of a particular eigenvector. We want to choose the eigenvectors that explain the most variation (i.e. tell most of our data's story!).

In [9]:
# Let's convert our eigen values into proportions
variance_explained = evals/sum(abs(evals))
In [10]:
plt.figure(figsize=(10,5))
plt.plot(variance_explained,marker='o')
plt.xlabel("Components") 
plt.ylabel("Proportion of Explained Variation") 
plt.ylim(0,1)
plt.show()
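A related view is the cumulative share of variance explained, which tells us how many components we would need to keep to clear a given threshold. A minimal sketch (it assumes the eigenvalues are ordered from largest to smallest, which, per the note above, effectively holds here):

# Cumulative share of variance explained by the first k components.
cumulative = np.cumsum(variance_explained)
k = np.argmax(cumulative >= 0.95) + 1    # first k that crosses the 95% mark
print(f"The first {k} components explain {cumulative[k-1]:.1%} of the variance")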
In [11]:
print('Proportion of Variation Explained')
for i,val in enumerate(variance_explained):
    print(f'''
    Eigenvalue {i+1} accounts for {round(val*100,2)}% of the variance
    ''')
Proportion of Variation Explained

    Eigenvalue 1 accounts for 69.0% of the variance
    

    Eigenvalue 2 accounts for 21.07% of the variance
    

    Eigenvalue 3 accounts for 6.64% of the variance
    

    Eigenvalue 4 accounts for 1.08% of the variance
    

    Eigenvalue 5 accounts for 0.58% of the variance
    

    Eigenvalue 6 accounts for 0.37% of the variance
    

    Eigenvalue 7 accounts for 0.29% of the variance
    

    Eigenvalue 8 accounts for 0.23% of the variance
    

    Eigenvalue 9 accounts for 0.14% of the variance
    

    Eigenvalue 10 accounts for 0.09% of the variance
    

    Eigenvalue 11 accounts for 0.1% of the variance
    

    Eigenvalue 12 accounts for 0.07% of the variance
    

    Eigenvalue 13 accounts for 0.06% of the variance
    

    Eigenvalue 14 accounts for 0.04% of the variance
    

    Eigenvalue 15 accounts for 0.04% of the variance
    

    Eigenvalue 16 accounts for 0.02% of the variance
    

    Eigenvalue 17 accounts for 0.02% of the variance
    

    Eigenvalue 18 accounts for 0.02% of the variance
    

    Eigenvalue 19 accounts for 0.02% of the variance
    

    Eigenvalue 20 accounts for 0.02% of the variance
    

    Eigenvalue 21 accounts for 0.01% of the variance
    

    Eigenvalue 22 accounts for 0.01% of the variance
    

    Eigenvalue 23 accounts for 0.01% of the variance
    

    Eigenvalue 24 accounts for 0.01% of the variance
    

    Eigenvalue 25 accounts for 0.01% of the variance
    

    Eigenvalue 26 accounts for 0.01% of the variance
    

    Eigenvalue 27 accounts for 0.01% of the variance
    

    Eigenvalue 28 accounts for 0.01% of the variance
    

    Eigenvalues 29 through 73 each account for 0.0% of the variance

Let's change our language a bit here. Rather than referring constantly to the eigenvalue/eigenvector, let's call the combination a "principal component".

The above plot and printout show that most of the variance (roughly 97%) can be explained by the first three eigenvectors. The rest of the eigenvectors account for very little of the variation. That is, their eigenvalues barely scale the eigenvectors.

We can read this as saying that there are really only three underlying dimensions that explain the variation in the data matrix.

This is useful because we can reduce our 73 indicators down to three principal components that retain and account for the underlying variation in the data.

Generating Scores

We can now project our data down into three dimensions using our eigenvectors and eigenvalues. Let's subset our matrix of eigenvectors to the columns that correspond with the first three eigenvalues, which explain the most variation in the data.

The eigenvectors can be thought of as weights that describe how each indicator loads onto the underlying dimension.



$$ \textbf{X}_{n \times p} \textbf{V}^*_{p \times s} = \textbf{S}_{n \times s}$$



where $\textbf{X}$ is our original ${n\times p}$ data matrix, $\textbf{V}^*$ is the ${p \times s}$ subset of $\textbf{V}$ containing the $s < p$ eigenvectors with the largest eigenvalues, and $\textbf{S}$ is the resulting ${n \times s}$ matrix of projected scores.

In [12]:
# Extract the first three eigenvectors
weights = evecs[:,[0,1,2]]
weights.shape
Out[12]:
(73, 3)

We can now project our data onto these weights to create three variables, one for each of the retained dimensions.

In [13]:
X = D_sub.values                      # Convert the data frame to a numerical matrix
reduced_data = X.dot(weights)         # Project data onto the eigenvectors
reduced_data.shape
Out[13]:
(100, 3)
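A small aside: in the textbook presentation of this technique (principal component analysis), the columns of $\textbf{X}$ are mean-centered before projecting, since $\Sigma$ was computed from deviations around each column's mean. Skipping the centering, as above, only shifts each projected score by a constant, and because we standardize the scores in the next cell that shift washes out. A sketch of the centered version, for completeness:

# Mean-center each indicator before projecting (the textbook PCA convention).
# This differs from the uncentered projection above only by a constant shift
# per component, which disappears once the scores are standardized.
X_centered = X - X.mean(axis=0)
reduced_centered = X_centered.dot(weights)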
In [14]:
# Let's save these components back to our data frame
def standardize(x):
    '''
    Set mean equal to 0 and variance equal to 1
    '''
    return (x-np.mean(x))/np.std(x)

dat['comp_1'] = standardize(reduced_data[:,0])
dat['comp_2'] = standardize(reduced_data[:,1])
dat['comp_3'] = standardize(reduced_data[:,2])
dat2 = dat[['country','year','polity2','regime_type','comp_1','comp_2','comp_3']]
dat2.head()
Out[14]:
country year polity2 regime_type comp_1 comp_2 comp_3
0 China 1995.0 -7 Non-democracy 0.755175 0.460266 0.316986
1 China 1996.0 -7 Non-democracy 0.617021 0.118523 0.277331
2 China 1997.0 -7 Non-democracy 0.698467 0.555486 0.188908
3 China 1998.0 -7 Non-democracy 0.576978 -0.074490 0.216587
4 China 1999.0 -7 Non-democracy 0.303260 -0.683403 0.137371

Validation?

In [15]:
g = sns.pairplot(dat2[['comp_1','comp_2','comp_3','regime_type']],hue="regime_type",height=4)
In [16]:
f, axes = plt.subplots(1, 3,figsize=(15,5))
g = sns.boxplot(y="comp_1", x="regime_type", data=dat2,ax=axes[0])
g = sns.boxplot(y="comp_2", x="regime_type", data=dat2,ax=axes[1])
g = sns.boxplot(y="comp_3", x="regime_type", data=dat2,ax=axes[2])

The above poses an interesting question: is the reduced data still useful even if we can't associate the collapsed dimensions with a concept?
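One way to probe that question is to inspect the loadings themselves: which indicators receive the largest weights on each component. The sketch below uses only the weights matrix and the column names from earlier, and lists the indicators that load most heavily on each of the three components. If those lists cohere (say, mostly violent acts versus mostly diplomatic ones), we may be able to attach a concept to a dimension after all.

# Which repressive indicators load most heavily on each component?
loadings = pd.DataFrame(weights, index=D_sub.columns,
                        columns=['comp_1','comp_2','comp_3'])
for comp in loadings.columns:
    top = loadings[comp].abs().sort_values(ascending=False).head(10)
    print(f"\nLargest absolute loadings on {comp}:")
    print(top.round(3))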