Data

The Varieties of Democracy (V-Dem) dataset is a new approach to conceptualizing and measuring democracy. The data provides a multidimensional and disaggregated perspective that reflects the complexity of the concept of democracy as a system of rule that goes beyond the simple presence of elections. The V-Dem project distinguishes between five high-level principles of democracy: electoral, liberal, participatory, deliberative, and egalitarian, and collects data to measure these principles.

The project regularly surveys 3,000 experts to construct each measure. I’ve selected all metrics concerning civil liberties in a country. The list below outlines the V-Dems variables in the data.

  • Freedom from torture (v2cltort)
  • Freedom from political killings (v2clkill)
  • Freedom from forced labor for men (v2clslavem)
  • Freedom from forced labor for women (v2clslavef)
  • Transparent laws with predictable enforcement (v2cltrnslw)
  • Rigorous and impartial public administration (v2clrspct)
  • Access to justice for men (v2clacjstm)
  • Access to justice for women (v2clacjstw)
  • Social class equality in respect for civil liberty (v2clacjust)
  • Social group equality in respect for civil liberties (v2clsocgrp)
  • Freedom of discussion for men (v2cldiscm)
  • Freedom of discussion for women (v2cldiscw)
  • Freedom of academic and cultural expression (v2clacfree)
  • Freedom of religion (v2clrelig)
  • Freedom of foreign movement (v2clfmove)
  • Freedom of domestic movement for men (v2cldmovem)
  • Freedom of domestic movement for women (v2cldmovew)
  • State ownership of economy (v2clstown)
  • Property rights for men (v2clprptym)
  • Property rights for women (v2clprptyw)

The data has been aggregated to the country level (in an effort to keep the file size small). To do this, I first subsetted the data to reflect the post-World War II period and then averaged the scores for the entire time series. In addition, I’ve included a country-average measure of polity, which offers an alternative measure of democracy generated by the Center for Systemic Peace. (More on this below).
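
The aggregation described above can be sketched roughly as follows. This is only an illustration on a made-up country-year panel, not the code used to build the file: the data frame panel, its values, and the exact 1945 cutoff are all assumptions.

```r
# Toy country-year panel (hypothetical values, not V-Dem data)
panel <- data.frame(
  country  = c("A", "A", "A", "B", "B", "B"),
  year     = c(1940, 1950, 1960, 1940, 1950, 1960),
  v2cltort = c(-1.0, 0.0, 1.0, 2.0, 1.0, 3.0)
)

# Keep only the post-World War II years (assumed cutoff: after 1945)
post_war <- subset(panel, year > 1945)

# Average each country's score over the remaining years
country_avg <- aggregate(v2cltort ~ country, data = post_war, FUN = mean)
country_avg
```

Each country ends up as a single row, which is why the final file has one observation per country.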

Task

Let’s explore some of the clustering methods used in lecture to see if we can cluster countries into basic regime types — democracies/non-democracies — using the V-Dems data of a country’s civil liberties record. We’ll then compare our clustered categories to the averaged polity metric to see if our bins reflect the democracy scale reflected in that metric.

Summarize

Read in the data.

vdems <- read_csv("Data/vdems_civil_liberties.csv")
Parsed with column specification:
cols(
  .default = col_double(),
  country = col_character()
)
See spec(...) for full column specifications.

Quick summary of the data distribution using skimr. Some things to note:

  • There are 177 observations, meaning there are 177 countries represented in the data.
  • All the V-Dems variables are already scaled (this actually has to do with how the measures are generated from the expert surveys). Generally speaking, since we leverage the concept of “distance” to cluster, we need all the variables to exist in the same space. The only exception here is the polity variable, but we’ll be using that as confirmation of the clusters we draw out of the data rather than an input into the clustering algorithm.
  • There is no missingness in the data.
  • From the mini histograms in the skimr output, there don’t appear to be any significant distributional issues.
skimr::skim(vdems)
── Data Summary ────────────────────────
                           Values
Name                       vdems 
Number of rows             177   
Number of columns          25    
_______________________          
Column type frequency:           
  character                1     
  numeric                  24    
________________________         
Group variables                  

── Variable type: character ──────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique
1 country               0             1     4    32     0      177
  whitespace
1          0

── Variable type: numeric ────────────────────────────────────────────
   skim_variable n_missing complete_rate    mean    sd     p0    p25
 1 polity                0             1  0.980   5.88 -10    -3.75 
 2 v2cltort              0             1  0.215   1.28  -2.22 -0.794
 3 v2clkill              0             1  0.480   1.27  -2.47 -0.604
 4 v2clslavem            0             1  0.595   1.05  -2.19 -0.249
 5 v2clslavef            0             1  0.536   1.03  -1.99 -0.176
 6 v2cltrnslw            0             1  0.258   1.24  -1.79 -0.715
 7 v2clrspct             0             1  0.0599  1.25  -1.99 -0.904
 8 v2clacjstm            0             1  0.288   1.24  -2.32 -0.694
 9 v2clacjstw            0             1  0.260   1.23  -2.96 -0.699
10 v2clacjust            0             1  0.595   1.12  -2.17 -0.178
11 v2clsocgrp            0             1  0.332   1.14  -2.24 -0.568
12 v2cldiscm             0             1  0.200   1.23  -2.88 -0.666
13 v2cldiscw             0             1  0.169   1.19  -2.81 -0.743
14 v2clacfree            0             1  0.233   1.26  -2.80 -0.738
15 v2clrelig             0             1  0.428   1.17  -3.34 -0.388
16 v2clfmove             0             1  0.416   1.15  -3.55 -0.479
17 v2cldmovem            0             1  0.468   1.08  -4.22 -0.220
18 v2cldmovew            0             1  0.374   1.20  -4.24 -0.558
19 v2clstown             0             1 -0.0333  1.06  -3.51 -0.762
20 v2clprptym            0             1  0.479   1.14  -3.41 -0.509
21 v2clprptyw            0             1  0.491   1.22  -2.83 -0.461
22 v2clgencl             0             1  0.497   1.10  -2.80 -0.304
23 v2clgeocl             0             1  0.0856  1.08  -2.50 -0.828
24 v2clpolcl             0             1  0.299   1.16  -2.18 -0.556
       p50   p75  p100 hist 
 1  0.0959 6     10    ▃▇▇▅▇
 2 -0.0369 1.01   3.06 ▃▇▆▃▃
 3  0.476  1.40   2.95 ▂▇▆▇▅
 4  0.824  1.38   2.73 ▁▃▅▇▃
 5  0.652  1.40   2.69 ▂▅▇▇▂
 6  0.0656 1.05   3.29 ▆▇▇▃▂
 7 -0.239  0.826  3.56 ▆▇▅▂▁
 8  0.127  0.983  3.10 ▂▆▇▃▃
 9  0.217  1.05   3.04 ▁▅▇▃▃
10  0.722  1.35   2.93 ▂▅▇▇▃
11  0.381  1.22   2.80 ▂▇▇▇▃
12 -0.0185 1.18   2.86 ▁▆▇▅▃
13  0.0179 1.06   3.03 ▁▆▇▅▃
14  0.151  1.09   2.75 ▁▇▇▇▅
15  0.616  1.39   2.47 ▁▂▅▇▆
16  0.542  1.32   2.45 ▁▂▇▇▆
17  0.694  1.28   2.07 ▁▁▃▆▇
18  0.671  1.29   2.26 ▁▁▆▆▇
19  0.0998 0.818  2.11 ▁▃▆▇▃
20  0.812  1.43   2.18 ▁▂▆▆▇
21  0.708  1.37   2.52 ▁▃▆▇▆
22  0.476  1.42   2.68 ▁▂▇▆▅
23 -0.0766 1.19   2.36 ▁▇▇▆▅
24  0.0236 1.14   2.99 ▃▇▇▅▃
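
To see why the note on scaling matters for distance-based clustering, here is a small sketch on made-up numbers (the data frame toy is hypothetical). The V-Dems inputs already share a scale, so this step is not needed for the data here. The sketch only shows what can go wrong when variables live on very different scales, and how the base R function scale() puts them on a common footing.

```r
# Two variables on very different scales (hypothetical values)
toy <- data.frame(a = c(0, 1, 2), b = c(0, 1000, 10))

# Raw Euclidean distances are driven almost entirely by b
raw_dist <- as.matrix(dist(toy))
raw_dist

# After centering and scaling, both variables contribute comparably
scaled <- scale(toy)
scaled_dist <- as.matrix(dist(scaled))
scaled_dist
```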

Let’s now explore the correlation of the variables. As you can see, all the variables are highly correlated with one another. This isn’t surprising since they all attempt to get at similar concepts. That is, when a state infringes upon civil liberties along one dimension, it usually also does so along another. Clustering in and of itself makes no parametric assumptions, so we don’t need to worry about lurking issues here, such as multicollinearity. That said, these data are likely prime candidates for a decomposition (see reading)!

vdems %>% 
  select(-country,-polity) %>%  
  GGally::ggcorr(.)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
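
As a rough illustration of why highly correlated variables are good candidates for a decomposition, the sketch below simulates three noisy copies of one latent score and runs a principal components analysis with the base R function prcomp(). The simulated data and variable names are assumptions made for illustration; nothing here is run on the V-Dems file.

```r
set.seed(1)

# Three noisy measurements of the same underlying (latent) score
latent <- rnorm(200)
toy <- data.frame(
  x1 = latent + rnorm(200, sd = 0.3),
  x2 = latent + rnorm(200, sd = 0.3),
  x3 = latent + rnorm(200, sd = 0.3)
)

# The variables are strongly correlated with one another
round(cor(toy), 2)

# Principal components on the standardized variables
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)
```

When most of the variance loads on the first component, as it should in this simulation, a handful of components can stand in for many correlated inputs.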

K Means Clustering

The aim here is to see if we can leverage the V-Dems civil liberties data to generate ad-hoc categories containing democracies and non-democracies. What is the value of doing this? Well, say we didn’t have any information on which observations were democracies and which were not (or we didn’t trust existing measures). Unsupervised learning techniques, like K-Means clustering, offer us a way to identify clusters (i.e. groups, structures, etc.) in the data. These groups can correspond with theoretically relevant concepts (like democracy), or they can simply offer useful ways of clustering the information in the dataset to (a) generate new features to be used in a supervised machine learning task, and/or (b) explore the data. The way variables clump together (or don’t clump together) can sometimes offer interesting insights.

# Subset the data to only include the variables we are clustering
c_dat  <- vdems %>%  select(-country,-polity)

# Set a seed to reproduce results, as where you randomly start matters 
set.seed(1988)

# run the K Means clustering algorithm. Here 'centers' == k. 
kmean_cluster <- kmeans(c_dat,centers=2)
kmean_cluster
K-means clustering with 2 clusters of sizes 100, 77

Cluster means:
    v2cltort   v2clkill  v2clslavem  v2clslavef v2cltrnslw  v2clrspct
1 -0.6472049 -0.3528492 -0.02498549 -0.08285085 -0.5356396 -0.6596279
2  1.3337045  1.5623754  1.39904041  1.34071630  1.2877663  0.9942734
  v2clacjstm v2clacjstw  v2clacjust v2clsocgrp  v2cldiscm  v2cldiscw
1 -0.4948829 -0.5436113 -0.03226539 -0.3071647 -0.6647442 -0.6700085
2  1.3057142  1.3043362  1.41023181  1.1620475  1.3225157  1.2578235
  v2clacfree  v2clrelig  v2clfmove v2cldmovem v2cldmovew  v2clstown
1 -0.5420559 -0.2199273 -0.3047384 -0.1358756 -0.3404345 -0.4994183
2  1.2400200  1.2702497  1.3518833  1.2514456  1.3019754  0.5721136
  v2clprptym v2clprptyw  v2clgencl  v2clgeocl  v2clpolcl
1 -0.1914482 -0.2430414 -0.1292365 -0.6098815 -0.4403644
2  1.3489253  1.4439545  1.3101084  0.9887677  1.2580801

Clustering vector:
  [1] 1 1 1 1 2 2 2 2 1 1 1 2 2 2 2 1 1 2 2 2 1 2 1 1 1 1 2 1 1 1 2 1
 [33] 1 1 2 2 1 2 2 1 2 1 1 2 1 1 1 1 2 1 1 2 2 2 1 2 1 2 2 2 1 1 1 2
 [65] 1 1 2 2 2 1 1 1 2 2 2 1 2 2 1 2 1 2 1 2 1 2 2 1 1 1 2 2 1 1 1 1
 [97] 1 2 1 2 1 2 1 2 1 1 1 1 2 2 1 1 1 1 2 2 1 1 2 1 1 1 1 2 2 1 1 1
[129] 1 1 1 1 1 2 2 1 2 2 2 2 1 1 2 1 1 2 2 1 2 2 2 1 2 1 2 1 2 1 1 2
[161] 1 1 1 1 2 1 2 2 2 1 2 2 1 1 1 1 1

Within cluster sum of squares by cluster:
[1] 1700.127 1116.362
 (between_SS / total_SS =  49.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault"      

As we can see above, the k-means algorithm is quick to run and it spits out a lot of interesting material.

  1. We can see the within-cluster sum of squares (reported below as a total across clusters). This captures the degree to which each cluster is homogeneous. The smaller this value is, the more the data points contained within a cluster resemble one another.
kmean_cluster$tot.withinss
[1] 2816.489
  2. We can see the between-cluster sum of squares. This tells us how different each cluster is from the other clusters. Put differently, it’s a measure of heterogeneity between clusters.
kmean_cluster$betweenss
[1] 2736.799

In practice, we want the within-ness to be small, and the between-ness to be large. That is, each data point in a cluster really resembles its neighbors while each cluster is distinct from the other clusters.
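
The two quantities are linked: the total sum of squares splits exactly into the within and between parts. In the output above, 2736.799 / (2736.799 + 2816.489) is roughly 0.493, which is the 49.3 % that kmeans reports. The sketch below recomputes both pieces by hand on simulated data (the matrix toy is made up for illustration) and compares them with what kmeans() stores.

```r
set.seed(42)

# Two simulated groups of 20 points each in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = -2), ncol = 2),
  matrix(rnorm(40, mean =  2), ncol = 2)
)
fit <- kmeans(toy, centers = 2, nstart = 10)

# Within: squared distance of every point to its own cluster center
within_by_hand <- sum((toy - fit$centers[fit$cluster, ])^2)

# Total: squared distance of every point to the overall mean
total_by_hand <- sum(scale(toy, center = TRUE, scale = FALSE)^2)

# Between: whatever is left over
between_by_hand <- total_by_hand - within_by_hand

c(within = within_by_hand, between = between_by_hand, total = total_by_hand)
c(within = fit$tot.withinss, between = fit$betweenss, total = fit$totss)
```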

This is more of an art than a science, but there is a clear way of trying to optimize this arrangement (see below).

  3. We can see that the k-means algorithm returns class designations. We’ll use these as the categories into which we classify each country.
kmean_cluster$cluster
  [1] 1 1 1 1 2 2 2 2 1 1 1 2 2 2 2 1 1 2 2 2 1 2 1 1 1 1 2 1 1 1 2 1
 [33] 1 1 2 2 1 2 2 1 2 1 1 2 1 1 1 1 2 1 1 2 2 2 1 2 1 2 2 2 1 1 1 2
 [65] 1 1 2 2 2 1 1 1 2 2 2 1 2 2 1 2 1 2 1 2 1 2 2 1 1 1 2 2 1 1 1 1
 [97] 1 2 1 2 1 2 1 2 1 1 1 1 2 2 1 1 1 1 2 2 1 1 2 1 1 1 1 2 2 1 1 1
[129] 1 1 1 1 1 2 2 1 2 2 2 2 1 1 2 1 1 2 2 1 2 2 2 1 2 1 2 1 2 1 1 2
[161] 1 1 1 1 2 1 2 2 2 1 2 2 1 1 1 1 1

We can actually turn this classification into a variable that we can use in a visualization, model, or data summary. Note that we convert it to a factor so that R treats it as a categorical variable and not a numeric one. It’s always important to keep in mind that there is no intrinsic ordering in these categorical levels: going from 1 to 2 doesn’t mean you’re increasing or decreasing.

vdems$k_clusters <- as.factor(kmean_cluster$cluster)
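
A quick way to convince yourself that the labels carry no meaning is to cluster the same data from two different random starts and cross-tabulate the results. The sketch below uses simulated data (toy and the seeds are arbitrary choices for illustration). The numbering may or may not flip between runs, but the grouping itself should be the same.

```r
set.seed(1)

# Two well-separated simulated groups
toy <- rbind(
  matrix(rnorm(40, mean = -3), ncol = 2),
  matrix(rnorm(40, mean =  3), ncol = 2)
)

# Same data, two different random starts
set.seed(10)
run_a <- kmeans(toy, centers = 2, nstart = 10)$cluster
set.seed(20)
run_b <- kmeans(toy, centers = 2, nstart = 10)$cluster

# Each row should have a single non-zero cell: same groups, possibly renumbered
tab <- table(run_a, run_b)
tab

# Stored as a factor, the labels are treated as categories, not numbers
cluster_f <- as.factor(run_a)
levels(cluster_f)
is.numeric(cluster_f)
```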

Specifically, let’s examine the difference in means between the clusters. We can see there are distinct divergences in the mean levels between the values in each of the two clusters.

vdems %>% 
  select(-country,-polity) %>% 
  group_by(k_clusters) %>% 
  summarize_all(mean) 

A better way to look at this is to examine the univariate distributions of the two cluster categories.

vdems %>% 
  
  # Let's reshape the data so that the inputs are in a "long" format
  pivot_longer(cols=v2cltort:v2clpolcl) %>% 
  
  # Let's now plot as separate density plots using facet wrap
  ggplot(aes(value,fill=k_clusters)) +
  geom_density(alpha=.5) +
  
  # Break up each plot by variable.
  facet_wrap(~name,scales='free',ncol=4) +
  theme(legend.position = "top")


As we can see from the above, the algorithm does a fairly good job shuffling the data into two piles. We can see this play out in a simple scatter plot. Note that the clusters aren’t perfect: there does appear to be some cross-over, but the clusters appear stable.

vdems %>% 
  ggplot(aes(v2clkill,v2cltort,color=k_clusters)) +
  geom_point()

But did we capture democracy?

The big question now is whether we were successful or not in capturing the concept of democracy by using these data to cluster into two categories. First, let’s outline a few things to keep in mind:

  • An important thing to note is that it’s really hard to classify countries as democracies and non-democracies. Much ink has been spilled on the undertaking, and clustering methods aren’t going to offer a magic wand that will do this perfectly.
  • We’re going to compare how well our clustering algorithm binned countries by looking at how it performed relative to the popular polity scale (which measures democracy on a -10 to 10 rating system). But note that we often don’t have such a scale to validate our results. Otherwise, why would we need to cluster? We could just use that existing classification system! This gets to the heart of why we call these methods unsupervised: there is no right or wrong answer, just interpretation.
  • The cluster categories don’t always have to “make sense” to be useful. Remember, much of supervised machine learning can operate under the paradigm of “throw it all in and see what does best”.

Let’s now compare our cluster categories to the polity scale. As we can see, it looks like our clusters did in fact broadly separate democracies from non-democracies. Cluster 2 clearly contains countries that have a higher polity score (i.e. more democratic) than countries in Cluster 1.

vdems %>% 
  ggplot(aes(k_clusters,polity)) +
  geom_boxplot()

Let’s look at the country membership in each category. As we can see, in Cluster 2 we capture countries like the United States, Finland, Canada, etc. and in Cluster 1 we capture states like Saudi Arabia, Russia, and North Korea. Note also that the clusters aren’t perfect. There are some countries that fall lower on the polity scale that we captured in Cluster 2 (e.g. Kazakhstan). Likewise, there are countries that were relatively high on the polity scale that were in Cluster 1 (e.g. Turkey, South Africa).

vdems %>% 
  ggplot(aes(k_clusters,polity,color=k_clusters)) +
  geom_jitter(width=.3,show.legend = F) +
  ggrepel::geom_text_repel(aes(label=country),show.legend = F) +
  ggthemes::scale_color_gdocs()

The method isn’t perfect, but for something that took less than a second to run, not bad.
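
One way to put a rough number on "not bad" is to dichotomize polity and cross-tabulate it against the cluster labels. The sketch below does this on a made-up data frame (toy and its values are hypothetical). The cutoff of 6 is an assumption borrowed from common practice with the polity scale and may not suit time-averaged scores.

```r
# Hypothetical polity averages and cluster labels for ten countries
toy <- data.frame(
  polity  = c(-9, -7, -2, 3, 5, 7, 8, 9, 10, 10),
  cluster = factor(c(1, 1, 1, 1, 2, 1, 2, 2, 2, 2))
)

# Assumed cutoff: a polity score of 6 or more counts as a democracy
toy$democracy <- toy$polity >= 6

# Cross-tabulate cluster membership against the dichotomized scale
tab <- table(cluster = toy$cluster, democracy = toy$democracy)
tab

# Cluster labels are arbitrary, so take the better of the two possible matchings
matched <- sum(diag(tab))
agreement <- max(matched, sum(tab) - matched) / sum(tab)
agreement
```

With the real data, the same table could be built from vdems$k_clusters and vdems$polity.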

Optimizing K-means?

As a side note: one thing that we can do is to look for an “optimal” value of k by running the algorithm many times and plotting the within-ness and between-ness. There should be a point — an “elbow” as the literature goes — where one additional cluster brings very little return.

The following code simply plots the within sum of squares and the between sum of squares as we increase the number of clusters.

between_sum_squares <- numeric()
within_sum_squares <- numeric()

# Run the algorithm for different values of k 
set.seed(1988)

for(k in 1:10){
  
  # Fit once per k so that betweenss and tot.withinss come from the same solution
  fit <- kmeans(c_dat, centers=k)
  between_sum_squares[k] <- fit$betweenss
  within_sum_squares[k] <- fit$tot.withinss
  
}

# Between-cluster sum of squares vs Choice of k
bss_plot <- qplot(1:10, between_sum_squares, geom=c("point", "line"), 
                  xlab="Number of clusters", ylab="Between-cluster sum of squares") +
  scale_x_continuous(breaks=seq(1, 10, 1)) +
  theme_bw()

# Total within-cluster sum of squares vs Choice of k
wss_plot <- qplot(1:10, within_sum_squares, geom=c("point", "line"),
                  xlab="Number of clusters", ylab="Total within-cluster sum of squares") +
  scale_x_continuous(breaks=seq(1, 10, 1)) +
  theme_bw()

# Subplot
gridExtra::grid.arrange(bss_plot, wss_plot, ncol=2)

As we can see from the above, after 4 clusters, we get very little by way of a return from adding an additional cluster. What’s interesting is that there appears to be something closer to 4 clusters in the data, so the data is potentially more nuanced than the two categories we just showcased.
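
Reading an elbow off a plot is subjective, so it can help to look at the proportional drop in the within sum of squares from each additional cluster. The sketch below does this on simulated data with three built-in groups (toy is made up for illustration). It uses the nstart argument of kmeans() so that each k is fit from several random starts and the best solution is kept, which makes the curve less dependent on any single start.

```r
set.seed(1988)

# Three simulated groups of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = -4), ncol = 2),
  matrix(rnorm(60, mean =  0), ncol = 2),
  matrix(rnorm(60, mean =  4), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Proportional reduction in WSS from adding one more cluster
drop <- -diff(wss) / wss[-length(wss)]
names(drop) <- paste0(1:5, "->", 2:6)
round(drop, 2)
```

The gains should fall off sharply once k passes the number of groups actually present in the data.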

Hierarchical Clustering

As we discussed in lecture, hierarchical clustering takes a very different approach to clustering data points. Rather than pre-specifying a value for k, we build the clusters organically from the bottom-up, using a concept of distance and tree structure.

First, note that we’ll need to convert our data into a pair-wise distance matrix, which captures in a single square matrix the Euclidean distance from every observation to every other observation. The code below does this for us.

distance_mat <- dist(c_dat)
class(distance_mat) 
[1] "dist"
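
To make the distance object less of a black box, here is a small sketch on three hand-picked points (hypothetical values chosen so the distances are easy to verify by hand). Expanding the dist object with as.matrix() shows the full square layout.

```r
# Three points in two dimensions
pts <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

# dist() stores only the lower triangle of the pairwise distances
d <- dist(pts)
d

# as.matrix() expands it into the full symmetric matrix
d_mat <- as.matrix(d)
d_mat

# Check one entry by hand: the distance from a to b
by_hand <- sqrt((3 - 0)^2 + (4 - 0)^2)
by_hand
```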

Recall that the concept of dissimilarity between groups of observations (“linkage”) can be defined in very different ways. Let’s explore how the hierarchical trees look different as one changes the concept of linkage.

hc_complete <- hclust(distance_mat, method="complete")
plot(hc_complete,main="Hierarchical Cluster - Complete Linkage")

hc_average <- hclust(distance_mat, method="average")
plot(hc_average,main="Hierarchical Cluster - Average Linkage")

hc_centroid <- hclust(distance_mat, method="centroid")
plot(hc_centroid,main="Hierarchical Cluster - Centroid Linkage")

hc_single <- hclust(distance_mat, method="single")
plot(hc_single,main="Hierarchical Cluster - Single Linkage")

As we can see, very different stories emerge from each one. This isn’t a cause for alarm, but rather a call to investigation. How do differences in linkage change the types of clusters that emerge? Does one do a better or worse job? Do these decisions appear to reveal anything interesting from the data?

For example, when using the centroid, average, and single linkage methods, a single observation (114) stands out from the pack. What’s up with that observation? Let’s look.

vdems %>% slice(114)

Ah-ha! It’s North Korea. A clear outlier on the international stage. The hierarchical clustering method clearly noted the stark difference between North Korea and all the other countries in the data and that “distance” is captured in how observations were clustered.
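
To make the linkage definitions concrete, the sketch below takes four made-up points on a line, treats the first two and the last two as separate groups, and computes the single, complete, and average linkage distances between the groups by hand. It then checks these against the height of the final merge reported by hclust(). Centroid linkage is left out to keep the example short.

```r
# Four points on a line: two obvious groups, {0, 1} and {5, 6}
x <- c(0, 1, 5, 6)
d <- dist(x)

# All pairwise distances between members of the two groups
cross <- abs(outer(x[1:2], x[3:4], "-"))

# Linkage is a rule for summarizing those cross-group distances
by_hand <- c(single = min(cross), complete = max(cross), average = mean(cross))
by_hand

# The last merge in each tree joins the two groups, at the linkage distance
final_height <- sapply(c("single", "complete", "average"), function(m) {
  max(hclust(d, method = m)$height)
})
final_height
```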

So how did it do capturing democracy?

To get categories from a hierarchical clustering algorithm, you can do one of two things:

  1. “Trim” the tree at a certain height.
cutree(hc_complete,h=20)
  [1] 1 1 1 1 2 2 2 2 1 1 1 2 2 2 2 2 1 2 2 1 2 2 1 1 1 1 2 2 1 1 2 1
 [33] 1 1 2 2 1 2 2 1 2 1 1 1 1 1 1 1 2 1 1 2 2 2 1 2 1 2 2 2 1 1 1 2
 [65] 1 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 1 2 1 2 2 2 1 1 2 2 1 1 2 1
 [97] 1 2 1 2 1 2 2 2 1 1 1 1 2 2 1 2 1 1 2 2 1 1 2 2 1 1 1 2 2 1 1 1
[129] 1 1 1 2 1 2 2 1 2 2 2 2 1 1 2 1 1 2 2 1 2 2 2 1 2 1 2 1 2 1 1 2
[161] 1 1 1 1 2 1 2 2 2 1 2 2 1 1 1 1 1
  2. Ask for k number of clusters back, and the software will find the right height to cut the tree to return that result.
cutree(hc_complete,k=2)
  [1] 1 1 1 1 2 2 2 2 1 1 1 2 2 2 2 2 1 2 2 1 2 2 1 1 1 1 2 2 1 1 2 1
 [33] 1 1 2 2 1 2 2 1 2 1 1 1 1 1 1 1 2 1 1 2 2 2 1 2 1 2 2 2 1 1 1 2
 [65] 1 1 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 2 1 2 1 2 2 2 1 1 2 2 1 1 2 1
 [97] 1 2 1 2 1 2 2 2 1 1 1 1 2 2 1 2 1 1 2 2 1 1 2 2 1 1 1 2 2 1 1 1
[129] 1 1 1 2 1 2 2 1 2 2 2 2 1 1 2 1 1 2 2 1 2 2 2 1 2 1 2 1 2 1 1 2
[161] 1 1 1 1 2 1 2 2 2 1 2 2 1 1 1 1 1
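
The two routes give the same answer whenever the chosen height falls between the merges that separate k clusters from k - 1. A small sketch on made-up points (the vector x and the cut height of 3 are arbitrary choices for illustration):

```r
# Four points on a line with two obvious groups
x <- c(0, 1, 5, 6.5)
hc <- hclust(dist(x), method = "complete")

# Merge heights, from the first merge to the last
hc$height

# Cutting at a height between the last two merges leaves two clusters ...
by_height <- cutree(hc, h = 3)

# ... which is what asking for k = 2 returns directly
by_k <- cutree(hc, k = 2)

by_height
by_k
```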

Let’s do the following:

  • As before, let’s split the data into two cluster categories. Again, we’ll convert these categories to factors so R doesn’t get confused.
  • Let’s then compare those categories to polity and see if they capture our “democracy”/“non-democracy” concepts.
  • Let’s do this for every linkage method to see if there are any noticeable differences.
vdems$h_clust_complete <- as.factor(cutree(hc_complete,k=2))
vdems$h_clust_average <- as.factor(cutree(hc_average,k=2))
vdems$h_clust_centroid <- as.factor(cutree(hc_centroid ,k=2))
vdems$h_clust_single <- as.factor(cutree(hc_single,k=2))

Let’s now plot.

vdems %>% 
  # Select the data, and spread the cluster categories long
  select(country,polity,contains("h_clust")) %>% 
  pivot_longer(cols=contains("h_clust"))  %>% 
  
  # Generate the box plots
  ggplot(aes(value,polity,fill=value,color=value)) +
  geom_boxplot(alpha=.7) +
  facet_wrap(~name,scales="free")

Wait, what’s going on here? Why are some of the categories so squished for a few of the methods? Recall that for several of the linkage methods the tree separated North Korea from every other country. This was true for every linkage method except complete linkage.

We could re-run this to then include more clusters, but then, we’ll be doing something slightly different than getting a raw measure for dems/non-dems.

For the h-clustering method that relies on complete linkage, we can see that we get a distinction similar to what we recovered from the k means algorithm.
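
The contrast between linkage methods in the presence of an outlier can be reproduced on a toy example. The seven points below are made up so that one point sits apart from two tight groups. With k = 2, single linkage peels off the lone point, while complete linkage attaches it to the nearer group and splits the two groups instead.

```r
# Two tight groups plus one moderately distant point (hypothetical values)
x <- c(0, 0.5, 1.2,   # group A
       5, 5.4, 6.1,   # group B
       10.5)          # lone point
d <- dist(x)

# Cluster sizes when each tree is cut into two groups
sizes_single   <- table(cutree(hclust(d, method = "single"),   k = 2))
sizes_complete <- table(cutree(hclust(d, method = "complete"), k = 2))

sizes_single     # one cluster of six, plus the lone point on its own
sizes_complete   # group A versus group B together with the lone point
```

This is only one configuration. How an outlier is treated depends on how far it sits from the rest relative to the spread of the main groups.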

Just for the sake of it, let’s increase k to 4. When we do so, we can see that those outlier categories prevail. This should make us think about what outliers mean for different linkage methods (and eventual interpretation) when using hierarchical clustering.

vdems$h_clust_complete <- as.factor(cutree(hc_complete,k=4))
vdems$h_clust_average <- as.factor(cutree(hc_average,k=4))
vdems$h_clust_centroid <- as.factor(cutree(hc_centroid ,k=4))
vdems$h_clust_single <- as.factor(cutree(hc_single,k=4))

vdems %>% 
  # Select the data, and spread the cluster categories long
  select(country,polity,contains("h_clust")) %>% 
  pivot_longer(cols=contains("h_clust"))  %>% 
  
  # Generate the box plots
  ggplot(aes(value,polity,fill=value,color=value)) +
  geom_boxplot(alpha=.7) +
  facet_wrap(~name,scales="free")

