class: center, middle, inverse, title-slide

# PPOL670 | Introduction to Data Science for Public Policy

## Week 13

## Applications in Unsupervised Learning

### Prof. Eric Dunford ◆ Georgetown University ◆ McCourt School of Public Policy ◆ eric.dunford@georgetown.edu
---
layout: true

<div class="slide-footer"><span> PPOL670 | Introduction to Data Science for Public Policy           Week 13 <!-- Week of the Footer Here -->              Unsupervised Learning <!-- Title of the lecture here --> </span></div>

---
class: outline

# Outline for Today

<br>

- **Refresher on the supervised vs. unsupervised distinction**

<br>

- **K-means Clustering**

<br>

- **Hierarchical Clustering**

---
class: newsection

# Supervised <br> vs. <br> Unsupervised

---

### Supervised Learning

- For each observation of the predictor measurements `\(x_i\)` there is an associated response measurement `\(y_i\)`. In essence, there is an _outcome_ we are aiming to accurately predict or understand.

- Use regression and classification methods

<img src="lecture-week-13-applications-unsupervised-ml-ppol670_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

### Unsupervised Learning

- We observe a vector of measurements `\(x_i\)` but _no_ associated response `\(y_i\)`.

- "Unsupervised" because we lack a response variable that can supervise our analysis.

- The aim is (a) data reduction and/or (b) data exploration (finding "hidden" structures in the data).

<img src="lecture-week-13-applications-unsupervised-ml-ppol670_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

### Challenges of Unsupervised Learning

<br>

- Tends to be more **subjective**: there is no simple goal here

<br>

- **Difficult to assess** the results

<br>

- No way to "check our work": no universally accepted mechanism for performing cross-validation or validating results

<br>

- Requires **exploration**

---
class: newsection

### K-Means Clustering

---

### Clustering Methods

![:space 5]

- **Clustering** refers to techniques for finding _subgroups_ in a data set.

- When we cluster the observations of a data set, we seek to partition them into distinct groups so that

    - the observations **_within_** each group are **_similar_** to each other,

    - while observations in different groups are **_different_** from each other.

- Clustering looks to find homogeneous subgroups among the observations.

---

### K-means clustering

<br><br>

- **Aim**: partition a data set into `\(K\)` distinct, _non-overlapping_ clusters.

<br>

- An observation can be a member of only one cluster.

<br>

- A "good" clustering is one where the **_within-cluster variation is as small as possible_**.

---

### K is arbitrary

<br><br>

.center[
<img src = "Figures/k-is-arbitrary.png">
]

---

### Minimize the total within-cluster variation

<br><br>

`$$\underset{C_1,\dots,C_K}{Min} \begin{Bmatrix}\sum^K_{k=1} W(C_k) \end{Bmatrix}$$`

<br><br>

Where

- `\(k\)` indexes the clusters, of which there are `\(K\)` in total

- `\(C_k\)` denotes a cluster

- `\(W(C_k)\)` denotes the within-cluster variation

---

### Within-cluster variation?

First, what does it mean to be "close"?

--

We could use the _squared Euclidean distance_

`$$(x_{ij} - x_{i'j})^2$$`

But "distance" can be conceptualized in many ways.

--

Second, define the within-cluster variation

`$$W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k}\sum^p_{j=1} (x_{ij} - x_{i'j})^2$$`

where `\(|C_k|\)` denotes the number of observations in the `\(k\)`th cluster.

--

> A very difficult problem to solve exactly, because there are roughly `\(K^n\)` ways to partition the data.

---

### K-means clustering algorithm

<br><br>

**Step 1**: Randomly assign a number, from 1 to `\(K\)`, to each of the observations. These serve as initial cluster assignments for the observations.

**Step 2**: Iterate until the cluster assignments stop changing:

- **(a)** For each of the `\(K\)` clusters, compute the cluster centroid. The `\(k\)`th cluster centroid is the vector of the `\(p\)` feature means for the observations in the `\(k\)`th cluster.

- **(b)** Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
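---

### K-means in R (sketch)

Base R's `kmeans()` function implements this algorithm. Below is a minimal sketch, assuming the built-in `USArrests` data and an arbitrary choice of `\(K = 3\)` — illustrative choices, not the lecture's example.

```r
# Illustrative sketch: k-means on a built-in data set (arbitrary choices)
set.seed(123)                  # results depend on the random start in Step 1

dat <- scale(USArrests)        # standardize the features before clustering

# K = 3 clusters; nstart = 20 re-runs the random initialization 20 times and
# keeps the solution with the lowest total within-cluster variation
km_out <- kmeans(dat, centers = 3, nstart = 20)

km_out$cluster                 # cluster assignment for each observation
km_out$tot.withinss            # total within-cluster sum of squares
```

Using several random starts (`nstart`) guards against the algorithm settling on a poor local optimum.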
---

.center[
<img src = "Figures/k-mean-alg.png" width=600>
]

---

Because the K-means algorithm finds a **_local_** rather than a **_global optimum_**, the results obtained will depend on the initial (random) cluster assignment of each observation in Step 1.

.center[
<img src = "Figures/k-means-diff-starts.png" width=500>
]

---
class: newsection

### Hierarchical Clustering

---

### Hierarchical Clustering

<br><br>

- One downside of `\(K\)`-means is that it requires us to **_pre-specify_** the number of clusters `\(K\)`.

- **Hierarchical clustering** is a _bottom-up_ clustering method that doesn't require us to define the number of clusters _ex ante_.

- Hierarchical clustering builds a tree (dendrogram) that starts with every observation as its own cluster and then fuses clusters all the way up to the trunk.

- We can determine the number of clusters by choosing a cutpoint (i.e., by "pruning the tree").

---

### Hierarchical Clustering

<br>

.center[
<img src = "Figures/hclust-dendrogram.png" width=1000>
]

---

### Hierarchical Clustering

<br>

We can draw conclusions regarding the similarity of observations from their proximity on the _vertical_ axis, _not_ the horizontal axis.

<br>

.center[
<img src = "Figures/simple-hclust.png" width=1000>
]

---

### Hierarchical Clustering Algorithm

**Step 1**: Begin with `\(n\)` observations and a measure (such as Euclidean distance) of all the `\(\frac{n(n-1)}{2}\)` pairwise dissimilarities. Treat each observation as its own cluster.

**Step 2**: For `\(i = n, n-1, \dots, 2\)`:

- **(a)** Examine all pairwise inter-cluster dissimilarities among the `\(i\)` clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.

- **(b)** Compute the new pairwise **inter-cluster dissimilarities** among the `\(i - 1\)` remaining clusters.

---

.center[
<img src = "Figures/hclust-alg.png" width=600>
]

---

### Defining "Inter-Cluster" Dissimilarity

The concept of dissimilarity between a pair of observations needs to be extended to a pair of **_groups of observations_**.

--

This is known as **"linkage"**.

|Linkage| Description|
|-------|----------|
| *Complete* | Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the _largest_ of these dissimilarities. |
| *Single* | Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the _smallest_ of these dissimilarities. Single linkage can result in extended, trailing clusters in which single observations are fused one at a time. |
| *Average* | Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities. |
| *Centroid* | Dissimilarity between the centroid for cluster A (a mean vector of length `\(p\)`) and the centroid for cluster B. Centroid linkage can result in undesirable _inversions_. |
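---

### Hierarchical clustering in R (sketch)

In base R, `hclust()` takes a dissimilarity matrix from `dist()` plus a linkage method. A minimal sketch, assuming the built-in `USArrests` data and Euclidean distance (illustrative choices only):

```r
# Illustrative sketch: hierarchical clustering under different linkage choices
dat <- scale(USArrests)                       # standardize the features
d   <- dist(dat, method = "euclidean")        # pairwise dissimilarities (Step 1)

hc_complete <- hclust(d, method = "complete") # maximal inter-cluster dissimilarity
hc_single   <- hclust(d, method = "single")   # minimal inter-cluster dissimilarity
hc_average  <- hclust(d, method = "average")  # mean inter-cluster dissimilarity

plot(hc_complete)                             # draw the dendrogram
```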
---

### Linkage matters

.center[
<img src = "Figures/linkage.png" width=1000>
]

---

### Small Decisions, Big Consequences

In order to perform clustering, some decisions must be made:

--

- **_Should the observations or features first be standardized in some way?_**

--

- In the case of **hierarchical clustering**:

    - _What dissimilarity measure should be used?_

    - _What type of linkage should be used?_

    - _Where should we cut the dendrogram in order to obtain clusters?_

--

- In the case of **K-means clustering**:

    - _How many clusters should we look for in the data?_
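---

### These decisions in code (sketch)

A brief sketch of where these decisions show up when clustering in R — the data set, the cut at four clusters, and the range of candidate `\(K\)` values are all arbitrary illustrations:

```r
# Illustrative sketch of the decision points above (all choices are arbitrary)
set.seed(123)
dat <- scale(USArrests)                       # decision: standardize the features

# Hierarchical clustering: dissimilarity measure, linkage, and where to cut
hc <- hclust(dist(dat, method = "euclidean"), method = "complete")
cutree(hc, k = 4)                             # cut the dendrogram into 4 clusters

# K-means: how many clusters? Inspect the total within-cluster variation
# across candidate values of K (an "elbow" plot is one common heuristic)
wss <- sapply(1:10, function(k) kmeans(dat, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "K", ylab = "Total within-cluster variation")
```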