PPOL564 - Data Science I: Foundations

Lecture 17

Variance-Covariance and Correlation Matrices

Concepts covered:

Review of two matrices that appear regularly in applied data science and statistics:

  • variance-covariance matrix
  • correlation matrix
In [1]:
import numpy as np
from numpy import linalg as la
import pandas as pd
import requests

# Read in Visualization code from Github (requires bokeh module)
exec(requests.get('https://raw.githubusercontent.com/edunford/ppol564/master/lectures/visualization_library/visualize.py').content)
vla = LinearAlgebra # assign the class to a simpler name.
plot = vla()
Loading BokehJS ...

Simulate Data

In [2]:
np.random.seed(123) # To reproduce simulation 
N = 10              # Number of observations
P = 2               # Number of variables
X = np.random.normal(3,1,N*P).reshape(N,P).round(2) # Draw from a normal distribution

print('The dimensions of X\n',X.shape)
print('\nX\n',X)
The dimensions of X
 (10, 2)

X
 [[1.91 4.  ]
 [3.28 1.49]
 [2.42 4.65]
 [0.57 2.57]
 [4.27 2.13]
 [2.32 2.91]
 [4.49 2.36]
 [2.56 2.57]
 [5.21 5.19]
 [4.   3.39]]
In [3]:
print("Visual representation of our simulated data as vectors\n")
plot.graph(10)
plot.vec_w_point(X)
plot.show()
Visual representation of our simulated data as vectors

Variance-Covariance Matrix

Covariance is the average product of the deviations of two (or more) variables from their respective means.



$$ cov(x,y) = \frac{\sum^{N}_{i=1} (x_i - \bar{x})(y_i - \bar{y})}{N-1} = \Sigma $$



As we saw before, we can express the above equation in matrix form.



$$\Sigma = \frac{1}{N-1} (\textbf{X} - \bar{\textbf{X}})^T(\textbf{X} - \bar{\textbf{X}}) $$



Where $\textbf{X}_{n \times p}$ is our data matrix and $\frac{1}{N-1}$ is a scalar.

Let's unpack this...

Mean

The mean is the central tendency of our vector of points. Spatially, we can think of the mean as the "centroid" (the central point within a scatter of points).



In vector form, $$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} $$



In matrix form, $$ \bar{\textbf{x}} = \frac{1}{N}\textbf{1}^{T}\textbf{X}$$ where $\textbf{1}$ is an $N \times 1$ vector of ones.



In [4]:
X_means = (1/N)*(np.ones(N).dot(X)) # Means of each column vector
X_means
Out[4]:
array([3.103, 3.126])
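As a quick sanity check (my addition, not a cell from the original notebook), the same column means can be obtained with NumPy's built-in routine:

np.allclose(X_means, X.mean(axis=0))  # the matrix-form means should agree with np.mean along axis 0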
In [5]:
plot.vec_w_point(X_means,color="blue",alpha=.3)
plot.show()

Deviations from the mean

$$ \textbf{X} - \bar{\textbf{x}}$$

In [6]:
X_dev = (X - X_means)  
X_dev
Out[6]:
array([[-1.193,  0.874],
       [ 0.177, -1.636],
       [-0.683,  1.524],
       [-2.533, -0.556],
       [ 1.167, -0.996],
       [-0.783, -0.216],
       [ 1.387, -0.766],
       [-0.543, -0.556],
       [ 2.107,  2.064],
       [ 0.897,  0.264]])
In [7]:
plot.clear().graph()
plot.vec_w_point(X_dev,color="orange",alpha=.4)
plot.show()

We can see that centering:

  1. Sets the mean (or central location) to 0 (checked numerically just below).
  2. Makes the length of each vector represent its distance from the mean (i.e. from that centroid location).
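A minimal numerical check of both points (my addition, not from the original notebook):

X_dev.mean(axis=0)      # column means of the centered data are ~0 (up to floating-point noise)
la.norm(X_dev, axis=1)  # length (Euclidean norm) of each centered vector = distance from the centroid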

Average squared deviations from the mean



$$ \frac{1}{N-1} (\textbf{X} - \bar{\textbf{x}})^T(\textbf{X} - \bar{\textbf{x}}) $$



$$ \frac{1}{N-1} \textbf{X}_{centered}^T\textbf{X}_{centered}$$



$$\Sigma$$



In [8]:
squared_dev = X_dev.T.dot(X_dev)
sigma = squared_dev*(1/(N-1))
sigma
Out[8]:
array([[1.97497889, 0.20745778],
       [0.20745778, 1.37071556]])
In [9]:
plot.clear().graph()
plot.vec_w_point(sigma,color="red",alpha=.3)
plot.show()

The distance we move along each of the two axes is the variance of the corresponding variable. This information is captured along the diagonal of the matrix.

In [10]:
sigma.diagonal()
Out[10]:
array([1.97497889, 1.37071556])
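These diagonal entries are simply the sample variances of each column, so they should match NumPy's per-column variance computed with the same N-1 adjustment (ddof=1). A quick check, added here for illustration:

np.var(X, axis=0, ddof=1)  # should equal sigma.diagonal()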
In [11]:
var = np.diag(sigma.diagonal())
plot.vec_w_point(var,alpha=1)
plot.show()

The off-diagonals capture the length of the projection of one variable onto the other, i.e. the degree to which the two variables move in the same direction.

In [12]:
plot.clear().graph()
plot.vec_w_point(sigma,color="red",alpha=.3)
plot.projection(sigma[0],sigma[1])
plot.show()
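The off-diagonal entry is just the scalar covariance formula applied to the two centered columns. A short verification sketch (my addition):

X_dev[:, 0].dot(X_dev[:, 1]) / (N - 1)  # sum of products of deviations over N-1 -> ~0.207, i.e. sigma[0,1]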

All in one step

In [13]:
N = X.shape[0]                         # Number of observations
X_means = (np.ones(N).dot(X))/N        # Means of each column vector
X_dev = (X - X_means)                  # Deviations from the mean
squared_dev = X_dev.T.dot(X_dev)       # Squared Deviations from the mean
sigma = (1/(N-1))*squared_dev          # Average (with df adjustment)
sigma.shape 
Out[13]:
(2, 2)
In [14]:
print("Our Variance-Covariance Matrix\n")
print(sigma.round(3))
Our Variance-Covariance Matrix

[[1.975 0.207]
 [0.207 1.371]]

Using Numpy

In [15]:
np.cov(X.T).round(3)
Out[15]:
array([[1.975, 0.207],
       [0.207, 1.371]])
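Note that np.cov treats rows as variables by default, which is why we pass X.T above. Passing rowvar=False lets us hand it the data matrix as-is:

np.cov(X, rowvar=False).round(3)  # same variance-covariance matrix as np.cov(X.T)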

Correlation Matrix

A correlation matrix is similar to our variance-covariance matrix except that we standardize the deviations from the mean by dividing each variable by its respective standard deviation. This effectively puts the deviations of each variable onto the same scale.



$$ corr(x,y) = \frac{1}{N-1}\frac{\sum^{N}_{i=1} (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} = \rho $$



In matrix form,



$$\rho = \frac{1}{N-1} \frac{(\textbf{X} - \bar{\textbf{x}})^T(\textbf{X} - \bar{\textbf{x}})}{\sqrt{diag(\Sigma)}} $$



Note: Recall that the diagonal of the covariance matrix corresponds to the variances. When we take the square root of a variance, we get the standard deviation. $\sqrt{diag(\Sigma)}$ is merely the matrix expression of this calculation.
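In code, this amounts to dividing each entry of $\Sigma$ by the product of the two relevant standard deviations. A compact sketch of that calculation (my addition, assuming sigma from above is still in memory):

sd = np.sqrt(sigma.diagonal())       # standard deviations sit on the diagonal of sigma
(sigma / np.outer(sd, sd)).round(3)  # divide each entry by sd_x * sd_y -> correlation matrix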

What does this mean?

Like before, we are interested in the deviations from the mean...

In [16]:
plot.clear().graph()
plot.vec_w_point(X_dev,color="orange",alpha=.2)
plot.show()

But now we are going to standardize these deviations so that they are expressed in standard-deviation units (i.e. standard deviations from the mean).

In [17]:
X_std = X_dev/np.sqrt(sigma.diagonal())
X_std
Out[17]:
array([[-0.84890524,  0.74651367],
       [ 0.12594822, -1.39736425],
       [-0.48600359,  1.30170117],
       [-1.80241155, -0.47489885],
       [ 0.83040437, -0.85071809],
       [-0.55716077, -0.18449308],
       [ 0.98695018, -0.65426713],
       [-0.38638353, -0.47489885],
       [ 1.49928193,  1.76293387],
       [ 0.63827997,  0.22549154]])
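Because each column has been divided by its own standard deviation, the standardized columns should now have a sample standard deviation of (roughly) 1. A quick check, added for illustration:

X_std.std(axis=0, ddof=1)  # ~[1., 1.]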
In [18]:
plot.clear().graph()
plot.vec_w_point(X_std,color="green",alpha=.2)
plot.show()

Adding this standardization step, let's put it all together...

In [19]:
N = X.shape[0]                         # Number of observations
X_means = (np.ones(N).dot(X))/N        # Means of each column vector
X_std = np.sqrt(sigma.diagonal())      # Standard deviations of each column vector
X_dev = (X - X_means)                  # Deviations from the mean
X_std = X_dev/X_std                    # Standardize those deviations from the mean
squared_dev = X_std.T.dot(X_std)       # Squared Deviations from the mean
corr_matrix = (1/(N-1))*squared_dev    # Average (with df adjustment)

# Print the Matrix
print("Our Correlation Matrix")
print(corr_matrix.round(3))
Our Correlation Matrix
[[1.    0.126]
 [0.126 1.   ]]

Using Numpy

In [20]:
np.corrcoef(X.T).round(3)
Out[20]:
array([[1.   , 0.126],
       [0.126, 1.   ]])
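Since pandas is already imported, the same matrix can also be recovered from a DataFrame; .corr() defaults to Pearson correlation (the column labels below are arbitrary):

pd.DataFrame(X, columns=['x', 'y']).corr().round(3)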

Another way to think about this

We can think of the correlation as the cosine of the angle between the centered versions of our two variables.

In [21]:
x = X[:,0] 
y = X[:,1] 

x = x - x.mean()
y = y - y.mean()

print("\nX and Y centered\n")
print(x)
print(y)

# Cosine equation
cosine = x.dot(y)/(la.norm(x)*la.norm(y))

print("\nAngle between the our centered versions of X and Y\n")
print(cosine.round(3))
X and Y centered

[-1.193  0.177 -0.683 -2.533  1.167 -0.783  1.387 -0.543  2.107  0.897]
[ 0.874 -1.636  1.524 -0.556 -0.996 -0.216 -0.766 -0.556  2.064  0.264]

Cosine of the angle between our centered versions of X and Y

0.126
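If we want the angle itself rather than its cosine, we can invert the cosine. A small follow-up sketch (my addition):

np.degrees(np.arccos(cosine))  # roughly 83 degrees: the centered vectors are nearly orthogonal, hence the weak correlation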