# PPOL564 - Data Science I: Foundations

## Concepts covered:¶

Review of two matrices that appear a regularly in applied data science and statistics:

• variance-covariance matrix
• correlation matrix
In :
import numpy as np
from numpy import linalg as la
import pandas as pd
import requests

# Read in Visualization code from Github (requires bokeh module)
exec(requests.get('https://raw.githubusercontent.com/edunford/ppol564/master/lectures/visualization_library/visualize.py').content)
vla = LinearAlgebra # assign class to an simplier naming convention.
plot = vla()


## Simulate Data¶

In :
np.random.seed(123) # To reproduce simulation
N = 10              # Number of observations
P = 2               # Number of variables
X = np.random.normal(3,1,N*P).reshape(N,P).round(2) # Draw from a normal distribution

print('The dimensions of X\n',X.shape)
print('\nX\n',X)

The dimensions of X
(10, 2)

X
[[1.91 4.  ]
[3.28 1.49]
[2.42 4.65]
[0.57 2.57]
[4.27 2.13]
[2.32 2.91]
[4.49 2.36]
[2.56 2.57]
[5.21 5.19]
[4.   3.39]]

In :
print("Visual representation of our simulated data as vectors\n")
plot.graph(10)
plot.vec_w_point(X)
plot.show()

Visual representation of our simulated data as vectors



## Variance-Covariance Matrix¶

Covariance is the sum of the squared deviations of two (or more) variables from their respective the means.

$$cov(x,y) = \frac{\sum^{N}_{i=1} (x_i - \bar{x})(y_i - \bar{y})}{N-1} = \Sigma$$

As we saw before, we can express the above equation in matrix form.

$$\Sigma = \frac{1}{N-1} (\textbf{X} - \bar{\textbf{X}})^T(\textbf{X} - \bar{\textbf{X}})$$

Where $\textbf{X}_{n \times p}$ is our data matrix and $\frac{1}{N-1}$ is a scalar.

## Let's unpack this...¶

### Mean¶

The mean is the central tendency of our vector of points. Spatially, we can think of the mean as the "centroid" (the central point within a scatter of points)

In vector form, $$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$

In matrix form, $$\bar{\textbf{x}} = \frac{1}{N}\textbf{1}\textbf{X}$$

In :
X_means = (1/N)*(np.ones(N).dot(X)) # Means of each column vector
X_means

Out:
array([3.103, 3.126])
In :
plot.vec_w_point(X_means,color="blue",alpha=.3)
plot.show()


### Deviations from the mean¶

$$\textbf{X} - \bar{\textbf{x}}$$

In :
X_dev = (X - X_means)
X_dev

Out:
array([[-1.193,  0.874],
[ 0.177, -1.636],
[-0.683,  1.524],
[-2.533, -0.556],
[ 1.167, -0.996],
[-0.783, -0.216],
[ 1.387, -0.766],
[-0.543, -0.556],
[ 2.107,  2.064],
[ 0.897,  0.264]])
In :
plot.clear().graph()
plot.vec_w_point(X_dev,color="orange",alpha=.4)
plot.show()


We can see that that centering:

1. Sets the mean (or central location) to 0.
2. The lengths of each vector represent the distance from the mean (i.e. that centroid location).

### Average squared deviations from the mean¶

$$\frac{1}{N-1} (\textbf{X} - \bar{\textbf{x}})^T(\textbf{X} - \bar{\textbf{x}})$$

$$\frac{1}{N-1} \textbf{X}_{centered}^T\textbf{X}_{centered}$$

$$\Sigma$$

In :
squared_dev = X_dev.T.dot(X_dev)
sigma = squared_dev*(1/(N-1))
sigma

Out:
array([[1.97497889, 0.20745778],
[0.20745778, 1.37071556]])
In :
plot.clear().graph()
plot.vec_w_point(sigma,color="red",alpha=.3)
plot.show()


The distance in which we move in along the two axes, is the variance of our two variables. This information is captured along the diagonal axis.

In :
sigma.diagonal()

Out:
array([1.97497889, 1.37071556])
In :
var = np.diag(sigma.diagonal())
plot.vec_w_point(var,alpha=1)
plot.show()


The off diagonals capture the length of the projection of the two variable, i.e. the degree the two variables move in the same direction.

In :
plot.clear().graph()
plot.vec_w_point(sigma,color="red",alpha=.3)
plot.projection(sigma,sigma)
plot.show()


#### All in one step¶

In :
N = X.shape                         # Number of observations
X_means = (np.ones(N).dot(X))/N        # Means of each column vector
X_dev = (X - X_means)                  # Deviations from the mean
squared_dev = X_dev.T.dot(X_dev)       # Squared Deviations from the mean
sigma = (1/(N-1))*squared_dev          # Average (with df adjustment)
sigma.shape

Out:
(2, 2)
In :
print("Our Variance-Covariance Matrix\n")
print(sigma.round(3))

Our Variance-Covariance Matrix

[[1.975 0.207]
[0.207 1.371]]


#### Using Numpy¶

In :
np.cov(X.T).round(3)

Out:
array([[1.975, 0.207],
[0.207, 1.371]])

## Correlation Matrix¶

A correlation matrix is similar to our variance-covariance matrix except that we standardize the deviations from the mean by dividing each variable by it's respective standard deviation. This effectively puts the deviations of each variable onto the same scale.

$$corr(x,y) = \frac{1}{N-1}\frac{\sum^{N}_{i=1} (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} = \rho$$

In matrix form,

$$\rho = \frac{1}{N-1} \frac{(\textbf{X} - \bar{\textbf{x}})^T(\textbf{X} - \bar{\textbf{x}})}{\sqrt{diag(\Sigma)}}$$

Note: Recall that the diagonal of covariance matrix corresponds with the variances. When we take the square root of the variance, we get the standard deviation. $\sqrt{diag(\Sigma)}$ is merely the matrix expression of this calculation.

### What this means?¶

Like before, we are interested in the deviations from the mean...

In :
plot.clear().graph()
plot.vec_w_point(X_dev,color="orange",alpha=.2)
plot.show()


But now were are going to standardize these deviations so that they are in terms of standard deviations (i.e. standard deviations from the mean)

In :
X_std = X_dev/np.sqrt(sigma.diagonal())
X_std

Out:
array([[-0.84890524,  0.74651367],
[ 0.12594822, -1.39736425],
[-0.48600359,  1.30170117],
[-1.80241155, -0.47489885],
[ 0.83040437, -0.85071809],
[-0.55716077, -0.18449308],
[ 0.98695018, -0.65426713],
[-0.38638353, -0.47489885],
[ 1.49928193,  1.76293387],
[ 0.63827997,  0.22549154]])
In :
plot.clear().graph()
plot.vec_w_point(X_std,color="green",alpha=.2)
plot.show()


Add this step again, and put it all together...

In :
N = X.shape                         # Number of observations
X_means = (np.ones(N).dot(X))/N        # Means of each column vector
X_std = np.sqrt(sigma.diagonal())      # Standard deviations of each column vector
X_dev = (X - X_means)                  # Deviations from the mean
X_std = X_dev/X_std                    # Standardize those deviations from the mean
squared_dev = X_std.T.dot(X_std)       # Squared Deviations from the mean
corr_matrix = (1/(N-1))*squared_dev    # Average (with df adjustment)

# Print the Matrix
print("Our Correlation Matrix")
print(corr_matrix.round(3))

Our Correlation Matrix
[[1.    0.126]
[0.126 1.   ]]


#### Using Numpy¶

In :
np.corrcoef(X.T).round(3)

Out:
array([[1.   , 0.126],
[0.126, 1.   ]])

We can think of the correlation is as the angle between the centered version of our two variables.

In :
x = X[:,0]
y = X[:,1]

x = x - x.mean()
y = y - y.mean()

print("\nX and Y centered\n")
print(x)
print(y)

# Cosine equation
cosine = x.dot(y)/(la.norm(x)*la.norm(y))

print("\nAngle between the our centered versions of X and Y\n")
print(cosine.round(3))

X and Y centered

[-1.193  0.177 -0.683 -2.533  1.167 -0.783  1.387 -0.543  2.107  0.897]
[ 0.874 -1.636  1.524 -0.556 -0.996 -0.216 -0.766 -0.556  2.064  0.264]

Angle between the our centered versions of X and Y

0.126