Review of two matrices that appear regularly in applied data science and statistics:
import numpy as np
from numpy import linalg as la
import pandas as pd
import requests
# Read in Visualization code from Github (requires bokeh module)
exec(requests.get('https://raw.githubusercontent.com/edunford/ppol564/master/lectures/visualization_library/visualize.py').content)
vla = LinearAlgebra # assign the class to a simpler name.
plot = vla()
np.random.seed(123) # To reproduce simulation
N = 10 # Number of observations
P = 2 # Number of variables
X = np.random.normal(3,1,N*P).reshape(N,P).round(2) # Draw from a normal distribution
print('The dimensions of X\n',X.shape)
print('\nX\n',X)
print("Visual representation of our simulated data as vectors\n")
plot.graph(10)
plot.vec_w_point(X)
plot.show()
Covariance is the average of the products of the deviations of two (or more) variables from their respective means.
$$ cov(x,y) = \frac{\sum^{N}_{i=1} (x_i - \bar{x})(y_i - \bar{y})}{N-1} = \Sigma $$
As we saw before, we can express the above equation in matrix form.
$$\Sigma = \frac{1}{N-1} (\textbf{X} - \bar{\textbf{X}})^T(\textbf{X} - \bar{\textbf{X}}) $$
Where $\textbf{X}_{n \times p}$ is our data matrix and $\frac{1}{N-1}$ is a scalar.
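To make the equivalence concrete, here is a minimal sketch (the variable names are just illustrative; it uses only the simulated `X` and `N` from above) computing the covariance of the two columns with the summation formula and again with the matrix formula:
x, y = X[:,0], X[:,1]
cov_scalar = ((x - x.mean())*(y - y.mean())).sum()/(N-1) # Summation form for one pair of variables
cov_matrix = (X - X.mean(axis=0)).T.dot(X - X.mean(axis=0))/(N-1) # Matrix form for all pairs at once
print(cov_scalar.round(3))
print(cov_matrix[0,1].round(3)) # The same value appears in the off-diagonal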
The mean is the central tendency of our vector of points. Spatially, we can think of the mean as the "centroid" (the central point within a scatter of points).
In vector form, $$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} $$
In matrix form, $$ \bar{\textbf{x}} = \frac{1}{N}\textbf{1}^T\textbf{X}$$
X_means = (1/N)*(np.ones(N).dot(X)) # Means of each column vector
X_means
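As a quick sanity check (a small sketch, no new assumptions), the $\frac{1}{N}\textbf{1}^T\textbf{X}$ formulation should reproduce NumPy's built-in column means:
print(X.mean(axis=0)) # NumPy's column means
np.allclose(X_means, X.mean(axis=0)) # Should be True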
plot.vec_w_point(X_means,color="blue",alpha=.3)
plot.show()
Next, we center the data by subtracting the column means from each observation:
$$ \textbf{X} - \bar{\textbf{x}}$$
X_dev = (X - X_means)
X_dev
plot.clear().graph()
plot.vec_w_point(X_dev,color="orange",alpha=.4)
plot.show()
Once the data are centered, the variance-covariance matrix is simply the cross-product of the centered data with itself, scaled by $\frac{1}{N-1}$:
$$\Sigma = \frac{1}{N-1} (\textbf{X} - \bar{\textbf{x}})^T(\textbf{X} - \bar{\textbf{x}}) = \frac{1}{N-1} \textbf{X}_{centered}^T\textbf{X}_{centered}$$
squared_dev = X_dev.T.dot(X_dev)
sigma = squared_dev*(1/(N-1))
sigma
plot.clear().graph()
plot.vec_w_point(sigma,color="red",alpha=.3)
plot.show()
The distance we move along each of the two axes is the variance of the corresponding variable. This information is captured along the diagonal of the matrix.
sigma.diagonal()
var = np.diag(sigma.diagonal())
plot.vec_w_point(var,alpha=1)
plot.show()
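As a check, the diagonal of $\Sigma$ should match NumPy's per-column sample variances (a small sketch, using `ddof=1` for the same $N-1$ adjustment):
print(sigma.diagonal().round(3)) # Variances from the covariance matrix
print(X.var(axis=0, ddof=1).round(3)) # Sample variances computed directly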
The off-diagonals capture the length of the projection of one variable onto the other, i.e. the degree to which the two variables move in the same direction.
plot.clear().graph()
plot.vec_w_point(sigma,color="red",alpha=.3)
plot.projection(sigma[0],sigma[1])
plot.show()
N = X.shape[0] # Number of observations
X_means = (np.ones(N).dot(X))/N # Means of each column vector
X_dev = (X - X_means) # Deviations from the mean
squared_dev = X_dev.T.dot(X_dev) # Sum of squared deviations and cross-products
sigma = (1/(N-1))*squared_dev # Average (with df adjustment)
sigma.shape
print("Our Variance-Covariance Matrix\n")
print(sigma.round(3))
# Verify against NumPy's built-in covariance function
np.cov(X.T).round(3)
A correlation matrix is similar to our variance-covariance matrix except that we standardize the deviations from the mean by dividing each variable by its respective standard deviation. This effectively puts the deviations of each variable onto the same scale.
$$ corr(x,y) = \frac{1}{N-1}\frac{\sum^{N}_{i=1} (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} = \rho $$
In matrix form,
$$\rho = \frac{1}{N-1} \frac{(\textbf{X} - \bar{\textbf{x}})^T(\textbf{X} - \bar{\textbf{x}})}{\sqrt{diag(\Sigma)}} $$
Note: Recall that the diagonal of the covariance matrix corresponds with the variances. When we take the square root of the variance, we get the standard deviation. $\sqrt{diag(\Sigma)}$ is merely the matrix expression of this calculation.
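Equivalently, we can build the correlation matrix straight from the covariance matrix by scaling it on both sides by the inverse standard deviations (a small sketch assuming the `sigma` computed above; the name `D_inv` is just illustrative):
D_inv = np.diag(1/np.sqrt(sigma.diagonal())) # Diagonal matrix of 1/standard deviation
rho = D_inv.dot(sigma).dot(D_inv) # Rescale the rows and columns of the covariance matrix
print(rho.round(3))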
Like before, we are interested in the deviations from the mean...
plot.clear().graph()
plot.vec_w_point(X_dev,color="orange",alpha=.2)
plot.show()
But now we are going to standardize these deviations so that they are in terms of standard deviations (i.e. standard deviations from the mean).
X_std = X_dev/np.sqrt(sigma.diagonal())
X_std
plot.clear().graph()
plot.vec_w_point(X_std,color="green",alpha=.2)
plot.show()
Now let's add this step and put it all together...
N = X.shape[0] # Number of observations
X_means = (np.ones(N).dot(X))/N # Means of each column vector
X_dev = (X - X_means) # Deviations from the mean
sds = np.sqrt(sigma.diagonal()) # Standard deviations of each column vector (from the covariance matrix above)
X_std = X_dev/sds # Standardize those deviations from the mean
squared_dev = X_std.T.dot(X_std) # Cross-products of the standardized deviations
corr_matrix = (1/(N-1))*squared_dev # Average (with df adjustment)
# Print the Matrix
print("Our Correlation Matrix")
print(corr_matrix.round(3))
# Verify against NumPy's built-in correlation function
np.corrcoef(X.T).round(3)
We can think of the correlation as the cosine of the angle between the centered versions of our two variables.
x = X[:,0]
y = X[:,1]
x = x - x.mean()
y = y - y.mean()
print("\nX and Y centered\n")
print(x)
print(y)
# Cosine equation
cosine = x.dot(y)/(la.norm(x)*la.norm(y))
print("\nAngle between the our centered versions of X and Y\n")
print(cosine.round(3))
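To close the loop, this cosine should match the off-diagonal entry of the correlation matrix we computed above (a quick check, no new assumptions):
print(corr_matrix[0,1].round(3)) # Correlation from our matrix calculation
print(np.corrcoef(X.T)[0,1].round(3)) # Correlation from NumPy's built-in function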