PPOL564 - Data Science I: Foundations

Lecture 12

Trigonometry of Vectors

Concepts For today:

  • Vector Dot Product
  • Orthogonality
  • Projection
  • Normalizing vectors
  • Example: comparing texts
In [1]:
import numpy as np
import pandas as pd
import requests

# Read in Visualization code from Github (requires bokeh module)
exec(requests.get('https://raw.githubusercontent.com/edunford/ppol564/master/lectures/visualization_library/visualize.py').content)
vla = LinearAlgebra # assign the class to a simpler name.
Loading BokehJS ...

Vector Multiplication (Vector Dot Product)

Given $\vec{a}, \vec{b} \in \Re^n$

$$ \vec{a} \cdot \vec{b} $$

$$ \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \cdot \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}$$

$$ a_1 b_1 + a_2 b_2 + \dots + a_n b_n $$

$$ \vec{a} \cdot \vec{b} = \sum_{i=1}^n a_i b_i$$

The dot product between two column vectors produces a scalar ($c$).

In [2]:
# Computationally 
a = np.array([1,2])
b = np.array([2,1])

# Three ways to take the dot product using numpy
print(a.dot(b))

# or 

print(np.dot(a,b))

# or

print(a @ b)
4
4
4

Properties

Property Expression
Commutative $\vec{a} \cdot \vec{b} = \vec{b} \cdot \vec{a} $
Distributive $ \vec{v} \cdot (\vec{a} + \vec{b}) = \vec{v} \cdot \vec{a} + \vec{v} \cdot \vec{b} $
Associative (with scalar multiplication) $ c(\vec{a}) \cdot \vec{b} = c(\vec{a} \cdot \vec{b}) $
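
As a quick sanity check, here is a small sketch that verifies each property numerically, reusing the vectors a and b from above and introducing a helper vector v just for the distributive check:

v = np.array([3, 5])

# Commutative: a·b equals b·a
print(np.dot(a, b) == np.dot(b, a))                        # True

# Distributive: v·(a + b) equals v·a + v·b
print(np.dot(v, a + b) == np.dot(v, a) + np.dot(v, b))     # True

# Associative with a scalar: (2a)·b equals 2(a·b)
print(np.dot(2 * a, b) == 2 * np.dot(a, b))                # True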

Magnitude (Length) of a Vector

What is the length of $\vec{c}$?

In linear algebra, we'll use the notation $||c||$ to denote the length of a vector.

In [3]:
# Vector a
c = np.array([1,2])

# Plot the vector
plot = vla()
plot.graph()
plot.vector(c)
plot.show()

Recall the Pythagorean Theorem

$$ a^2 + b^2 = c^2 $$

If we can express the vector $\vec{c}$ as a triangle, then we can use the Pythagorean theorem to calculate the length of $\vec{c}$.

Now recall our discussion of "unit vectors"

In [4]:
i = np.array([1,0])
j = np.array([0,1])

(1*i) + (2*j) 
Out[4]:
array([1, 2])
In [5]:
# Create our scaled unit vectors
a = i
b = 2*j

plot.vector(a)
plot.change_origin(i)
plot.vector(b)
plot.show()

Again let's express the above in terms of the Pythagorean Theorem.

$$ \left\| a \right\|^2 + \left\| b \right\|^2 = \left\| c \right\|^2 $$

We can see that if we add the squared lengths of the two unit vectors and then take the square root of that sum, we will get the length of $\vec{c}$.

$$ \sqrt{\left\| a \right\|^2 + \left\| b \right\|^2} = \left\| c \right\| $$

In [6]:
a.dot(a) + b.dot(b) # = ||c||^2
Out[6]:
5
In [7]:
np.sqrt(a.dot(a) + b.dot(b)) # = ||c||
Out[7]:
2.23606797749979

What's critical to note here is that if we take the dot product of $\vec{c}$ with itself we get $||c||^2$, which offers a handy way of calculating the length of $\vec{c}$.

$$ \vec{c} \cdot \vec{c} $$


$$ \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix} \cdot \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}$$


$$ c_1 c_1 + c_2 c_2 + \dots + c_n c_n $$


$$ c_1^2 + c_2^2 + \dots + c_n^2 = \left\| c \right\|^2 $$


$$ \sqrt{\left\| c \right\|^2} = \left\| c \right\| $$


Put concisely,

$$ \sqrt{\vec{c} \cdot \vec{c}} = \left\| c \right\| $$

For example,

$$ \vec{c} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} $$

$$ \begin{bmatrix} 1 \\ 2 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 2 \\ \end{bmatrix}$$

$$ 1(1) + 2(2) $$

$$ 1 + 4 $$

$$ \left\| c \right\|^2 = 5$$

$$ \sqrt{\left\| c \right\|^2} = \sqrt{5}$$

$$ \left\| c \right\| = 2.24 $$

In [8]:
# The above via numpy. Note that it matches our earlier calculation.
np.sqrt(c.dot(c))
Out[8]:
2.23606797749979

Numpy offers a quick method for calculating the magnitude of a vector in its linalg submodule.

In [9]:
np.linalg.norm(c)
Out[9]:
2.23606797749979

Angles between Vectors

In [10]:
a = np.array([4,1])
b = np.array([1,2])

plot.clear().graph(7)
plot.vector(a)
plot.vector(b)
plot.show()

Law of Cosines

Recall the Law of Cosines from trigonometry. It can be thought of as a generalized version of the Pythagorean Theorem.

$$ c^2 = a^2 + b^2 - 2ab\cos{\theta} $$

For example, when the angle ($\theta$) between $a$ and $b$ is 90 degrees, $\cos{\theta}$ equals $0$ and the equation reduces to the Pythagorean Theorem.

$$ c^2 = a^2 + b^2 - 0 $$

Considering the two vectors presented above, let's subtract the two to produce a third vector $\vec{c}$. This will give us a triangle.

In [11]:
plot.subtract_vectors(a,b)
plot.show()

Applying the Law of Cosines to these vectors, something interesting pops out!


$$ \left\| c \right\|^2 = \left\| a \right\|^2 + \left\| b \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta} $$


$$ \left\| a - b \right\|^2 = \left\| a \right\|^2 + \left\| b \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta} $$


$$ (\vec{a} - \vec{b}) \cdot (\vec{a} - \vec{b}) = \left\| a \right\|^2 + \left\| b \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta} $$


$$ \vec{a}\cdot\vec{a} - 2(\vec{a} \cdot \vec{b}) + \vec{b}\cdot\vec{b} = \left\| a \right\|^2 + \left\| b \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta} $$


$$ \left\| a \right\|^2 - 2(\vec{a} \cdot \vec{b}) + \left\| b \right\|^2 = \left\| a \right\|^2 + \left\| b \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta} $$


$$ - 2(\vec{a} \cdot \vec{b}) = - 2\left\| a \right\| \left\| b \right\|\cos{\theta} $$


$$ \vec{a} \cdot \vec{b} = \left\| a \right\| \left\| b \right\|\cos{\theta} $$

In words, the dot product of two vectors is equal to the product of their lengths times the cosine of the angle between them.
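
As a quick numerical check on this identity, the sketch below uses the vectors a and b defined above and verifies the equivalent form $\left\| a - b \right\|^2 = \left\| a \right\|^2 + \left\| b \right\|^2 - 2(\vec{a} \cdot \vec{b})$:

# Both sides should print the same value (10 for these vectors)
lhs = np.dot(a - b, a - b)                            # ||a - b||^2
rhs = np.dot(a, a) + np.dot(b, b) - 2 * np.dot(a, b)  # law of cosines with a·b substituted in
print(lhs, rhs)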

Triangle Inequality

An important rule to keep in mind: the sum of the lengths of any two sides of a triangle must always be greater than or equal to the length of the third side. In linear algebra, we take this important property from trigonometry and apply it to $N$-dimensional space.

$$ \left\| \vec{a} + \vec{b} \right\| \le \left\| \vec{a} \right\| + \left\|\vec{b} \right\| $$
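
A small sketch illustrating the inequality with the vectors a and b defined above:

print(np.linalg.norm(a + b))                   # ||a + b||
print(np.linalg.norm(a) + np.linalg.norm(b))   # ||a|| + ||b||, always at least as large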

Orthogonal Vectors

Again, when the angle between two vectors is 90 degrees (i.e. when the vectors are perpendicular to one another), $\cos{\theta} = 0$.

In [12]:
a = np.array([4,0])
b = np.array([0,5])

plot.clear().graph(10)
plot.vector(a)
plot.vector(b)
plot.show()

$$ \vec{a} \cdot \vec{b} = \left\| a \right\| \left\| b \right\|\cos{90^{\circ}}$$

$$ \vec{a} \cdot \vec{b} = \left\| a \right\| \left\| b \right\| \cdot 0$$

$$ \vec{a} \cdot \vec{b} = 0 $$

In [13]:
np.dot(a,b)
Out[13]:
0

Thus, when we take the dot product between two vectors and the resulting value is 0, we know that the two vectors are orthogonal to one another.

In [14]:
a = np.array([4,1])
b = np.array([1,5])

plot.clear().graph(10)
plot.vector(a)
plot.vector(b)
plot.show()
In [15]:
np.dot(a,b)
Out[15]:
9

Calculating the cosine

Taking what we now know, we can easily calculate the angle between two vectors!


$$ \vec{a} \cdot \vec{b} = \left\| a \right\| \left\| b \right\|\cos{\theta}$$


$$ \cos{\theta} = \frac{\vec{a} \cdot \vec{b}}{\left\| a \right\| \left\| b \right\|}$$

In [16]:
# Let's define the above as a function:

def cosine(a,b):
    cos = np.dot(a,b)/(np.sqrt(np.dot(a,a)) * np.sqrt(np.dot(b,b))  )
    return cos
In [17]:
round(cosine(a,b),3)
Out[17]:
0.428
In [18]:
# Let's reverse engineer this to get the dot product again.
np.sqrt(a.dot(a)) * np.sqrt(b.dot(b)) * cosine(a,b)
Out[18]:
9.0
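
As an aside, once we have the cosine we can also recover the angle itself using np.arccos; for example:

theta_radians = np.arccos(cosine(a, b))    # angle in radians
theta_degrees = np.degrees(theta_radians)  # convert to degrees
print(round(theta_degrees, 1))             # roughly 64.7 degrees for these vectors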

Dot product as a projection

So what is the dot product between two vectors exactly?

We can think of the dot product as capturing how much two vectors move in the same direction. To build this intuition, consider the "shadow" that one vector casts onto the other.

Imagine we cast a vector down from the tip of $\vec{a}$ onto $\vec{b}$ such that the angle between that vector (which we'll call $\vec{v}$) and vector $\vec{b}$ is orthogonal (90 degrees).

In [19]:
plot.clear().graph(8)
plot.projection(a,b)
plot.show()

What is the size (magnitude) of the "shadow" that $\vec{a}$ casts onto $\vec{b}$?


$$ \vec{v} = \vec{a} - c\vec{b} $$


By design, $\vec{b} \cdot \vec{v} = 0$


$$ (\vec{a}-c\vec{b}) \cdot \vec{b} = 0 $$


$$ \vec{a} \cdot \vec{b} -c\vec{b} \cdot \vec{b} = 0 $$


$$ \vec{a} \cdot \vec{b} = c\vec{b} \cdot \vec{b} $$


$$ \frac{\vec{a} \cdot \vec{b}}{\vec{b} \cdot \vec{b}} = c $$


$$ c = \frac{\vec{a} \cdot \vec{b}}{\left\| b \right\|^2} $$


Thus, our "shadow" vector is merely a scaled version of $\vec{b}$


$$ shadow = c\vec{b}$$


In other words, $c\vec{b}$ captures how far $\vec{a}$ moves in the direction of $\vec{b}$.

Applied: What is the length of the projection of $\vec{a}$ onto $\vec{b}$ ?

In [20]:
c = np.dot(a,b)/np.dot(b,b)
projection_vector = c*b

projection_vector
Out[20]:
array([0.34615385, 1.73076923])
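
The cell above gives the projection vector itself. To answer the question about its length, we can take its norm:

np.linalg.norm(projection_vector)   # roughly 1.77 for these vectors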

What is this really?

The projection is equal to the cosine of the angle if we normalize the vectors; that is, if we rescale the vectors so that their lengths are equal to 1. This puts the vectors onto the unit circle.

To normalize a vector, we divide the vector by its length.


$$ \vec{a}_{norm} = \frac{1}{\left\| a \right\|} \vec{a} $$

where

$$ \left\| \vec{a}_{norm} \right\| = 1 $$

In [21]:
a_norm = 1/np.sqrt(np.dot(a,a))*a
b_norm = 1/np.sqrt(np.dot(b,b))*b
In [22]:
plot.clear().graph(2)
plot.projection(a_norm,b_norm)
plot.show()
In [23]:
c = np.dot(a_norm,b_norm)
c
Out[23]:
0.4280863447390447
In [24]:
cosine(a,b)
Out[24]:
0.4280863447390447

Applied Example: cosine similarity

Consider the two statements: how similar are they?

In [25]:
descrip1 = "This is a speech given by current President Trump about Trump."
descrip2 = "This is a speech given by former President Obama about Trump."

To process these texts, we need to turn them into numerical data. One way to do this is to break the sentences up into individual words (along the way, we'll also scrub capitalization and punctuation).

In [26]:
def tokenize(text=None):
    text = text.lower()
    text = text.replace('.','')
    text_list = text.split()
    return text_list

# When applied we get a simplified version of the string
tokenize(descrip1)
Out[26]:
['this',
 'is',
 'a',
 'speech',
 'given',
 'by',
 'current',
 'president',
 'trump',
 'about',
 'trump']

Next we'll want to count the number of times a word occurs in the text.

In [27]:
d = dict()
for word in tokenize(descrip1):
    if word in d:
        d[word][0] += 1
    else:
        d[word] = [1]
d
Out[27]:
{'this': [1],
 'is': [1],
 'a': [1],
 'speech': [1],
 'given': [1],
 'by': [1],
 'current': [1],
 'president': [1],
 'trump': [2],
 'about': [1]}

Next we can represent this dictionary as a data frame, where the columns reflect the words in a text, the index reflects the text id (i.e. which text those words came from), and the cells reflect the number of times those words appear in a text. This is known as a document term matrix.

In [28]:
DTM = pd.DataFrame(d)
DTM
Out[28]:
this is a speech given by current president trump about
0 1 1 1 1 1 1 1 1 2 1

Let's wrap the above code in a function.

In [29]:
def convert_text_to_dtm(txt):
    '''
    Converts text into a document term matrix.
    '''
    d = dict()
    for word in tokenize(txt):
        if word in d:
            d[word][0] += 1
        else:
            d[word] = [1]
    return pd.DataFrame(d)

convert_text_to_dtm(descrip1)
Out[29]:
this is a speech given by current president trump about
0 1 1 1 1 1 1 1 1 2 1

Let's now build a function that processes many texts at once.

In [30]:
# Now build a function that does this for a list of texts
def gen_DTM(texts=None):
    '''
    Generate a document term matrix
    '''
    DTM = pd.DataFrame()
    for text in texts:
        entry = convert_text_to_dtm(text)
        DTM = pd.concat([DTM, entry], ignore_index=True, sort=True) # Row bind (DataFrame.append was removed in newer versions of pandas)
    
    DTM.fillna(0, inplace=True) # Fill in any missing values with 0s (i.e. when a word is in one text but not another)
    return DTM

# Test it out!        
gen_DTM([descrip1,descrip2]) 
Out[30]:
a about by current former given is obama president speech this trump
0 1 1 1 1.0 0.0 1 1 0.0 1 1 1 2
1 1 1 1 0.0 1.0 1 1 1.0 1 1 1 1

How similar are these two statements?

In [31]:
D = gen_DTM([descrip1,descrip2]) 

# We can index the pandas dataframe to draw out a numpy array ( a vector! )
D.iloc[0].values
Out[31]:
array([1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 2.])
In [32]:
a = D.iloc[0].values
b = D.iloc[1].values

Let's use cosine similarity to understand the relationship between these two statements.

In [33]:
cosine(a,b)
Out[33]:
0.8362420100070909

Pretty similar!

Now, to really drive home the intuition, how similar are these two statements?

In [34]:
docs = [
    "On Saturday, Samantha likes to go shopping at the mall.",
    "The results show that the marginal effect of x on y was trivial and overstated by the original authors."
]
In [35]:
D = gen_DTM(docs)
D 
Out[35]:
and at authors by effect go likes mall marginal of ... saturday, shopping show that the to trivial was x y
0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 ... 1.0 1.0 0.0 0.0 1 1.0 0.0 0.0 0.0 0.0
1 1.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 ... 0.0 0.0 1.0 1.0 3 0.0 1.0 1.0 1.0 1.0

2 rows × 25 columns

In [36]:
a = D.iloc[0].values
b = D.iloc[1].values
cosine(a,b) 
Out[36]:
0.25298221281347033

Much less alike! In fact, the only real similarity between these two statements comes from common function words like "the" and "on". If we were to clean those out, we'd find that there was very little in common between these two statements. Let's do that!

Removing Stopwords

In [37]:
# Words common to the English Language
stopwords = ['on', 'to', 'go', 'at', 'the','that','of','was', 'and', 'by']
In [38]:
# Rewrite our token function to clean out these words
def tokenize(text=None):
    text = text.lower()
    text = text.replace('.','')
    text_list = text.split()
    text_list2 = [word for word in text_list if word not in stopwords]
    return text_list2

print(tokenize(docs[0]))
print(tokenize(docs[1]))
['saturday,', 'samantha', 'likes', 'shopping', 'mall']
['results', 'show', 'marginal', 'effect', 'x', 'y', 'trivial', 'overstated', 'original', 'authors']
In [39]:
D = gen_DTM(docs)
a = D.iloc[0].values
b = D.iloc[1].values
cosine(a,b) 
Out[39]:
0.0

The two statements are completely orthogonal! They go in completely different directions, substantively speaking.

Given this conceptualization, we could think of any document in this way! Our knowledge of vectors helps us make substantive comparisons between unstructured text. Pretty neat!
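
To wrap up, the pieces above can be bundled into a single convenience function. The sketch below defines a hypothetical helper, text_similarity(), that reuses gen_DTM() and cosine() to compare any two raw text strings:

def text_similarity(text_a, text_b):
    '''
    Compare two raw text strings by converting them into a
    document term matrix and computing their cosine similarity.
    '''
    dtm = gen_DTM([text_a, text_b])
    return cosine(dtm.iloc[0].values, dtm.iloc[1].values)

# Example usage (with stopwords removed, this matches the result above)
text_similarity(docs[0], docs[1])   # 0.0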