import numpy as np
import pandas as pd
import requests
# Read in Visualization code from Github (requires bokeh module)
exec(requests.get('https://raw.githubusercontent.com/edunford/ppol564/master/lectures/visualization_library/visualize.py').content)
vla = LinearAlgebra # assign the class to a simpler name
Given $\vec{a}, \vec{b} \in \Re^n$
$$ \vec{a} \cdot \vec{b} $$
$$ \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \cdot \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}$$
$$ a_1 b_1 + a_2 b_2 + \dots + a_n b_n $$
$$ \vec{a} \cdot \vec{b} = \sum_{i=1}^n a_i b_i$$
The dot product between two column vectors produces a scalar ($c$).
# Computationally
a = np.array([1,2])
b = np.array([2,1])
# Three ways to take the dot product using numpy
print(a.dot(b))
# or
print(np.dot(a,b))
# or
print(a @ b)
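To connect the summation formula to the NumPy calls, here is a minimal sketch that spells out the sum term by term. The vectors `u` and `w` are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical example vectors (not taken from the text above)
u = np.array([3, -1, 2])
w = np.array([1, 4, 5])

# Explicit sum of the element-wise products: u_1*w_1 + u_2*w_2 + u_3*w_3
manual = sum(u_i * w_i for u_i, w_i in zip(u, w))

# The same sum written with vectorized NumPy operations
vectorized = np.sum(u * w)

# NumPy's built-in dot product
builtin = np.dot(u, w)

print(manual, vectorized, builtin)
```

All three expressions compute the same scalar; the built-in versions are simply faster and more concise for long vectors.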
Property | Expression |
---|---|
Commutative | $\vec{a} \cdot \vec{b} = \vec{b} \cdot \vec{a}$ |
Distributive | $\vec{v} \cdot (\vec{a} + \vec{b}) = \vec{v} \cdot \vec{a} + \vec{v} \cdot \vec{b}$ |
Associative (with a scalar $c$) | $(c\vec{a}) \cdot \vec{b} = c(\vec{a} \cdot \vec{b})$ |
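These properties can be spot-checked numerically. The sketch below uses hypothetical vectors `u`, `w`, `v` and a hypothetical scalar `k`; a numerical check on one example is an illustration, not a proof.

```python
import numpy as np

# Hypothetical example vectors and scalar, chosen only for illustration
u = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, -1.0, 0.5])
v = np.array([2.0, 0.0, -3.0])
k = 2.5

# Commutative: u . w == w . u
commutative = np.isclose(u @ w, w @ u)

# Distributive: v . (u + w) == v . u + v . w
distributive = np.isclose(v @ (u + w), v @ u + v @ w)

# Scalar multiplication: (k u) . w == k (u . w)
scalar_assoc = np.isclose((k * u) @ w, k * (u @ w))

print(commutative, distributive, scalar_assoc)
```

`np.isclose` is used instead of `==` because floating-point arithmetic can introduce tiny rounding differences.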
What is the length of $\vec{c}$?
In linear algebra, we'll use the notation $||c||$ to denote the length of a vector.
# Vector c
c = np.array([1,2])
# Plot the vector
plot = vla()
plot.graph()
plot.vector(c)
plot.show()
Recall the Pythagorean Theorem
$$ a^2 + b^2 = c^2 $$
If we can express the vector $\vec{c}$ as a triangle, then we can use the Pythagorean theorem to calculate the length of $\vec{c}$.
Now recall our discussion of "unit vectors"
i = np.array([1,0])
j = np.array([0,1])
(1*i) + (2*j)
# Create our scaled unit vectors
a = i
b = 2*j
plot.vector(a)
plot.change_origin(i)
plot.vector(b)
plot.show()
Again let's express the above in terms of the Pythagorean Theorem.
$$ \left\| a \right\|^2 + \left\| b \right\|^2 = \left\| c \right\|^2 $$
We can see that if we add the squared lengths of the two scaled unit vectors and then take the square root of that sum, we get the length of $\vec{c}$.
$$ \sqrt{\left\| a \right\|^2 + \left\| b \right\|^2} = \left\| c \right\| $$
a.dot(a) + b.dot(b) # = ||c||^2
np.sqrt(a.dot(a) + b.dot(b)) # = ||c||
What's critical to note here is that the dot product of $\vec{c}$ with itself gives $||c||^2$, which offers a handy way of calculating the length of $\vec{c}$.
$$ \vec{c} \cdot \vec{c} $$
$$ \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix} \cdot \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}$$
$$ c_1 c_1 + c_2 c_2 + \dots + c_n c_n $$
$$ c_1^2 + c_2^2 + \dots + c_n^2 = \left\| c \right\|^2 $$
$$ \sqrt{\left\| c \right\|^2} = \left\| c \right\| $$
Put concisely,
$$ \sqrt{\vec{c} \cdot \vec{c}} = \left\| c \right\| $$
For example,
$$ \vec{c} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} $$
$$ \begin{bmatrix} 1 \\ 2 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 2 \\ \end{bmatrix}$$
$$ 1(1) + 2(2) $$
$$ 1 + 4 $$
$$ \left\| c \right\|^2 = 5$$
$$ \sqrt{\left\| c \right\|^2} = \sqrt{5}$$
$$ \left\| c \right\| \approx 2.24 $$
# The above via numpy. Note that it matches our earlier calculation.
np.sqrt(c.dot(c))
NumPy offers a quick method for calculating the magnitude of a vector in its .linalg sub-library.
np.linalg.norm(c)
a = np.array([4,1])
b = np.array([1,2])
plot.clear().graph(7)
plot.vector(a)
plot.vector(b)
plot.show()
Recall the Law of Cosines from trigonometry. The Law of Cosines can be thought of as a generalized version of the Pythagorean Theorem.
$$ c^2 = a^2 + b^2 - 2ab\cos{\theta} $$
For example, when the angle ($\theta$) between $a$ and $b$ is 90 degrees, $\cos{\theta}$ equals $0$ and the equation reduces to the Pythagorean Theorem.
$$ c^2 = a^2 + b^2 - 0 $$
Considering the two vectors presented above, let's subtract the two to produce a third vector $\vec{c}$. This will give us a triangle.
plot.subtract_vectors(a,b)
plot.show()
Applying the Law of Cosines to these vectors, something interesting pops out!
$$ \left\| c \right\|^2 = \left\| a \right\|^2 + \left\| b \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta} $$
$$ \left\| a - b \right\|^2 = \left\| a \right\|^2 + \left\| b \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta} $$
$$ (\vec{a} - \vec{b}) \cdot (\vec{a} - \vec{b}) = \left\| a \right\|^2 + \left\| b \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta} $$
$$ \vec{a} \cdot \vec{a} - 2(\vec{a} \cdot \vec{b}) + \vec{b} \cdot \vec{b} = \left\| a \right\|^2 + \left\| b \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta} $$
$$ \left\| a \right\|^2 - 2(\vec{a} \cdot \vec{b}) + \left\| b \right\|^2 = \left\| a \right\|^2 + \left\| b \right\|^2 - 2\left\| a \right\| \left\| b \right\|\cos{\theta} $$
$$ - 2(\vec{a} \cdot \vec{b}) = - 2\left\| a \right\| \left\| b \right\|\cos{\theta} $$
$$ \vec{a} \cdot \vec{b} = \left\| a \right\| \left\| b \right\|\cos{\theta} $$
In words, the dot product of two vectors is equal to the product of their lengths times the cosine of the angle between them.
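As a rough numerical check of the derivation, the sketch below confirms that both sides agree for one pair of vectors. The names `u` and `w` are stand-ins (with the values $[4,1]$ and $[1,2]$) so that the sketch does not overwrite variables used elsewhere.

```python
import numpy as np

# Stand-in vectors for the derivation
u = np.array([4, 1])
w = np.array([1, 2])

# Left-hand side: ||u - w||^2 computed directly as a dot product
lhs = (u - w) @ (u - w)

# Expanded form: u.u - 2(u.w) + w.w
rhs = u @ u - 2 * (u @ w) + w @ w

# Cosine of the angle recovered from the dot product identity
norm_u = np.sqrt(u @ u)
norm_w = np.sqrt(w @ w)
cos_theta = (u @ w) / (norm_u * norm_w)

# Law of Cosines evaluated with that cosine
law_of_cosines = norm_u**2 + norm_w**2 - 2 * norm_u * norm_w * cos_theta

print(lhs, rhs, law_of_cosines)
```

All three quantities describe the squared length of the third side of the triangle, so they should match up to floating-point rounding.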
An important rule to keep in mind: the sum of the lengths of any two sides of a triangle must always be greater than or equal to the length of the third side. This is the triangle inequality, and in linear algebra it carries over from plane geometry to $N$-dimensional space.
$$ \left\| \vec{a} + \vec{b} \right\| \le \left\| \vec{a} \right\| + \left\|\vec{b} \right\| $$
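The inequality can be illustrated on randomly generated vectors. This is a sketch under assumptions: the seed, the number of pairs, and the dimension are arbitrary choices, and a numerical check is not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# 1000 hypothetical pairs of 5-dimensional vectors (one pair per row)
U = rng.normal(size=(1000, 5))
W = rng.normal(size=(1000, 5))

# ||u + w|| and ||u|| + ||w|| for every pair
lhs = np.linalg.norm(U + W, axis=1)
rhs = np.linalg.norm(U, axis=1) + np.linalg.norm(W, axis=1)

# Small tolerance guards against floating-point rounding
holds = np.all(lhs <= rhs + 1e-12)

# Equality case: two vectors pointing in exactly the same direction
p = np.array([1.0, 2.0])
q = 3 * p
equal_case = np.isclose(np.linalg.norm(p + q),
                        np.linalg.norm(p) + np.linalg.norm(q))

print(holds, equal_case)
```

The equality case shows when the "or equal to" part applies: the triangle collapses into a straight line.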
Again, when the angle between two vectors is 90 degrees (i.e. when the vectors are perpendicular to one another), $\cos{\theta} = 0$.
a = np.array([4,0])
b = np.array([0,5])
plot.clear().graph(10)
plot.vector(a)
plot.vector(b)
plot.show()
$$ \vec{a} \cdot \vec{b} = \left\| a \right\| \left\| b \right\|\cos{90^\circ}$$
$$ \vec{a} \cdot \vec{b} = \left\| a \right\| \left\| b \right\| \cdot 0$$
$$ \vec{a} \cdot \vec{b} = 0 $$
np.dot(a,b)
Thus, when the dot product of two non-zero vectors is 0, we know that the two vectors are orthogonal to one another.
a = np.array([4,1])
b = np.array([1,5])
plot.clear().graph(10)
plot.vector(a)
plot.vector(b)
plot.show()
np.dot(a,b)
Taking what we now know, we can easily calculate the angle between two vectors!
$$ \vec{a} \cdot \vec{b} = \left\| a \right\| \left\| b \right\|\cos{\theta}$$
$$ \cos{\theta} = \frac{\vec{a} \cdot \vec{b}}{\left\| a \right\| \left\| b \right\|}$$
# Let's define the above as a function:
def cosine(a,b):
    cos = np.dot(a,b)/(np.sqrt(np.dot(a,a)) * np.sqrt(np.dot(b,b)) )
    return cos
round(cosine(a,b),3)
# Let's reverse engineer this to get the dot product again.
np.sqrt(a.dot(a)) * np.sqrt(b.dot(b)) * cosine(a,b)
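To go from the cosine to the angle itself, we can apply the inverse cosine. The helper `angle_between` below is a hypothetical name introduced for this sketch; it assumes both vectors are non-zero.

```python
import numpy as np

def angle_between(u, w):
    '''
    Angle between two non-zero vectors, in degrees (hypothetical helper).
    '''
    u = np.asarray(u, dtype=float)
    w = np.asarray(w, dtype=float)
    cos_theta = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))
    # Rounding can push the value slightly outside [-1, 1], where arccos is undefined
    cos_theta = np.clip(cos_theta, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))

print(angle_between([4, 0], [0, 5]))  # perpendicular vectors
print(angle_between([1, 1], [1, 0]))  # diagonal vs. horizontal axis
print(angle_between([4, 1], [1, 5]))
```

`np.arccos` returns radians, so `np.degrees` converts the result to match the degree-based discussion in the text.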
So what is the dot product between two vectors exactly?
We can think of the dot product of two vectors as a measure of how far the two vectors move in the same direction.
Imagine we cast a vector down from the tip of $\vec{a}$ onto $\vec{b}$ such that this vector (which we'll call $\vec{v}$) is orthogonal to $\vec{b}$ (the angle between them is 90 degrees).
plot.clear().graph(8)
plot.projection(a,b)
plot.show()
What is the size (magnitude) of the "shadow" cast by that vector onto $\vec{b}$?
$$ \vec{v} = \vec{a} - c\vec{b} $$
By design, $\vec{b} \cdot \vec{v} = 0$
$$ (\vec{a}-c\vec{b}) \cdot \vec{b} = 0 $$
$$ \vec{a} \cdot \vec{b} -c\vec{b} \cdot \vec{b} = 0 $$
$$ \vec{a} \cdot \vec{b} = c\vec{b} \cdot \vec{b} $$
$$ \frac{\vec{a} \cdot \vec{b}}{\vec{b} \cdot \vec{b}} = c $$
$$ c = \frac{\vec{a} \cdot \vec{b}}{\left\| b \right\|^2} $$
Thus, our "shadow" vector is merely a scaled version of $\vec{b}$
$$ \text{shadow} = c\vec{b}$$
$\vec{a}$ is moving in the direction of $\vec{b}$ by $c\vec{b}$
Applied: What is the projection of $\vec{a}$ onto $\vec{b}$?
c = np.dot(a,b)/np.dot(b,b)
projection_vector = c*b
projection_vector
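As a sanity check on the derivation, the sketch below verifies that the leftover vector is orthogonal to the vector being projected onto, and computes the length of the "shadow". The names `u`, `w`, `scale`, `proj`, and `residual` are stand-ins chosen so the sketch does not overwrite variables used elsewhere.

```python
import numpy as np

# Stand-in vectors: project u onto w
u = np.array([4, 1])
w = np.array([1, 5])

# Scalar multiplier from the derivation: (u . w) / (w . w)
scale = np.dot(u, w) / np.dot(w, w)

# The projection ("shadow") is a scaled copy of w
proj = scale * w

# The leftover vector should be orthogonal to w
residual = u - proj
orthogonality = np.dot(residual, w)

# Length of the projection, which equals |u . w| / ||w||
proj_length = np.linalg.norm(proj)

print(proj, orthogonality, proj_length)
```

The orthogonality value may print as a tiny non-zero number because of floating-point rounding rather than as an exact zero.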
The scalar $c$ of the projection equals the cosine of the angle if we first normalize both vectors, that is, if we rescale them so that their lengths equal 1. This puts the vectors onto the unit circle.
To normalize a vector, we divide it by its length.
$$ \vec{a}_{norm} = \frac{1}{\left\| a \right\|} \vec{a} $$
where
$$ \left\| \vec{a}_{norm} \right\| = 1 $$
a_norm = 1/np.sqrt(np.dot(a,a))*a
b_norm = 1/np.sqrt(np.dot(b,b))*b
plot.clear().graph(2)
plot.projection(a_norm,b_norm)
plot.show()
c = np.dot(a_norm,b_norm)
c
cosine(a,b)
Consider these two statements: how similar are they?
descrip1 = "This is a speech given by current President Trump about Trump."
descrip2 = "This is a speech given by former President Obama about Trump."
To process these texts, we need to turn them into numerical data. One way to do this is to break up the sentences into individual words (along the way we'll also scrub capitalization and punctuation).
def tokenize(text=None):
    text = text.lower()
    text = text.replace('.','')
    text_list = text.split()
    return text_list
# When applied we get a simplified version of the string
tokenize(descrip1)
Next we'll want to count the number of times a word occurs in the text.
d = dict()
for word in tokenize(descrip1):
    if word in d:
        d[word][0] += 1
    else:
        d[word] = [1]
d
Next we can represent this dictionary as a data frame, where the columns reflect the words in a text, and the index reflects the text id (i.e. where those words came from), and the cells reflect the number of times those words appear in a text. This is known as a document term matrix.
DTM = pd.DataFrame(d)
DTM
Let's wrap the above code in a function.
def convert_text_to_dtm(txt):
    '''
    Converts text into a document term matrix.
    '''
    d = dict()
    for word in tokenize(txt):
        if word in d:
            d[word][0] += 1
        else:
            d[word] = [1]
    return pd.DataFrame(d)
convert_text_to_dtm(descrip1)
Let's now build a function that processes many texts at once.
# Now build a function that does this for a list of texts
def gen_DTM(texts=None):
    '''
    Generate a document term matrix
    '''
    # One single-row data frame per text
    entries = [convert_text_to_dtm(text) for text in texts]
    # Row bind (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    DTM = pd.concat(entries, ignore_index=True, sort=True)
    DTM = DTM.fillna(0) # Fill in any missing values with 0s (i.e. when a word is in one text but not another)
    return DTM
# Test it out!
gen_DTM([descrip1,descrip2])
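The same counting step can also be sketched with `collections.Counter` from the standard library. This is an alternative illustration, not the approach used above; `tokenize_simple` and `build_dtm` are hypothetical names introduced here so that the functions defined earlier are left untouched.

```python
from collections import Counter
import pandas as pd

def tokenize_simple(text):
    '''Lowercase, drop periods, and split on whitespace (hypothetical helper).'''
    return text.lower().replace('.', '').split()

def build_dtm(texts):
    '''Build a document term matrix with one row per text (hypothetical helper).'''
    # Counter tallies how often each word occurs in a text
    counts = [dict(Counter(tokenize_simple(t))) for t in texts]
    # A list of dicts becomes one row per text; words missing from a text become NaN
    return pd.DataFrame(counts).fillna(0).astype(int)

texts = [
    "This is a speech given by current President Trump about Trump.",
    "This is a speech given by former President Obama about Trump.",
]
dtm = build_dtm(texts)
print(dtm)
```

Building all the rows first and creating the data frame once avoids growing a data frame inside a loop.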
How similar are these two statements?
D = gen_DTM([descrip1,descrip2])
# We can index the pandas dataframe to draw out a numpy array ( a vector! )
D.iloc[0].values
a = D.iloc[0].values
b = D.iloc[1].values
Let's use cosine similarity to understand the relationship between these two statements.
cosine(a,b)
Pretty similar!
Now, to really drive home the intuition, how similar are these two statements?
docs = [
"On Saturday, Samantha likes to go shopping at the mall.",
"The results show that the marginal effect of x on y was trivial and overstated by the original authors."
]
D = gen_DTM(docs)
D
a = D.iloc[0].values
b = D.iloc[1].values
cosine(a,b)
Much less alike! Actually, the only real similarity between these two statements comes from common function words such as "the" and "on". If we were to clean those out, we'd find that there was very little in common between these two statements. Let's do that!
Removing Stopwords
# Words common to the English Language
stopwords = ['on', 'to', 'go', 'at', 'the','that','of','was', 'and', 'by']
# Rewrite our token function to clean out these words
def tokenize(text=None):
    text = text.lower()
    text = text.replace('.','')
    text_list = text.split()
    text_list2 = [word for word in text_list if word not in stopwords]
    return text_list2
print(tokenize(docs[0]))
print(tokenize(docs[1]))
D = gen_DTM(docs)
a = D.iloc[0].values
b = D.iloc[1].values
cosine(a,b)
The two statements are completely orthogonal! They go in completely different directions, substantively speaking.
Given this conceptualization, we could think of any document in this way! Our knowledge of vectors helps us make substantive comparisons between unstructured text. Pretty neat!
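To compare more than two documents at once, one option is to normalize every row of the document term matrix and take all pairwise dot products in a single matrix multiplication. The sketch below uses a small hypothetical count matrix `X` (three documents, four words) rather than real text, and assumes no row is all zeros.

```python
import numpy as np

# Hypothetical document term matrix: rows are documents, columns are word counts
X = np.array([
    [2, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 3, 0],
], dtype=float)

# Normalize each row to length 1 (assumes no row is all zeros)
row_lengths = np.linalg.norm(X, axis=1, keepdims=True)
X_norm = X / row_lengths

# Entry (i, j) is the cosine similarity between document i and document j
S = X_norm @ X_norm.T

print(np.round(S, 3))
```

Each document has a similarity of 1 with itself, and documents that share no words (here the first and third) have a similarity of 0.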