PPOL564 | DS1: Foundations

Lecture 21

Gradient Descent

Concepts Covered:

  • Intuition of Leveraging Gradients
  • Gradient Descent on a linear outcome
In [74]:
import numpy as np
import sympy as sp
import statsmodels.formula.api as smf
import pandas as pd
from bokeh.plotting import figure, output_notebook, show
from bokeh.layouts import column, row, gridplot
output_notebook()
sp.init_printing(use_unicode=True)

Intuition surrounding leveraging gradients

Define an example function

In [52]:
# Consider the following function...
z = sp.symbols('z')
sym_f = (z**2+1) + -sp.exp(z) - 20*sp.sin(z)
f = sp.lambdify(z,sym_f,'numpy')
df = sp.lambdify(z,sym_f.diff(),'numpy')
ddf = sp.lambdify(z,sym_f.diff().diff(),'numpy')
In [53]:
x = np.arange(-10,4,.1) # Generate a numerical range
p = figure(height=300,width=700)
p.line(x,f(x),line_width=3)
show(p)

Newton-Raphson Method

"Newton–Raphson method" is a method for finding successively better approximations to the roots (or zeroes) of a real-valued function (Wiki).

The goal is to find the $x^*$ that minimizes $f(x)$.

As we have seen, if $f(x)$ is convex, we need to locate $x^*$ such that $f'(x^*) = 0$.

Let $g(x) = f'(x)$. Take a Taylor expansion about the optimal solution $x^*$ (see the reading on what a Taylor series expansion is; in short, it's just a polynomial approximation of the shape of a function at a given point):



$$ g(x^*) = g(x) + (x^* - x)g'(x) + \text{“negligible” higher order terms} $$



Because $g(x^*) = f'(x^*) = 0$,



$$ 0 \approx g(x) + (x^* - x)g'(x) \quad\to\quad x^* \approx x - \frac{g(x)}{g'(x)} = x - \frac{f'(x)}{f''(x)} $$



[Animation of Newton's method iterations, from Wikipedia]
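Before writing a full loop, here is a single Newton–Raphson step computed by hand: a minimal sketch using the `df` and `ddf` lambdas defined above (the names `x0` and `x1` are just for illustration). The cell below simply repeats this update until the proposals stop changing.

x0 = -1                    # an arbitrary starting value
x1 = x0 - df(x0)/ddf(x0)   # one Newton-Raphson step: x - f'(x)/f''(x)
x0, x1                     # compare the starting value to the new proposal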

In [56]:
x = -1 # Starting value
on = True
while on:
    new_x = x - (df(x)/ddf(x))
    if np.isclose(new_x,x,.001):
        if ddf(new_x)>0:
            ctype = "Minima"
        elif ddf(new_x)<0: 
            ctype = "Maxima"
        else:
            ctype = 'Inflection'
        print(f"Converged at {round(x,3)} ({ctype})")
        on = False
    else:
        print(f'current: {round(x,3)}, proposed: {round(new_x,3)}')
        x = new_x
current: -1, proposed: -1.867
current: -1.867, proposed: -1.754
current: -1.754, proposed: -1.756
Converged at -1.756 (Maxima)
In [57]:
x = np.arange(-10,4,.1) # Generate a numerical range
p = figure(height=300,width=700)
p.line(x,f(x),line_width=3,alpha=.3,legend="f(x)")
p.scatter(new_x,f(new_x),color="red",size=6,alpha=.4,legend="proposed x")
p.yaxis.axis_label = 'f(x)'
p.xaxis.axis_label = 'x'
show(p)

What happens when we:

  • change the sign of the proposal?
  • set different starting values?

Steepest Ascent/Descent

Here we'll only require information about the rate of change of the function: select some initial location ($x$), calculate the derivative at that location, and then take a "step" in that direction.

In [8]:
prop_x = 1
step=.05
new_x = prop_x + step*df(prop_x)
new_x
Out[8]:
$$0.423783602708908$$

Let's do this repeatedly until the values don't appear to change.

In [80]:
prop_x = -2 # Starting value
step=.05
iters = 100
vals = []
for i in range(iters):
    new_x = prop_x - step*df(prop_x)
    if np.isclose(new_x,prop_x,.001):
        print(f'Converged on iteration {i} at value {round(new_x,3)}')
        vals.append([new_x,f(new_x)])
        break
    else:
        prop_x = new_x
        vals.append([prop_x,f(prop_x)])
vals = np.array(vals)
Converged on iteration 6 at value -4.27
In [81]:
x = np.arange(-10,4,.1) # Generate a numerical range
p = figure(height=300,width=700)
p.line(x,f(x),line_width=3,alpha=.3,legend="f(x)")
p.scatter(vals[:,0],vals[:,1],color="red",size=6,alpha=.4,legend="proposed x")
p.yaxis.axis_label = 'f(x)'
p.xaxis.axis_label = 'x'

p2 = figure(height=200,width=700)
p2.line(np.arange(i+1),vals[:,1],line_width=3,alpha=.6,color="red")
p2.xaxis.axis_label = 'Iterations'
p2.yaxis.axis_label = 'f(proposed x)'
show(column(p,p2))

What happens when we:

  • change the sign of the steps?
  • set different starting values?
  • change the step size?

How do we ensure we're locating the global minimum?
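One simple safeguard, sketched below, is to run the descent from several different starting values and keep whichever proposal yields the lowest $f(x)$. This reuses `f` and `df` from above; the helper name `descend` and the particular starting values are illustrative, not part of the lecture code.

def descend(start, step=.05, max_iter=100, tol=.001):
    '''Steepest descent from a given starting value; returns the final proposal.'''
    x = start
    for _ in range(max_iter):
        new_x = x - step*df(x)          # step against the derivative
        if np.isclose(new_x, x, tol):   # stop when proposals stabilize
            break
        x = new_x
    return new_x

starts = [-8, -4, -1, 2]                   # a handful of starting values
candidates = [descend(s) for s in starts]  # local minima reached from each start
min(candidates, key=f)                     # keep the proposal with the lowest f(x)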

Gradient Descent

Linear Model (Continuous Outcomes)

Generate Fake Data

To know the expected results, let's simulate some data...

In [63]:
N = 100
x = np.random.normal(0,1,N)
x2 = np.random.normal(0,1,N)
y = 1 + 2*x + 3*x2 + np.random.normal(0,1,N)

# Incorporate as design matrix
X = np.column_stack([np.ones(x.shape),x,x2])
X[1:5,:]
Out[63]:
array([[ 1.        ,  0.4064541 , -0.97402348],
       [ 1.        ,  0.32210607, -1.0491106 ],
       [ 1.        , -0.05151772, -0.46483438],
       [ 1.        , -0.20420096, -0.49055989]])

Linear Model

The data generating process



$$ y_i = \beta_{0} + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i$$



Our function that approximates this linear process



$$ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} = X\hat{\beta} $$



Define the cost function

Here we'll leverage information about the gradient to minimize the cost function. The cost function in this context is defined as the mean squared error:



$$ cost(y,X,\beta) = \frac{1}{N}\sum_{i=1}^N (y_i - \hat{y}_i)^2 = \frac{e' e}{N}$$



Here is the cost function in algebraic terms...

In [64]:
def cost(y,X,B):
    '''
    Cost function is the mean squared error.
    '''
    ones = X[:,0] # column of ones for the intercept
    x1 = X[:,1]
    x2 = X[:,2]
    e = sum((y - (B[0]*ones + B[1]*x1 + B[2]*x2))**2)
    return e/y.shape[0]
cost(y,X,B=np.array([1,2,3]))
Out[64]:
$$0.8021661709036804$$

And here it is in linear algebra terms...

In [65]:
def cost(y,X,B):
    '''
    Cost function is the mean squared error.
    '''
    e = y - X.dot(B)   # residual vector
    n = e.shape[0]     # number of observations
    return np.array([e.dot(e)/n]) # length-1 array (indexed as mse[0] below)
cost(y,X,B=np.array([1,2,3]))
Out[65]:
array([0.80216617])

Calculate the gradient of the cost function

Using what we know of the chain rule and gradients...



$$ -\nabla \, cost = - \begin{pmatrix} \partial_{\beta_0} cost \\ \partial_{\beta_1} cost \\ \partial_{\beta_2} cost \end{pmatrix} = \frac{1}{N}\sum_{i=1}^{N} \begin{pmatrix} 2\,(y_i - (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}))\\ 2\, x_{1i}(y_i - (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}))\\ 2\, x_{2i}(y_i - (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i})) \end{pmatrix} $$


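As a quick symbolic check (a sketch using the sympy import from the top of the notebook), we can differentiate a single observation's squared error; the results are the per-observation partial derivatives, i.e. the negatives of the rows above before averaging over $N$, which is exactly what `calc_gradients` computes below.

b0, b1, b2, yi, x1i, x2i = sp.symbols('b0 b1 b2 y_i x_1i x_2i')
sq_error = (yi - (b0 + b1*x1i + b2*x2i))**2   # one observation's squared error
[sq_error.diff(b) for b in (b0, b1, b2)]      # partial derivatives w.r.t. each beta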

In [66]:
def calc_gradients(y,X,B):
    '''
    Gradient of the cost function 
    '''
    n = y.shape[0]
    betas =[]
    betas.append(-sum( 2*(y - X.dot(B)) )/n)
    for i in range(1,B.shape[0]):
        betas.append(-sum( 2*X[:,i]*(y - X.dot(B)) )/n)
    return np.array(betas)

Preview how this works...

In [67]:
initial_B = np.array([1,2,3])
calc_gradients(y,X,initial_B)
Out[67]:
array([ 0.27480226, -0.23523127,  0.02184896])
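As a sanity check on the analytic gradient (a sketch, not part of the lecture code), we can compare it against a central finite-difference approximation; the helper name `numeric_gradients` and the step size `h` are arbitrary choices.

def numeric_gradients(y, X, B, h=1e-6):
    '''Approximate the gradient of the cost with central finite differences.'''
    grads = []
    for j in range(B.shape[0]):
        B_hi = B.astype(float)  # astype returns a copy, so B is untouched
        B_lo = B.astype(float)
        B_hi[j] += h            # nudge the j-th beta up...
        B_lo[j] -= h            # ...and down
        diff = (cost(y, X, B_hi) - cost(y, X, B_lo))/(2*h)
        grads.append(np.ravel(diff)[0])  # cost returns a length-1 array; keep the scalar
    return np.array(grads)

numeric_gradients(y, X, initial_B)  # should closely match calc_gradients(y, X, initial_B)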

As we did above, let's use the gradient to take steps in the right direction. We'll establish a "learning rate" that determines how far we step in a given direction.

The following code has a lot of embellishments, but only a few lines do the real work: calculating the gradients and updating the betas.

In [82]:
np.random.seed(123)   # Seed for reproducibility
B = np.random.rand(3) # initial (random) betas
learning_rate = .1    # Learning rate
iterations = 50       # number of iterations
stats = []            # Let's save the state of the algorithm

for i in range(iterations):
    
    mse = cost(y,X,B) # Calculate the mean squared error

    # Print off every 5th iteration...
    if i % 5 == 0:
        print('iteration:',i,"Loss:",mse[0].round(2),'Coefs:',B.round(2))
        
    # The bit that matters! 
    gradients = calc_gradients(y,X,B)
    B = B - (gradients*learning_rate) # New betas
    
    # Record values
    stats.append(np.hstack([mse,B]))
        
# Convert to a numpy array for plotting
stats = np.array(stats)
iteration: 0 Loss: 12.22 Coefs: [0.7  0.29 0.23]
iteration: 5 Loss: 1.86 Coefs: [0.77 1.67 2.04]
iteration: 10 Loss: 0.88 Coefs: [0.82 2.01 2.66]
iteration: 15 Loss: 0.78 Coefs: [0.84 2.09 2.86]
iteration: 20 Loss: 0.77 Coefs: [0.85 2.1  2.94]
iteration: 25 Loss: 0.77 Coefs: [0.85 2.11 2.96]
iteration: 30 Loss: 0.77 Coefs: [0.86 2.11 2.97]
iteration: 35 Loss: 0.77 Coefs: [0.86 2.11 2.97]
iteration: 40 Loss: 0.77 Coefs: [0.86 2.11 2.97]
iteration: 45 Loss: 0.77 Coefs: [0.86 2.11 2.98]

Visualize

In [83]:
# Generate plot
p = figure(height=400,width=400)
p.xaxis.axis_label = "Iterations"
p.yaxis.axis_label = "Mean Squared Error"
p.line(np.arange(stats.shape[0]),stats[:,0],line_width=3)

p2 = figure(height=400,width=400)
p2.xaxis.axis_label = "Iterations"
p2.yaxis.axis_label = "Coeffients"
p2.line(np.arange(stats.shape[0]),stats[:,1],color="red",line_width=3,legend="B0")
p2.line(np.arange(stats.shape[0]),stats[:,2],color="orange",line_width=3,legend="B1")
p2.line(np.arange(stats.shape[0]),stats[:,3],color="pink",line_width=3,legend="B2")

# Print plot
show(row(p,p2))
In [84]:
# Our betas retrieved through gradient descent. 
B
Out[84]:
array([0.85798737, 2.10784267, 2.97526645])
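As an extra check (not part of the lecture code), the closed-form least squares solution from the normal equations, $\hat{\beta} = (X'X)^{-1}X'y$, should land on essentially the same values:

np.linalg.solve(X.T.dot(X), X.T.dot(y))  # closed-form OLS estimate, for comparison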

Compare our model fit to the statsmodels output.

In [85]:
model = smf.ols('y ~ x1 + x2', data=pd.DataFrame(dict(y=y,x1=X[:,1],x2 = X[:,2]))).fit()
model.summary()
Out[85]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.948
Model:                            OLS   Adj. R-squared:                  0.947
Method:                 Least Squares   F-statistic:                     887.7
Date:                Mon, 18 Nov 2019   Prob (F-statistic):           4.43e-63
Time:                        13:30:28   Log-Likelihood:                -128.81
No. Observations:                 100   AIC:                             263.6
Df Residuals:                      97   BIC:                             271.4
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.8580      0.089      9.615      0.000       0.681       1.035
x1             2.1078      0.083     25.260      0.000       1.942       2.273
x2             2.9754      0.092     32.418      0.000       2.793       3.158
==============================================================================
Omnibus:                        0.350   Durbin-Watson:                   1.961
Prob(Omnibus):                  0.839   Jarque-Bera (JB):                0.104
Skew:                          -0.062   Prob(JB):                        0.949
Kurtosis:                       3.097   Cond. No.                         1.13
==============================================================================


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.