PPOL564 | Data Science 1 | Foundations

Statistical Learning Walkthrough


Comparing and Tuning Models

Dependencies

In [1]:
# Data Management/Investigation
import pandas as pd
import numpy as np
import missingno as miss
from plotnine import *
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# For pre-processing data 
from sklearn import preprocessing as pp 
from sklearn.compose import ColumnTransformer 

# For splits and CV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold # Cross validation 
from sklearn.model_selection import cross_validate # Cross validation 
from sklearn.model_selection import GridSearchCV # Cross validation + param. tuning.

# Machine learning methods 
from sklearn.linear_model import LinearRegression as LM
from sklearn.neighbors import KNeighborsRegressor as KNN
from sklearn.tree import DecisionTreeRegressor as DTree
from sklearn import tree # For plotting the decision tree rules
from sklearn.ensemble import BaggingRegressor as Bag
from sklearn.ensemble import RandomForestRegressor as RF

# For evaluating our model's performance
import sklearn.metrics as m

# Pipeline to combine modeling elements
from sklearn.pipeline import Pipeline

Data

DC Housing Price data.

In [2]:
# Data on Housing Prices in DC
dat = pd.read_csv("dc_housing_prices.csv")
dat.shape
Out[2]:
(107154, 39)
In [3]:
# Available features
list(dat)
Out[3]:
['OBJECTID',
 'SSL',
 'BATHRM',
 'HF_BATHRM',
 'HEAT',
 'HEAT_D',
 'AC',
 'NUM_UNITS',
 'ROOMS',
 'BEDRM',
 'AYB',
 'YR_RMDL',
 'EYB',
 'STORIES',
 'SALEDATE',
 'PRICE',
 'QUALIFIED',
 'SALE_NUM',
 'GBA',
 'BLDG_NUM',
 'STYLE',
 'STYLE_D',
 'STRUCT',
 'STRUCT_D',
 'GRADE',
 'GRADE_D',
 'CNDTN',
 'CNDTN_D',
 'EXTWALL',
 'EXTWALL_D',
 'ROOF',
 'ROOF_D',
 'INTWALL',
 'INTWALL_D',
 'KITCHENS',
 'FIREPLACES',
 'USECODE',
 'LANDAREA',
 'GIS_LAST_MOD_DTTM']

Look at missingness in the data. Issue: we're missing information on the outcome PRICE.

In [4]:
miss.matrix(dat)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81582c6e10>

Let's do the following:

  • Only retain data values where the outcome is measured.
  • Only retain observations where there is a selling price (i.e. when PRICE == 0, the house hasn't been sold; we're interested in what predicts the sale price given that the house was sold).
  • Subsample the data. We'll only look at some of it for now (so the models run faster; you can run the full data in the coding discussion).
In [5]:
# Drop all observations where the outcome is missing 
dat = dat[~dat.PRICE.isna()]

# Only houses that are sold
dat = dat[dat.PRICE > 0]

# Subsample
dat = dat.sample(1000,random_state=1988)

Set up data in the SKLEARN framework (i.e. outcome separated from the predictors).

Note: I'm not going to use all the data for this walkthrough.

In [6]:
y = np.log(dat['PRICE']) # OUTCOME: log selling price (some houses are ridiculously expensive)
X = dat[['BATHRM','ROOMS','LANDAREA']] # PREDICTORS: simplified for our purposes
X['LANDAREA'] = np.log(X['LANDAREA']) # Log land area (some houses are on a lot of land, others on barely a plot)

Split the data

In [7]:
train_X, test_X, train_y, test_y = train_test_split(X,y,test_size=.25,random_state=1988)

Explore the training data

In [8]:
# Look at the training data
train_X.shape 
Out[8]:
(750, 3)

How is the data distributed?

In [9]:
(
    ggplot(train_X.melt(),aes(x="value")) +
    geom_histogram() +
    facet_wrap("variable",scales="free") +
    theme_minimal() +
    theme(figure_size = (10,3)) 
) 
Out[9]:
<ggplot: (8762103043357)>

Insights:

  • Scale will be an issue, so we need to pre-process.
  • BATHRM and ROOMS are discrete count variables that take on only a few values.

Look at the outcome PRICE and see if there are any red flags.

In [10]:
(
    ggplot(pd.DataFrame(dict(price=train_y)),
           aes(x="price")) +
    geom_histogram() +
    theme_minimal() +
    theme(figure_size = (7,3)) 
)
Out[10]:
<ggplot: (-9223363274760488276)>

Let's just look at how our predictors relate to the outcome.

In [11]:
D = train_X.copy()
D['PRICE'] = train_y

(
    ggplot(D.melt(id_vars=["PRICE"]),
           aes(x="value",y="PRICE"))+
    geom_point(alpha=.5) +
    facet_wrap("variable",scales="free") +
    geom_smooth(method="lm",se=False,color="red") +
    theme_minimal() +
    theme(figure_size = (10,3)) 
)
Out[11]:
<ggplot: (8762144576113)>

Pre-Processing

Let's rescale our training data for now, but we'll put the preprocessing into a modeling pipeline later on.

In [12]:
# Rescale the training features to the unit interval
scaler = pp.MinMaxScaler()
col_names = list(train_X)
train_X = scaler.fit_transform(train_X)
In [13]:
# Convert back into data frame
train_X = pd.DataFrame(train_X,columns=col_names)
train_X
Out[13]:
BATHRM ROOMS LANDAREA
0 0.375 0.35 0.342411
1 0.250 0.40 0.577270
2 0.125 0.30 0.617689
3 0.125 0.30 0.297460
4 0.375 0.45 0.276442
... ... ... ...
745 0.250 0.40 0.369246
746 0.250 0.30 0.333062
747 0.250 0.40 0.711232
748 0.250 0.35 0.206009
749 0.500 0.75 0.624465

750 rows × 3 columns
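
As a quick sanity check (not part of the original output), every column should now span [0, 1] after MinMax scaling. A minimal sketch of verifying this:

# Sanity check (sketch): after MinMax scaling, each column should range from 0 to 1
train_X.agg(['min', 'max'])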

Modeling

The problem that we're working on is a regression problem (i.e. we have a continuously distributed outcome variable). We've played around with a few models that can tackle problems such as these.

  • Linear Regression
  • KNN
  • Decision Trees
  • Bagging
  • Random Forest

Luckily, implementing each of these (very different) algorithms is straightforward using the sklearn API.

For each model, we'll use K-fold cross-validation to estimate the test error. As we saw a few weeks back, we can do this easily with the cross_validate method. We'll use 5 folds when cross-validating.

Recall that in order to make valid comparisons across models, we need to use the same data splits for each fold. We can do this by creating a KFold generator that will ensure we're using the same break points when training the model.

In [14]:
fold_generator = KFold(n_splits=5, shuffle=True,random_state=111)

Let's run each model individually (using defaults for any tuning parameters; more on this later) and then compare the performance of the different models in a plot.

We'll use mean squared error as our performance metric.

In [15]:
use_metrics = ["neg_mean_squared_error"]
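
A quick aside: sklearn scorers follow a "higher is better" convention, so mean squared error is reported as neg_mean_squared_error. A minimal sketch of the sign convention, using made-up values:

# Sketch with made-up values: sklearn reports -MSE so that higher scores are better
y_true = np.array([1.0, 2.0, 3.0])
y_hat  = np.array([1.5, 1.5, 3.5])
mse = m.mean_squared_error(y_true, y_hat)  # 0.25
neg_mse = -mse                             # what cross_validate will report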

Run the Models

Linear Model

In [16]:
lm_scores = cross_validate(LM(),train_X,train_y, cv = fold_generator, scoring =use_metrics)

KNN

In [17]:
knn_scores = cross_validate(KNN(),train_X,train_y, cv = fold_generator, scoring =use_metrics)

Decision Tree

In [18]:
dt_scores = cross_validate(DTree(),train_X,train_y, cv = fold_generator, scoring =use_metrics)

Side Note: Plotting a Decision Tree

First, a quick side note on decision trees. When we built a tree from scratch, we discussed looking at the tree's decision rules. There is a very straightforward way of doing this in sklearn.

In [19]:
mod = DTree(max_depth=3) # Initialize the modeling object (just as we did)
mod.fit(train_X,train_y) # Fit the model

# Plot the tree
plt.figure(figsize=(12,8),dpi=300)
rules = tree.plot_tree(mod,feature_names = col_names,fontsize=8)
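
If a plot is more than you need, the decision rules can also be printed as text. A minimal sketch using export_text (an sklearn utility not imported in the dependencies above):

# Sketch: print the fitted tree's decision rules as text
from sklearn.tree import export_text
print(export_text(mod, feature_names=col_names))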

Bagging

In [20]:
bag_scores = cross_validate(Bag(),train_X,train_y, cv = fold_generator, scoring =use_metrics)

Random Forest

In [21]:
rf_scores = cross_validate(RF(),train_X,train_y, cv = fold_generator, scoring =use_metrics)

Compare Models

In [22]:
# Output is a dictionary 
lm_scores
Out[22]:
{'fit_time': array([0.00570798, 0.00451398, 0.00458312, 0.00424194, 0.00383592]),
 'score_time': array([0.003268  , 0.00279593, 0.00283694, 0.00266433, 0.00251102]),
 'test_neg_mean_squared_error': array([-0.58631913, -0.51890753, -0.53932654, -0.47869447, -0.93168324])}

We can collect the scores of the metrics we care about.

In [23]:
# Collect all the metrics we care about as a dictionary 
collect_scores = \
dict(lm = lm_scores['test_neg_mean_squared_error']*-1,
     knn = knn_scores['test_neg_mean_squared_error']*-1,
     dt = dt_scores['test_neg_mean_squared_error']*-1,
     bag = bag_scores['test_neg_mean_squared_error']*-1,
     rf = rf_scores['test_neg_mean_squared_error']*-1)

# Convert to a data frame and reshape
collect_scores = pd.DataFrame(collect_scores).melt(var_name="Model",value_name="MSE")
collect_scores
Out[23]:
Model MSE
0 lm 0.586319
1 lm 0.518908
2 lm 0.539327
3 lm 0.478694
4 lm 0.931683
5 knn 0.709668
6 knn 0.564161
7 knn 0.544246
8 knn 0.578741
9 knn 0.998509
10 dt 1.158642
11 dt 0.952509
12 dt 1.139198
13 dt 0.949445
14 dt 1.506385
15 bag 0.815154
16 bag 0.712796
17 bag 0.748747
18 bag 0.650133
19 bag 1.155072
20 rf 0.859218
21 rf 0.656243
22 rf 0.709177
23 rf 0.645240
24 rf 1.106075

And Plot

In [24]:
# Get the order of the models
order = (collect_scores.groupby('Model').mean().sort_values(by="MSE").index.tolist())

# Plot
(
    ggplot(collect_scores,
          aes(x="Model",y="MSE")) +
    geom_boxplot() +
    scale_x_discrete(limits=order) +
    labs(x="Model",y="Mean Squared Error") +
    coord_flip() +
    theme_minimal() +
    theme(dpi=150)
)
Out[24]:
<ggplot: (-9223363274810763804)>

Lower is better. The linear model appears to do the best here.
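
To put numbers on the plot, we can also average the fold-level errors by model; a quick sketch:

# Sketch: average CV error by model (lower is better)
collect_scores.groupby('Model')['MSE'].mean().sort_values()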

Tuning Hyper-parameters

As we learned this week in the synchronous lectures, there are a number of parameters that we can tweak/tune that will change our model's performance.

For example, recall the K-Nearest Neighbors model. The value we choose for k (n_neighbors in sklearn), i.e. how many neighbors in the training data we look at when making a prediction for the test data, determines the degree to which we overfit or underfit the data.

In [25]:
set_k = dict()
for k in [1,5,10,50,100,250]:
    score = cross_validate(KNN(n_neighbors=k),
                           train_X,train_y, 
                           cv = fold_generator, 
                           scoring =use_metrics)
    s = score['test_neg_mean_squared_error']
    set_k[k] = s.mean()
 
set_k
Out[25]:
{1: -1.388088955689103,
 5: -0.6790648291052591,
 10: -0.6099530646497288,
 50: -0.5969684060363545,
 100: -0.6129444358560227,
 250: -0.6466538980663833}

As we can see from the above example, higher values of k yield a better out-of-sample fit (k = 1 being the worst), but past k = 50 we shift out of overfitting territory and into underfitting territory, and performance declines again. So choosing the right value of k is crucial to making this method work.
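
It can help to see this visually; a quick sketch plotting the cross-validated error against k using the set_k dictionary from the loop above:

# Sketch: plot CV error as a function of k (flipping the sign back to positive MSE)
k_perf = pd.DataFrame(dict(k=list(set_k.keys()),
                           mse=[-v for v in set_k.values()]))
(
    ggplot(k_perf, aes(x="k", y="mse")) +
    geom_line() + geom_point() +
    labs(x="Number of neighbors (k)", y="CV Mean Squared Error") +
    theme_minimal()
)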

Let's learn how to systematically tune parameters using sklearn's GridSearchCV.

GridSearchCV

A grid search performs an exhaustive search over specified parameter values for an estimator. That is, we specify all the potential values our tuning parameters can take on and then use cross-validation to compare out-of-sample performance across the different parameter configurations. The result is that we can identify the configuration that performs best.

To list off all the parameters for a model:

In [26]:
mod = KNN() # Initialize the model class
mod.get_params() # Report all the available tuning parameters 
Out[26]:
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

We can read sklearn's documentation to learn what each parameter does. But for now, let's focus on n_neighbors.

For the grid search, we just need to pass it a dictionary of all the tuning parameter values that we want to explore.

In [27]:
knn_tune_params = {'n_neighbors':[1, 10, 25, 35, 50, 75, 100, 250]}

From there we wrap the model in the GridSearchCV() class. Note that n_jobs just tells sklearn how many of your computer's cores to use when running the model. Since we're running the same model many times under different configurations, we can parallelize the process so it runs more quickly.

In [28]:
tune_knn = GridSearchCV(KNN(),knn_tune_params,
                        cv = fold_generator,
                        scoring='neg_mean_squared_error',
                        n_jobs=4)

We then .fit() the model as per usual.

In [29]:
tune_knn.fit(train_X,train_y)
Out[29]:
GridSearchCV(cv=KFold(n_splits=5, random_state=111, shuffle=True),
             estimator=KNeighborsRegressor(), n_jobs=4,
             param_grid={'n_neighbors': [1, 10, 25, 35, 50, 75, 100, 250]},
             scoring='neg_mean_squared_error')

Once fit, the model reports back the best fitting parameters.

In [30]:
tune_knn.best_params_
Out[30]:
{'n_neighbors': 25}
In [31]:
tune_knn.best_score_
Out[31]:
-0.5850209044409723

We can also gather information on the other configurations that were attempted.

In [32]:
tune_knn.cv_results_
Out[32]:
{'mean_fit_time': array([0.00479965, 0.00422468, 0.00420165, 0.00425835, 0.00418167,
        0.00426736, 0.00420904, 0.00406456]),
 'std_fit_time': array([3.41704828e-04, 1.44432598e-04, 7.93173859e-05, 7.23466476e-05,
        8.42438344e-05, 1.05744013e-04, 6.63825147e-05, 1.13050136e-04]),
 'mean_score_time': array([0.00370641, 0.00395451, 0.0044178 , 0.00457244, 0.00494294,
        0.005516  , 0.00599351, 0.00861745]),
 'std_score_time': array([7.92289583e-05, 1.35830193e-04, 1.04819676e-04, 5.71374291e-05,
        8.21555137e-05, 9.51262524e-05, 8.10825940e-05, 3.01865280e-04]),
 'param_n_neighbors': masked_array(data=[1, 10, 25, 35, 50, 75, 100, 250],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'n_neighbors': 1},
  {'n_neighbors': 10},
  {'n_neighbors': 25},
  {'n_neighbors': 35},
  {'n_neighbors': 50},
  {'n_neighbors': 75},
  {'n_neighbors': 100},
  {'n_neighbors': 250}],
 'split0_test_score': array([-2.13984951, -0.62408241, -0.5880639 , -0.59468237, -0.61274797,
        -0.62287667, -0.6382852 , -0.68633844]),
 'split1_test_score': array([-1.04771273, -0.50433109, -0.4887398 , -0.50267953, -0.50630502,
        -0.51770122, -0.5290712 , -0.56836742]),
 'split2_test_score': array([-1.1697432 , -0.49829077, -0.49206878, -0.4929248 , -0.498598  ,
        -0.51010189, -0.51775249, -0.53619051]),
 'split3_test_score': array([-0.98116526, -0.47905105, -0.44308925, -0.43911157, -0.45111988,
        -0.4600715 , -0.45962209, -0.49445058]),
 'split4_test_score': array([-1.60197409, -0.94401001, -0.91314278, -0.91993459, -0.91607116,
        -0.91686848, -0.9199912 , -0.94792254]),
 'mean_test_score': array([-1.38808896, -0.60995306, -0.5850209 , -0.58986657, -0.59696841,
        -0.60552395, -0.61294444, -0.6466539 ]),
 'std_test_score': array([0.43359163, 0.17467312, 0.17072081, 0.17244967, 0.16807376,
        0.16444522, 0.16402673, 0.16360589]),
 'rank_test_score': array([8, 5, 1, 2, 3, 4, 6, 7], dtype=int32)}
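
The raw dictionary is a bit unwieldy; a minimal sketch of collecting the parameter settings and their mean CV errors into a tidy data frame:

# Sketch: tidy the grid search results into a data frame (flipping back to positive MSE)
knn_results = pd.DataFrame(tune_knn.cv_results_["params"])
knn_results["mean_mse"] = tune_knn.cv_results_["mean_test_score"] * -1
knn_results.sort_values("mean_mse")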

Model Tuning

With the above in hand, let's tune the hyperparameters for our various models. We've already tuned our KNN; let's do the same for the other models. (Note: the linear model has no hyperparameters, and Bagging is an ensemble approach rather than its own model, so we won't include it in our tuning.)

Decision Tree

All the parameters for a decision tree. Let's focus on max_depth.

In [33]:
DTree().get_params()
Out[33]:
{'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': 'deprecated',
 'random_state': None,
 'splitter': 'best'}
In [34]:
tune_dt = GridSearchCV(DTree(),{'max_depth':[i for i in range(10)]},
                        cv = fold_generator,
                        scoring='neg_mean_squared_error',
                        n_jobs=4)
In [35]:
tune_dt.fit(train_X,train_y)
Out[35]:
GridSearchCV(cv=KFold(n_splits=5, random_state=111, shuffle=True),
             estimator=DecisionTreeRegressor(), n_jobs=4,
             param_grid={'max_depth': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]},
             scoring='neg_mean_squared_error')
In [36]:
tune_dt.best_params_
Out[36]:
{'max_depth': 2}
In [37]:
tune_dt.best_score_
Out[37]:
-0.6120433542815239
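
Since GridSearchCV refits the best configuration on the full training set (refit=True by default), we can pull out that tree and plot its rules just as in the side note above; a sketch:

# Sketch: plot the rules of the best tree found by the grid search
best_tree = tune_dt.best_estimator_
plt.figure(figsize=(12,8), dpi=150)
tree.plot_tree(best_tree, feature_names=col_names, fontsize=8)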

Random Forest

For the random forest, we see many of the same parameters as a decision tree, which makes sense given how the random forest works. We'll focus on max_depth, n_estimators (i.e. the number of trees to grow and ensemble across), and max_features (i.e. the number of predictors randomly considered at each split when growing the trees).

Note that there are only three features here, so there is only so much the randomness in the Random Forest can help, but as the feature space grows, the Random Forest becomes a very useful learning method.

In [38]:
RF().get_params()
Out[38]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
In [39]:
rf_params = {'max_depth':[1,2,3],
             'n_estimators':[100,500,1000],
              'max_features': [1,2]} # Only have three total. 
tune_rf = GridSearchCV(RF(),rf_params,
                        cv = fold_generator,
                        scoring='neg_mean_squared_error',
                        n_jobs=4)
In [40]:
# This model will take longer to run given all the tuning parameters we're exploring
tune_rf.fit(train_X,train_y) 
Out[40]:
GridSearchCV(cv=KFold(n_splits=5, random_state=111, shuffle=True),
             estimator=RandomForestRegressor(), n_jobs=4,
             param_grid={'max_depth': [1, 2, 3], 'max_features': [1, 2],
                         'n_estimators': [100, 500, 1000]},
             scoring='neg_mean_squared_error')
In [41]:
tune_rf.best_params_
Out[41]:
{'max_depth': 3, 'max_features': 2, 'n_estimators': 1000}
In [42]:
tune_rf.best_score_
Out[42]:
-0.5887192624087579
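
As a side benefit, the refit random forest exposes feature importances, which give a rough sense of which predictors the ensemble leans on; a sketch:

# Sketch: feature importances from the best random forest found by the scan
pd.Series(tune_rf.best_estimator_.feature_importances_,
          index=col_names).sort_values(ascending=False)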

Modeling Pipeline

Let's piece together all these steps as one seamless pipeline!

In [43]:
# (0) Split the data 
train_X, test_X, train_y, test_y = train_test_split(X,y,test_size=.25,random_state=1988)

# (1) Set the folds index to ensure comparable samples
fold_generator = KFold(n_splits=5, shuffle=True,random_state=111)

# (2) Next specify the preprocessing steps
preprocess = ColumnTransformer(transformers=[('num', pp.MinMaxScaler(), ['BATHRM','ROOMS','LANDAREA'])])


# (3) Next, create our model pipe (note: we leave None as a placeholder for the model)
pipe = Pipeline(steps=[('pre_process', preprocess),
                       ('model',None)])


# (4) Specify the models and their respective tuning parameters. 
# Note the naming convention here to reference the model key
search_space = [
    # Linear Model
    {'model' : [LM()]},
    
    # KNN with K tuning param
    {'model' : [KNN()],
     'model__n_neighbors':[10,15,20,25,30]},
    
    # Decision Tree with the Max Depth Param
    {'model': [DTree()],
     'model__max_depth':[1,2,3,5]},
    
    # The Bagging decision tree model 
    {'model': [Bag()]},
    
    # Random forest with the max depth and N estimators tuning params
    {'model' : [RF()],
     'model__max_depth':[1,2,3],
     'model__n_estimators':[500,1000,1250]},
]


# (5) Put it all together in the grid search
search = GridSearchCV(pipe, search_space, 
                      cv = fold_generator,
                      scoring='neg_mean_squared_error',
                      n_jobs=4)

# (6) Fit the model to the training data
search.fit(train_X,train_y)
Out[43]:
GridSearchCV(cv=KFold(n_splits=5, random_state=111, shuffle=True),
             estimator=Pipeline(steps=[('pre_process',
                                        ColumnTransformer(transformers=[('num',
                                                                         MinMaxScaler(),
                                                                         ['BATHRM',
                                                                          'ROOMS',
                                                                          'LANDAREA'])])),
                                       ('model', None)]),
             n_jobs=4,
             param_grid=[{'model': [LinearRegression()]},
                         {'model': [KNeighborsRegressor(n_neighbors=20)],
                          'model__n_neighbors': [10, 15, 20, 25, 30]},
                         {'model': [DecisionTreeRegressor()],
                          'model__max_depth': [1, 2, 3, 5]},
                         {'model': [BaggingRegressor()]},
                         {'model': [RandomForestRegressor()],
                          'model__max_depth': [1, 2, 3],
                          'model__n_estimators': [500, 1000, 1250]}],
             scoring='neg_mean_squared_error')

Let's look at the best fit score from the scan.

In [44]:
search.best_score_ # Mean out-of-sample (CV) error
Out[44]:
-0.578793336246449

Look at the best model

In [45]:
search.best_params_
Out[45]:
{'model': KNeighborsRegressor(n_neighbors=20), 'model__n_neighbors': 20}
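
The refit winner is the entire pipeline (pre-processing plus model); a quick sketch of inspecting just the fitted model step:

# Sketch: the best estimator is the full pipeline; pull out just the model step
search.best_estimator_.named_steps["model"]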

Test Performance

Let's assume that our work is all done. We scanned the tuning parameters for all our candidate models and found the best performing model. Let's now see how well it does on the test data that it wasn't trained on.

Note that our new data is automatically preprocessed before being fit thanks to setting everything up as a pipeline!

In [46]:
# Predict() method will use the best model out of the scan
pred_y = search.predict(test_X)
In [47]:
m.mean_squared_error(test_y,pred_y)
Out[47]:
0.5090400475018597
In [48]:
m.r2_score(test_y,pred_y)
Out[48]:
0.2466639400199072
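
Since the outcome is the log of the selling price, it can be useful to translate the error back onto the dollar scale; a sketch (assuming simple exponentiation of the logged values is an acceptable back-transformation here):

# Sketch: root mean squared error on the dollar scale (exponentiating the logged values)
np.sqrt(m.mean_squared_error(np.exp(test_y), np.exp(pred_y)))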

Visualize

In [49]:
(
    ggplot(pd.DataFrame(dict(pred=pred_y,truth=test_y)),
          aes(x='pred',y="truth")) +
    geom_point(alpha=.75) +
    geom_abline(linetype="dashed",color="darkred",size=1) +
    theme_bw() +
    theme(figure_size=(10,7))
)
Out[49]:
<ggplot: (-9223363274751490770)>