PPOL564 | Data Science 1 | Foundations

Statistical Learning Walkthrough


Comparing and Tuning Models

Dependencies

In [1]:
# Data Management/Investigation
import pandas as pd
import numpy as np
import missingno as miss
from plotnine import *
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# For pre-processing data 
from sklearn import preprocessing as pp 
from sklearn.compose import ColumnTransformer 

# For splits and CV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold # Cross validation 
from sklearn.model_selection import cross_validate # Cross validation 
from sklearn.model_selection import GridSearchCV # Cross validation + param. tuning.

# Machine learning methods 
from sklearn.linear_model import LinearRegression as LM
from sklearn.neighbors import KNeighborsRegressor as KNN
from sklearn.tree import DecisionTreeRegressor as DTree
from sklearn import tree # For plotting the decision tree rules
from sklearn.ensemble import BaggingRegressor as Bag
from sklearn.ensemble import RandomForestRegressor as RF

# For evaluating our model's performance
import sklearn.metrics as m

# Pipeline to combine modeling elements
from sklearn.pipeline import Pipeline

Data

DC Housing Price data.

In [2]:
# Data on Housing Prices in DC
dat = pd.read_csv("dc_housing_prices.csv")
dat.shape
Out[2]:
(107154, 39)
In [3]:
# Available features
list(dat)
Out[3]:
['OBJECTID',
 'SSL',
 'BATHRM',
 'HF_BATHRM',
 'HEAT',
 'HEAT_D',
 'AC',
 'NUM_UNITS',
 'ROOMS',
 'BEDRM',
 'AYB',
 'YR_RMDL',
 'EYB',
 'STORIES',
 'SALEDATE',
 'PRICE',
 'QUALIFIED',
 'SALE_NUM',
 'GBA',
 'BLDG_NUM',
 'STYLE',
 'STYLE_D',
 'STRUCT',
 'STRUCT_D',
 'GRADE',
 'GRADE_D',
 'CNDTN',
 'CNDTN_D',
 'EXTWALL',
 'EXTWALL_D',
 'ROOF',
 'ROOF_D',
 'INTWALL',
 'INTWALL_D',
 'KITCHENS',
 'FIREPLACES',
 'USECODE',
 'LANDAREA',
 'GIS_LAST_MOD_DTTM']

Look at missingness in the data. Issue: we're missing information on the outcome PRICE.

In [4]:
miss.matrix(dat)
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81582c6e10>

Let's do the following:

  • Only retain data values where the outcome is measured.
  • Only retain observations where there is a selling price (i.e. when PRICE == 0, the house hasn't been sold; we're interested in what predicts the sale price given that the house was sold).
  • Subsample the data. We'll only look at some of it for now (so the models run faster; you can run the full data in the coding discussion).
In [5]:
# Drop all observations where the outcome is missing 
dat = dat[~dat.PRICE.isna()]

# Only houses that are sold
dat = dat[dat.PRICE > 0]

# Subsample
dat = dat.sample(1000,random_state=1988)

Set up data in the SKLEARN framework (i.e. outcome separated from the predictors).

Note: I'm not going to use all the data for this walkthrough.

In [6]:
y = np.log(dat['PRICE']) # OUTCOME: log selling price (some houses are ridiculously expensive)
X = dat[['BATHRM','ROOMS','LANDAREA']] # PREDICTORS: simplified for our purposes
X['LANDAREA'] = np.log(X['LANDAREA']) # Log land area (some houses are on a lot of land, others on barely a plot)

Split the data

In [7]:
train_X, test_X, train_y, test_y = train_test_split(X,y,test_size=.25,random_state=1988)

Explore the training data

In [8]:
# Look at the training data
train_X.shape 
Out[8]:
(750, 3)

How is the data distributed?

In [9]:
(
    ggplot(train_X.melt(),aes(x="value")) +
    geom_histogram() +
    facet_wrap("variable",scales="free") +
    theme_minimal() +
    theme(figure_size = (10,3)) 
) 
Out[9]:
<ggplot: (8762103043357)>

Insights:

  • Scale will be an issue, so we need to pre-process.
  • BATHRM and ROOMS are discrete count variables that take on only a few values.

Look at the outcome PRICE and see if there are any red flags.

In [10]:
(
    ggplot(pd.DataFrame(dict(price=train_y)),
           aes(x="price")) +
    geom_histogram() +
    theme_minimal() +
    theme(figure_size = (7,3)) 
)
Out[10]:
<ggplot: (-9223363274760488276)>

Let's just look at how our predictors relate to the outcome.

In [11]:
D = train_X.copy()
D['PRICE'] = train_y

(
    ggplot(D.melt(id_vars=["PRICE"]),
           aes(x="value",y="PRICE"))+
    geom_point(alpha=.5) +
    facet_wrap("variable",scales="free") +
    geom_smooth(method="lm",se=False,color="red") +
    theme_minimal() +
    theme(figure_size = (10,3)) 
)
Out[11]:
<ggplot: (8762144576113)>

Pre-Processing

Let's rescale our training data for now, but we'll put the preprocessing into a modeling pipeline later on.

In [12]:
# Rescale the training features to the unit interval
scaler = pp.MinMaxScaler()
col_names = list(train_X)
train_X = scaler.fit_transform(train_X)
In [13]:
# Convert back into data frame
train_X = pd.DataFrame(train_X,columns=col_names)
train_X
Out[13]:
BATHRM ROOMS LANDAREA
0 0.375 0.35 0.342411
1 0.250 0.40 0.577270
2 0.125 0.30 0.617689
3 0.125 0.30 0.297460
4 0.375 0.45 0.276442
... ... ... ...
745 0.250 0.40 0.369246
746 0.250 0.30 0.333062
747 0.250 0.40 0.711232
748 0.250 0.35 0.206009
749 0.500 0.75 0.624465

750 rows × 3 columns
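
As a quick sanity check (not part of the original output), every column should now span [0, 1] after MinMax scaling. A minimal sketch of verifying this:

# Sanity check (sketch): after MinMax scaling, each column should range from 0 to 1
train_X.agg(['min', 'max'])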

Modeling

The problem that we're working on is a regression problem (i.e. we have a continuously distributed outcome variable). We've played around with a few models that can tackle problems such as these.

  • Linear Regression
  • KNN
  • Decision Trees
  • Bagging
  • Random Forest

Luckily, implementing each of these (very different) algorithms is straightforward using the sklearn API.

For each model, we'll use K-fold cross-validation to estimate the test error. As we saw a few weeks back, we can do this easily with the cross_validate method. We'll use 5 folds when cross-validating.

Recall that in order to make valid comparisons across models, we need to use the same data splits for each fold. We can do this by creating a KFold generator that will ensure we're using the same break points when training the model.

In [14]:
fold_generator = KFold(n_splits=5, shuffle=True,random_state=111)

Let's run each model individually (using defaults for any tuning parameters; more on this later) and then compare the performance of the different models in a plot.

We'll use mean squared error as our performance metric.

In [15]:
use_metrics = ["neg_mean_squared_error"]
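
A quick aside: sklearn scorers follow a "higher is better" convention, so mean squared error is reported as neg_mean_squared_error. A minimal sketch of the sign convention, using made-up values:

# Sketch with made-up values: sklearn reports -MSE so that higher scores are better
y_true = np.array([1.0, 2.0, 3.0])
y_hat  = np.array([1.5, 1.5, 3.5])
mse = m.mean_squared_error(y_true, y_hat)  # 0.25
neg_mse = -mse                             # what cross_validate will report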

Run the Models

Linear Model

In [16]:
lm_scores = cross_validate(LM(),train_X,train_y, cv = fold_generator, scoring =use_metrics)

KNN

In [17]:
knn_scores = cross_validate(KNN(),train_X,train_y, cv = fold_generator, scoring =use_metrics)

Decision Tree

In [18]:
dt_scores = cross_validate(DTree(),train_X,train_y, cv = fold_generator, scoring =use_metrics)

Side Note: Plotting a Decision Tree

First, a quick side note on decision trees. When we built a tree from scratch, we discussed looking at the tree's decision rules. There is a very straightforward way of doing this in sklearn.

In [19]:
mod = DTree(max_depth=3) # Initialize the modeling object (just as we did)
mod.fit(train_X,train_y) # Fit the model

# Plot the tree
plt.figure(figsize=(12,8),dpi=300)
rules = tree.plot_tree(mod,feature_names = col_names,fontsize=8)
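
If a plot is more than you need, the decision rules can also be printed as text. A minimal sketch using export_text (an sklearn utility not imported in the dependencies above):

# Sketch: print the fitted tree's decision rules as text
from sklearn.tree import export_text
print(export_text(mod, feature_names=col_names))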

Bagging

In [20]:
bag_scores = cross_validate(Bag(),train_X,train_y, cv = fold_generator, scoring =use_metrics)

Random Forest

In [21]:
rf_scores = cross_validate(RF(),train_X,train_y, cv = fold_generator, scoring =use_metrics)

Compare Models

In [22]:
# Output is a dictionary 
lm_scores
Out[22]:
{'fit_time': array([0.00570798, 0.00451398, 0.00458312, 0.00424194, 0.00383592]),
 'score_time': array([0.003268  , 0.00279593, 0.00283694, 0.00266433, 0.00251102]),
 'test_neg_mean_squared_error': array([-0.58631913, -0.51890753, -0.53932654, -0.47869447, -0.93168324])}

We can collect the scores of the metrics we care about.

In [23]:
# Collect all the metrics we care about as a dictionary 
collect_scores = \
dict(lm = lm_scores['test_neg_mean_squared_error']*-1,
     knn = knn_scores['test_neg_mean_squared_error']*-1,
     dt = dt_scores['test_neg_mean_squared_error']*-1,
     bag = bag_scores['test_neg_mean_squared_error']*-1,
     rf = rf_scores['test_neg_mean_squared_error']*-1)

# Convert to a data frame and reshape
collect_scores = pd.DataFrame(collect_scores).melt(var_name="Model",value_name="MSE")
collect_scores
Out[23]:
Model MSE
0 lm 0.586319
1 lm 0.518908
2 lm 0.539327
3 lm 0.478694
4 lm 0.931683
5 knn 0.709668
6 knn 0.564161
7 knn 0.544246
8 knn 0.578741
9 knn 0.998509
10 dt 1.158642
11 dt 0.952509
12 dt 1.139198
13 dt 0.949445
14 dt 1.506385
15 bag 0.815154
16 bag 0.712796
17 bag 0.748747
18 bag 0.650133
19 bag 1.155072
20 rf 0.859218
21 rf 0.656243
22 rf 0.709177
23 rf 0.645240
24 rf 1.106075

And Plot

In [24]:
# Get the order of the models
order = (collect_scores.groupby('Model').mean().sort_values(by="MSE").index.tolist())

# Plot
(
    ggplot(collect_scores,
          aes(x="Model",y="MSE")) +
    geom_boxplot() +
    scale_x_discrete(limits=order) +
    labs(x="Model",y="Mean Squared Error") +
    coord_flip() +
    theme_minimal() +
    theme(dpi=150)
)
Out[24]:
<ggplot: (-9223363274810763804)>

Lower is better. The linear model appears to do the best here.
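
To put numbers on the plot, we can also average the fold-level errors by model; a quick sketch:

# Sketch: average CV error by model (lower is better)
collect_scores.groupby('Model')['MSE'].mean().sort_values()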

Tuning Hyper-parameters

As we learned this week in the synchronous lectures, there are a number of parameters that we can tweak/tune that will change our model's performance.

For example, recall the K-Nearest Neighbors model. The value we choose for k (n_neighbors in sklearn), i.e. how many neighbors in the training data we look at when making a prediction for the test data, determines the degree to which we overfit or underfit the data.

In [25]:
set_k = dict()
for k in [1,5,10,50,100,250]:
    score = cross_validate(KNN(n_neighbors=k),
                           train_X,train_y, 
                           cv = fold_generator, 
                           scoring =use_metrics)
    s = score['test_neg_mean_squared_error']
    set_k[k] = s.mean()
 
set_k
Out[25]:
{1: -1.388088955689103,
 5: -0.6790648291052591,
 10: -0.6099530646497288,
 50: -0.5969684060363545,
 100: -0.6129444358560227,
 250: -0.6466538980663833}

As we can see from the above example, higher values of k yield a better out-of-sample fit (k = 1 being the worst), but past k = 50 we shift out of overfitting territory and into underfitting territory, and performance declines again. So choosing the right value of k is crucial to making this method work.
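
It can help to see this visually; a quick sketch plotting the cross-validated error against k using the set_k dictionary from the loop above:

# Sketch: plot CV error as a function of k (flipping the sign back to positive MSE)
k_perf = pd.DataFrame(dict(k=list(set_k.keys()),
                           mse=[-v for v in set_k.values()]))
(
    ggplot(k_perf, aes(x="k", y="mse")) +
    geom_line() + geom_point() +
    labs(x="Number of neighbors (k)", y="CV Mean Squared Error") +
    theme_minimal()
)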

Let's learn how to systematically tune parameters using sklearn's GridSearchCV.

GridSearchCV

A grid search performs an exhaustive search over specified parameter values for an estimator. That is, we specify all the potential values our tuning parameters can take on and then use cross-validation to compare out-of-sample performance across the different parameter configurations. The result is that we can identify the configuration that performs best.

To list off all the parameters for a model:

In [26]:
mod = KNN() # Initialize the model class
mod.get_params() # Report all the available tuning parameters 
Out[26]:
{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

We can read sklearn's documentation to learn what each parameter does. But for now, let's focus on n_neighbors.

For the grid search, we just need to pass it a dictionary of all the tuning parameter values that we want to explore.

In [27]:
knn_tune_params = {'n_neighbors':[1, 10, 25, 35, 50, 75, 100, 250]}

From there we wrap the model in the GridSearchCV() class. Note that n_jobs just tells sklearn how many of your computer's cores to use when running the model. Since we're running the same model many times under different configurations, we can parallelize the process so it runs more quickly.

In [28]:
tune_knn = GridSearchCV(KNN(),knn_tune_params,
                        cv = fold_generator,
                        scoring='neg_mean_squared_error',
                        n_jobs=4)

We then .fit() the model as per usual.

In [29]:
tune_knn.fit(train_X,train_y)
Out[29]:
GridSearchCV(cv=KFold(n_splits=5, random_state=111, shuffle=True),
             estimator=KNeighborsRegressor(), n_jobs=4,
             param_grid={'n_neighbors': [1, 10, 25, 35, 50, 75, 100, 250]},
             scoring='neg_mean_squared_error')

Once fit, the model reports back the best fitting parameters.

In [30]:
tune_knn.best_params_
Out[30]:
{'n_neighbors': 25}
In [31]:
tune_knn.best_score_
Out[31]:
-0.5850209044409723

We can also gather information on the other configurations that were attempted.

In [32]:
tune_knn.cv_results_
Out[32]:
{'mean_fit_time': array([0.00479965, 0.00422468, 0.00420165, 0.00425835, 0.00418167,
        0.00426736, 0.00420904, 0.00406456]),
 'std_fit_time': array([3.41704828e-04, 1.44432598e-04, 7.93173859e-05, 7.23466476e-05,
        8.42438344e-05, 1.05744013e-04, 6.63825147e-05, 1.13050136e-04]),
 'mean_score_time': array([0.00370641, 0.00395451, 0.0044178 , 0.00457244, 0.00494294,
        0.005516  , 0.00599351, 0.00861745]),
 'std_score_time': array([7.92289583e-05, 1.35830193e-04, 1.04819676e-04, 5.71374291e-05,
        8.21555137e-05, 9.51262524e-05, 8.10825940e-05, 3.01865280e-04]),
 'param_n_neighbors': masked_array(data=[1, 10, 25, 35, 50, 75, 100, 250],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'n_neighbors': 1},
  {'n_neighbors': 10},
  {'n_neighbors': 25},
  {'n_neighbors': 35},
  {'n_neighbors': 50},
  {'n_neighbors': 75},
  {'n_neighbors': 100},
  {'n_neighbors': 250}],
 'split0_test_score': array([-2.13984951, -0.62408241, -0.5880639 , -0.59468237, -0.61274797,
        -0.62287667, -0.6382852 , -0.68633844]),
 'split1_test_score': array([-1.04771273, -0.50433109, -0.4887398 , -0.50267953, -0.50630502,
        -0.51770122, -0.5290712 , -0.56836742]),
 'split2_test_score': array([-1.1697432 , -0.49829077, -0.49206878, -0.4929248 , -0.498598  ,
        -0.51010189, -0.51775249, -0.53619051]),
 'split3_test_score': array([-0.98116526, -0.47905105, -0.44308925, -0.43911157, -0.45111988,
        -0.4600715 , -0.45962209, -0.49445058]),
 'split4_test_score': array([-1.60197409, -0.94401001, -0.91314278, -0.91993459, -0.91607116,
        -0.91686848, -0.9199912 , -0.94792254]),
 'mean_test_score': array([-1.38808896, -0.60995306, -0.5850209 , -0.58986657, -0.59696841,
        -0.60552395, -0.61294444, -0.6466539 ]),
 'std_test_score': array([0.43359163, 0.17467312, 0.17072081, 0.17244967, 0.16807376,
        0.16444522, 0.16402673, 0.16360589]),
 'rank_test_score': array([8, 5, 1, 2, 3, 4, 6, 7], dtype=int32)}
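
The raw dictionary is a bit unwieldy; a minimal sketch of collecting the parameter settings and their mean CV errors into a tidy data frame:

# Sketch: tidy the grid search results into a data frame (flipping back to positive MSE)
knn_results = pd.DataFrame(tune_knn.cv_results_["params"])
knn_results["mean_mse"] = tune_knn.cv_results_["mean_test_score"] * -1
knn_results.sort_values("mean_mse")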

Model Tuning

With the above in hand, let's tune the hyperparameters for our various models. We've already tuned our KNN; let's do the same for the other models. (Note: the linear model has no hyperparameters, and Bagging is an ensemble approach rather than its own model, so we won't include it in our tuning.)

Decision Tree

All the parameters for a decision tree. Let's focus on max_depth.

In [33]:
DTree().get_params()
Out[33]:
{'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': 'deprecated',
 'random_state': None,
 'splitter': 'best'}
In [34]:
tune_dt = GridSearchCV(DTree(),{'max_depth':[i for i in range(10)]},
                        cv = fold_generator,
                        scoring='neg_mean_squared_error',
                        n_jobs=4)
In [35]:
tune_dt.fit(train_X,train_y)
Out[35]:
GridSearchCV(cv=KFold(n_splits=5, random_state=111, shuffle=True),
             estimator=DecisionTreeRegressor(), n_jobs=4,
             param_grid={'max_depth': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]},
             scoring='neg_mean_squared_error')
In [36]:
tune_dt.best_params_
Out[36]:
{'max_depth': 2}
In [37]:
tune_dt.best_score_
Out[37]:
-0.6120433542815239
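
Since GridSearchCV refits the best configuration on the full training set (refit=True by default), we can pull out that tree and plot its rules just as in the side note above; a sketch:

# Sketch: plot the rules of the best tree found by the grid search
best_tree = tune_dt.best_estimator_
plt.figure(figsize=(12,8), dpi=150)
tree.plot_tree(best_tree, feature_names=col_names, fontsize=8)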

Random Forest

For the random forest, we see many of the same parameters as a decision tree, which makes sense given how the random forest works. We'll focus on max_depth, n_estimators (i.e. the number of trees to grow and ensemble across), and max_features (i.e. the number of predictors randomly considered at each split when growing the trees).

Note that there are only three features here, so there is only so much the randomness in the Random Forest can help, but as the feature space grows, the Random Forest becomes a very useful learning method.

In [38]:
RF().get_params()
Out[38]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
In [39]:
rf_params = {'max_depth':[1,2,3],
             'n_estimators':[100,500,1000],
              'max_features': [1,2]} # Only have three total. 
tune_rf = GridSearchCV(RF(),rf_params,
                        cv = fold_generator,
                        scoring='neg_mean_squared_error',
                        n_jobs=4)
In [40]:
# This model will take longer to run given all the tuning parameters we're exploring
tune_rf.fit(train_X,train_y) 
Out[40]:
GridSearchCV(cv=KFold(n_splits=5, random_state=111, shuffle=True),
             estimator=RandomForestRegressor(), n_jobs=4,
             param_grid={'max_depth': [1, 2, 3], 'max_features': [1, 2],
                         'n_estimators': [100, 500, 1000]},
             scoring='neg_mean_squared_error')
In [41]:
tune_rf.best_params_
Out[41]:
{'max_depth': 3, 'max_features': 2, 'n_estimators': 1000}
In [42]:
tune_rf.best_score_
Out[42]:
-0.5887192624087579
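
As a side benefit, the refit random forest exposes feature importances, which give a rough sense of which predictors the ensemble leans on; a sketch:

# Sketch: feature importances from the best random forest found by the scan
pd.Series(tune_rf.best_estimator_.feature_importances_,
          index=col_names).sort_values(ascending=False)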

Modeling Pipeline

Let's piece together all these steps as one seamless pipeline!

In [43]:
# (0) Split the data 
train_X, test_X, train_y, test_y = train_test_split(X,y,test_size=.25,random_state=1988)

# (1) Set the folds index to ensure comparable samples
fold_generator = KFold(n_splits=5, shuffle=True,random_state=111)

# (2) Next specify the preprocessing steps
preprocess = ColumnTransformer(transformers=[('num', pp.MinMaxScaler(), ['BATHRM','ROOMS','LANDAREA'])])


# (3) Next, create our model pipe (note: we leave None as a placeholder for the model)
pipe = Pipeline(steps=[('pre_process', preprocess),
                       ('model',None)])


# (4) Specify the models and their respective tuning parameters. 
# Note the naming convention here to reference the model key
search_space = [
    # Linear Model
    {'model' : [LM()]},
    
    # KNN with K tuning param
    {'model' : [KNN()],
     'model__n_neighbors':[10,15,20,25,30]},
    
    # Decision Tree with the Max Depth Param
    {'model': [DTree()],
     'model__max_depth':[1,2,3,5]},
    
    # The Bagging decision tree model 
    {'model': [Bag()]},
    
    # Random forest with the max depth and N estimators tuning params
    {'model' : [RF()],
     'model__max_depth':[1,2,3],
     'model__n_estimators':[500,1000,1250]},
]


# (5) Put it all together in the grid search
search = GridSearchCV(pipe, search_space, 
                      cv = fold_generator,
                      scoring='neg_mean_squared_error',
                      n_jobs=4)

# (6) Fit the model to the training data
search.fit(train_X,train_y)
Out[43]:
GridSearchCV(cv=KFold(n_splits=5, random_state=111, shuffle=True),
             estimator=Pipeline(steps=[('pre_process',
                                        ColumnTransformer(transformers=[('num',
                                                                         MinMaxScaler(),
                                                                         ['BATHRM',
                                                                          'ROOMS',
                                                                          'LANDAREA'])])),
                                       ('model', None)]),
             n_jobs=4,
             param_grid=[{'model': [LinearRegression()]},
                         {'model': [KNeighborsRegressor(n_neighbors=20)],
                          'model__n_neighbors': [10, 15, 20, 25, 30]},
                         {'model': [DecisionTreeRegressor()],
                          'model__max_depth': [1, 2, 3, 5]},
                         {'model': [BaggingRegressor()]},
                         {'model': [RandomForestRegressor()],
                          'model__max_depth': [1, 2, 3],
                          'model__n_estimators': [500, 1000, 1250]}],
             scoring='neg_mean_squared_error')

Let's look at the best fit score from the scan.

In [44]:
search.best_score_ # Mean out-of-sample (CV) error
Out[44]:
-0.578793336246449

Look at the best model

In [45]:
search.best_params_
Out[45]:
{'model': KNeighborsRegressor(n_neighbors=20), 'model__n_neighbors': 20}
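
The refit winner is the entire pipeline (pre-processing plus model); a quick sketch of inspecting just the fitted model step:

# Sketch: the best estimator is the full pipeline; pull out just the model step
search.best_estimator_.named_steps["model"]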

Test Performance

Let's assume that our work is all done. We scanned the tuning parameters for all our candidate models and found the best performing model. Let's now see how well it does on the test data that it wasn't trained on.

Note that our new data is automatically preprocessed before being fit thanks to setting everything up as a pipeline!

In [46]:
# Predict() method will use the best model out of the scan
pred_y = search.predict(test_X)
In [47]:
m.mean_squared_error(test_y,pred_y)
Out[47]:
0.5090400475018597
In [48]:
m.r2_score(test_y,pred_y)
Out[48]:
0.2466639400199072
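
Since the outcome is the log of the selling price, it can be useful to translate the error back onto the dollar scale; a sketch (assuming simple exponentiation of the logged values is an acceptable back-transformation here):

# Sketch: root mean squared error on the dollar scale (exponentiating the logged values)
np.sqrt(m.mean_squared_error(np.exp(test_y), np.exp(pred_y)))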

Visualize

In [49]:
(
    ggplot(pd.DataFrame(dict(pred=pred_y,truth=test_y)),
          aes(x='pred',y="truth")) +
    geom_point(alpha=.75) +
    geom_abline(linetype="dashed",color="darkred",size=1) +
    theme_bw() +
    theme(figure_size=(10,7))
)
Out[49]:
<ggplot: (-9223363274751490770)>