caret
In this notebook, we’ll apply the machine learning concepts covered in the supervised learning lecture on regression methods. Note that there are many libraries that can perform the methods that we reviewed in class, but we’ll focus here on using the caret
package to perform these operations.
require(tidyverse)
require(caret) # for machine learning
require(recipes) # For preprocessing your data
require(rsample) # for train test splits
require(rattle) # For nice tree plots
require(yardstick) # for performance metrics
# For parallelization (to run the models across many cores) -- speeds up computation!
# install.packages("doMC")
doMC::registerDoMC()
For this exercise, let’s examine some historical data of bike sharing in London downloaded from Kaggle. The data contains information on rider usage given a common set of climate and time predictors.
Assume we work for the company that runs the ride sharing venture: our aim is to build a model that best predicts the number of riders (cnt
) at any given moment in time so that we can distribute resources appropriately. Please see the Kaggle site for a description of the variables: https://www.kaggle.com/hmavrodiev/london-bike-sharing-dataset#london_merged.csv
dat = read_csv("london_bike_sharing.csv")
Parsed with column specification:
cols(
timestamp = col_datetime(format = ""),
cnt = col_double(),
t1 = col_double(),
t2 = col_double(),
hum = col_double(),
wind_speed = col_double(),
weather_code = col_double(),
is_holiday = col_double(),
is_weekend = col_double(),
season = col_double()
)
glimpse(dat)
Rows: 17,414
Columns: 10
$ timestamp <dttm> 2015-01-04 00:00:00, 2015-01-04 01:00:00, 2015-01-04 …
$ cnt <dbl> 182, 138, 134, 72, 47, 46, 51, 75, 131, 301, 528, 727,…
$ t1 <dbl> 3.0, 3.0, 2.5, 2.0, 2.0, 2.0, 1.0, 1.0, 1.5, 2.0, 3.0,…
$ t2 <dbl> 2.0, 2.5, 2.5, 2.0, 0.0, 2.0, -1.0, -1.0, -1.0, -0.5, …
$ hum <dbl> 93.0, 93.0, 96.5, 100.0, 93.0, 93.0, 100.0, 100.0, 96.…
$ wind_speed <dbl> 6.0, 5.0, 0.0, 0.0, 6.5, 4.0, 7.0, 7.0, 8.0, 9.0, 12.0…
$ weather_code <dbl> 3, 1, 1, 1, 1, 1, 4, 4, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, …
$ is_holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ is_weekend <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ season <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
Before even looking at the data, let’s split the sample into a training and a test dataset. We’ll completely hold off on viewing the test data, so as not to bias our development of the learning model.
set.seed(123)
splits = initial_split(dat,prop = .8)
train_data = training(splits) # Use 80% of the data as training data
test_data = testing(splits) # holdout 20% as test data
dim(train_data)
[1] 13932 10
dim(test_data)
[1] 3482 10
summary(train_data)
timestamp cnt t1
Min. :2015-01-04 00:00:00 Min. : 0.0 Min. :-1.50
1st Qu.:2015-07-03 14:45:00 1st Qu.: 252.8 1st Qu.: 8.00
Median :2016-01-02 10:30:00 Median : 840.0 Median :12.50
Mean :2016-01-03 02:55:35 Mean :1136.1 Mean :12.45
3rd Qu.:2016-07-04 07:45:00 3rd Qu.:1657.0 3rd Qu.:16.00
Max. :2017-01-03 23:00:00 Max. :7860.0 Max. :34.00
t2 hum wind_speed weather_code
Min. :-6.0 Min. : 20.5 Min. : 0.00 Min. : 1.000
1st Qu.: 6.0 1st Qu.: 63.0 1st Qu.:10.00 1st Qu.: 1.000
Median :12.5 Median : 75.0 Median :15.00 Median : 2.000
Mean :11.5 Mean : 72.5 Mean :15.87 Mean : 2.733
3rd Qu.:16.0 3rd Qu.: 83.0 3rd Qu.:20.50 3rd Qu.: 3.000
Max. :34.0 Max. :100.0 Max. :56.50 Max. :26.000
is_holiday is_weekend season
Min. :0.00000 Min. :0.0000 Min. :0.000
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000
Median :0.00000 Median :0.0000 Median :1.000
Mean :0.02146 Mean :0.2872 Mean :1.495
3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :1.00000 Max. :1.0000 Max. :3.000
Let’s look at the temporal coverage: it looks like the data roughly covers a two-year time span.
dat %>%
summarize(min_date = min(timestamp),
max_date = max(timestamp))
Visualize the distribution for each variable.
First, let’s look at the numerical variables.
train_data %>%
select_if(is.numeric) %>%
gather(var,val) %>% # equivalent to pivot_longer
ggplot(aes(val,group=var)) +
geom_histogram(bins = 30) +
facet_wrap(~var,scales="free",ncol=2)
Most of the variables are continuous. Some appear to be dummy variables (is_weekend) or ordered categories (season). For season, we may want to convert the variable into a set of dummy variables (one per season), since no ordering is really implied (i.e., winter isn’t necessarily “worse” than fall).
Also, note that the weather_code variable is a bit of an enigma: there isn’t clear information online about what each code means, so we may want to drop it since we’re unsure how to interpret it.
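As a quick check before deciding, we can tabulate how these categorical-looking variables are distributed in the training data:
# Tabulate the categorical-looking predictors
train_data %>% count(season)
train_data %>% count(weather_code)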
It doesn’t appear that there is any missing data, so there is no need to impute any data.
sum(is.na(train_data)) # any missingness??
[1] 0
Finally, note that the outcome variable is a count, so we may want to log transform it so that it’s less skewed.
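For instance, a quick side-by-side of the raw and logged counts (offsetting by 1 to handle zeros, and using a temporary log_cnt column just for plotting) shows how much the transformation reduces the skew:
# Compare the raw count distribution to the logged version
train_data %>%
  mutate(log_cnt = log(cnt + 1)) %>%
  select(cnt, log_cnt) %>%
  gather(var, val) %>%
  ggplot(aes(val)) +
  geom_histogram(bins = 30) +
  facet_wrap(~var, scales = "free")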
What do we need to do?
# First, turn the season variable into a categorical variable
clean_steps = function(.data){
.data %>%
mutate(season = as.factor(season)) %>%
select(-timestamp,-weather_code)
}
# Generate our recipe to preprocess the data
rcp <-
recipe(cnt~.,data = train_data %>% clean_steps) %>%
step_dummy(all_nominal(),-cnt) %>%
# log the dependent variable so it's more normal, offset the count by 1 for instances when cnt==0
step_log(cnt,offset=1) %>%
step_range(all_numeric(),-cnt) %>% # Normalize scale
prep()
# Apply the recipe to the training and test data
train_data2 <- bake(rcp,train_data %>% clean_steps)
test_data2 <- bake(rcp,test_data%>% clean_steps) # Need to transform the seasons data here as well.
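If you want to double-check which preprocessing steps the recipe applies (and in what order), the recipes package can list them:
# List the steps contained in the prepped recipe
tidy(rcp)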
Let’s examine the post-processed data.
train_data2 %>% glimpse()
Rows: 13,932
Columns: 10
$ t1 <dbl> 0.12676056, 0.12676056, 0.11267606, 0.09859155, 0.098591…
$ t2 <dbl> 0.2000, 0.2125, 0.2125, 0.2000, 0.2000, 0.1250, 0.1250, …
$ hum <dbl> 0.9119497, 0.9119497, 0.9559748, 1.0000000, 0.9119497, 1…
$ wind_speed <dbl> 0.10619469, 0.08849558, 0.00000000, 0.00000000, 0.070796…
$ is_holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ is_weekend <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ cnt <dbl> 5.209486, 4.934474, 4.905275, 4.290459, 3.850148, 3.9512…
$ season_X1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ season_X2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ season_X3 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
train_data2 %>%
select(cnt,hum,wind_speed,t1,t2) %>%
gather(var,val) %>%
ggplot(aes(val,group=var)) +
geom_histogram(bins = 30) +
facet_wrap(~var,scales="free",ncol=3)
When comparing different machine learning models, we want to make sure we’re making a fair and equal comparison. One way that random chance can sneak into our assessments of model fit is through cross-validation. We want to make sure that we’re cross-validating the data on the exact same data partitions.
caret makes this easy to do. Let’s use the k-fold cross-validation method with 10 folds.
set.seed(1988) # set a seed for replication purposes
folds <- createFolds(train_data2$cnt, k = 10) # Partition the data into 10 equal folds
sapply(folds,length)
Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
1393 1393 1393 1393 1394 1393 1393 1393 1394 1393
Now, let’s use the trainControl() function from caret to set up our validation conditions.
control_conditions <-
trainControl(method='cv', # K-fold cross validation
index = folds # The indices for our folds (so they are always the same)
)
We’ll now use this same cross-validation object for everything that we do.
Let’s explore the different models that we covered in the lecture using the same package framework. As we saw last time in class, caret
facilitates this task nicely.
mod_lm <-
train(cnt ~ ., # Equation (outcome and everything else)
data=train_data2, # Training data
method = "lm", # linear model
metric = "RMSE", # mean squared error
trControl = control_conditions # Cross validation conditions
)
Let’s look at the model’s performance. What do we see?
mod_lm
Linear Regression
13932 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1393, 1393, 1393, 1393, 1394, 1393, ...
Resampling results:
RMSE Rsquared MAE
1.086514 0.2836169 0.8490306
Tuning parameter 'intercept' was held constant at a value of TRUE
Remember: these algorithms are computationally demanding, so they can take a little time to run depending on the power of your machine.
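If your machine has multiple cores, you can also tell doMC exactly how many to use when registering the parallel backend (the number below is just an example; adjust it for your setup):
# Register a specific number of cores for parallel model training
doMC::registerDoMC(cores = 4)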
mod_knn <-
train(cnt ~ ., # Equation (outcome and everything else)
data=train_data2, # Training data
method = "knn", # K-Nearest Neighbors Algorithm
metric = "RMSE", # mean squared error
trControl = control_conditions # Cross validation conditions
)
Let’s look at the performance plots. What do we see?
mod_knn
k-Nearest Neighbors
13932 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1393, 1393, 1393, 1393, 1394, 1393, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 1.156543 0.2201065 0.8775010
7 1.133500 0.2357252 0.8641692
9 1.121732 0.2448883 0.8582290
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 9.
caret has default settings that automatically explore different tuning parameters for the model.
plot(mod_knn)
# Draw out the best model (the one with the best performance)
mod_knn$finalModel
9-nearest neighbor regression model
Now, let’s say we wanted to adjust the tuning parameters ourselves. We can explore different model tunings by using the tuneGrid argument.
# Different values of the tuning parameter that I want to try.
knn_tune = expand.grid(k = c(1,3,10,50))
mod_knn <-
train(cnt ~ ., # Equation (outcome and everything else)
data=train_data2, # Training data
method = "knn", # K-Nearest Neighbors Algorithm
metric = "RMSE", # mean squared error
trControl = control_conditions, # Cross validation conditions
tuneGrid = knn_tune # Vary the tuning parameter K
)
Looking at performance: it doesn’t look like we get much of a boost from increasing the number of neighbors all the way to K = 50.
plot(mod_knn)
mod_cart <-
train(cnt ~ ., # Equation (outcome and everything else)
data=train_data2, # Training data
method = "rpart", # Regression tree
metric = "RMSE", # mean squared error
trControl = control_conditions # Cross validation conditions
)
There were missing values in resampled performance measures.
mod_cart
CART
13932 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1393, 1393, 1393, 1393, 1394, 1393, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.01485955 1.132224 0.2231094 0.8932458
0.03063670 1.145529 0.2040795 0.9084821
0.18437882 1.224750 0.1756298 0.9938378
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.01485955.
Note that the main tuning parameter here is the complexity parameter (cp): this reflects how deep we let the tree grow (a smaller cp allows a more complex tree). In this default grid, the smallest cp value performed best, and the simpler trees (larger cp) did worse on out-of-sample data.
plot(mod_cart)
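If you want the numbers behind this plot, you can inspect the cross-validated results table from caret and the complexity table stored on the underlying rpart model:
mod_cart$results # Cross-validated performance at each cp value tried
mod_cart$finalModel$cptable # rpart's own complexity table for the final tree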
Let’s visualize the tree we just grew. It looks like the humidity level is an important factor when predicting the number of riders.
# Plot the final tree (with the default cp it stays fairly shallow)
fancyRpartPlot(mod_cart$finalModel)
We can also print the decision tree as text to inspect the splits directly.
print(mod_cart$finalModel)
n= 13932
node), split, n, deviance, yval
* denotes terminal node
1) root 13932 22943.490 6.432077
2) hum>=0.581761 9595 15936.980 6.061609
4) hum>=0.7578616 4792 7853.797 5.790636 *
5) hum< 0.7578616 4803 7380.268 6.331962 *
3) hum< 0.581761 4337 2776.215 7.251685 *
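To back up the observation about humidity, we can also ask caret for variable importance scores from the fitted tree; this is a quick sanity check rather than a formal analysis.
# Variable importance for the CART model (higher values = more influential predictors)
varImp(mod_cart)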
Let’s free the CART to grow deeper.
tune_cart <- expand.grid(cp = c(0.0010281)) # Complexity Parameter (how "deep" our trees should grow)
mod_cart2 <-
train(cnt ~ ., # Equation (outcome and everything else)
data=train_data2, # Training data
method = "rpart", # Classification Tree
metric = "RMSE", # mean squared error
tuneGrid = tune_cart, # Tuning parameters
trControl = control_conditions
)
It doesn’t do any better on out-of-sample data (in fact, it's slightly worse than the shallower tree).
print(mod_cart2)
CART
13932 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1393, 1393, 1393, 1393, 1394, 1393, ...
Resampling results:
RMSE Rsquared MAE
1.183767 0.2028586 0.9005489
Tuning parameter 'cp' was held constant at a value of 0.0010281
Let’s look at the tree itself: much deeper!
print(mod_cart2$finalModel)
n= 13932
node), split, n, deviance, yval
* denotes terminal node
1) root 13932 22943.49000 6.432077
2) hum>=0.581761 9595 15936.98000 6.061609
4) hum>=0.7578616 4792 7853.79700 5.790636
8) t1< 0.2605634 1448 2597.67200 5.436264
16) t1< 0.07746479 121 252.09270 4.926297 *
17) t1>=0.07746479 1327 2311.24200 5.482765
34) season_X3< 0.5 567 961.29990 5.246741 *
35) season_X3>=0.5 760 1294.79100 5.658851 *
9) t1>=0.2605634 3344 4995.54700 5.944084
18) hum>=0.827044 1763 2698.72400 5.825637
36) hum>=0.9213836 426 715.51500 5.624051 *
37) hum< 0.9213836 1337 1960.38100 5.889867
74) t2< 0.55625 1215 1791.01900 5.846821
148) season_X3< 0.5 876 1310.68500 5.760265
296) season_X2< 0.5 382 495.01600 5.521033 *
297) season_X2>=0.5 494 776.90090 5.945258 *
149) season_X3>=0.5 339 456.81180 6.070490
298) wind_speed>=0.2743363 155 215.12410 5.741844 *
299) wind_speed< 0.2743363 184 210.84380 6.347338 *
75) t2>=0.55625 122 144.68960 6.318561 *
19) hum< 0.827044 1581 2244.50700 6.076167
38) t1< 0.4014085 741 1101.11800 5.941280 *
39) t1>=0.4014085 840 1118.01300 6.195157
78) season_X1>=0.5 377 476.53400 5.940694
156) t1< 0.471831 104 109.92070 5.448241 *
157) t1>=0.471831 273 331.78420 6.128295 *
79) season_X1< 0.5 463 597.19050 6.402355 *
5) hum< 0.7578616 4803 7380.26800 6.331962
10) t2< 0.43125 2115 3563.82200 6.076456
20) hum>=0.6823899 1059 1876.21200 5.886286
40) t2< 0.20625 280 552.97330 5.632387 *
41) t2>=0.20625 779 1298.70100 5.977546 *
21) hum< 0.6823899 1056 1610.90400 6.267166
42) season_X1>=0.5 18 23.61613 5.115742 *
43) season_X1< 0.5 1038 1563.01000 6.287133
86) is_weekend>=0.5 287 273.05880 6.046075 *
87) is_weekend< 0.5 751 1266.90100 6.379255
174) season_X2< 0.5 647 1131.67100 6.319931
348) season_X3< 0.5 296 604.93370 6.078264 *
349) season_X3>=0.5 351 494.87190 6.523729 *
175) season_X2>=0.5 104 118.78640 6.748322 *
11) t2>=0.43125 2688 3569.73200 6.533001
22) t1< 0.556338 2234 3067.06200 6.450509
44) season_X1>=0.5 736 1095.98800 6.115332
88) t2< 0.55625 393 577.46810 5.946407 *
89) t2>=0.55625 343 494.45650 6.308881 *
45) season_X1< 0.5 1498 1847.76300 6.615189
90) t1< 0.471831 1207 1563.04500 6.534331 *
91) t1>=0.471831 291 244.09550 6.950569 *
23) t1>=0.556338 454 412.66180 6.938922
46) hum>=0.6823899 217 233.97940 6.659410
92) wind_speed< 0.1371681 32 38.65078 5.617526 *
93) wind_speed>=0.1371681 185 154.58330 6.839628 *
47) hum< 0.6823899 237 146.20610 7.194846 *
3) hum< 0.581761 4337 2776.21500 7.251685
6) t1< 0.556338 2821 2045.30300 7.046150
12) hum>=0.4685535 1576 1499.99300 6.835304
24) season_X2< 0.5 1206 1232.08800 6.737609
48) wind_speed< 0.2876106 524 648.99110 6.551511 *
49) wind_speed>=0.2876106 682 551.00610 6.880593
98) t2< 0.18125 63 52.57687 6.209068 *
99) t2>=0.18125 619 467.12820 6.948939 *
25) season_X2>=0.5 370 218.87730 7.153737 *
13) hum< 0.4685535 1245 386.55590 7.313053
26) t1< 0.3873239 489 153.89480 7.100934 *
27) t1>=0.3873239 756 196.42720 7.450257 *
7) t1>=0.556338 1516 389.98270 7.634149
14) hum>=0.3490566 960 273.05850 7.518575 *
15) hum< 0.3490566 556 81.96064 7.833701 *
mod_rf <-
train(cnt ~ ., # Equation (outcome and everything else)
data=train_data2, # Training data
method = "ranger", # random forest (ranger is much faster than rf)
metric = "RMSE", # mean squared error
trControl = control_conditions
)
There are three tuning parameters in this model:
mtry: the number of predictors that we’ll randomly select at each split.
splitrule: the way we determine how the nodes should be split:
variance: take the split that maximizes the reduction in variance (i.e., minimizes the error).
extratrees: don’t bag; randomly select variables and random split points, then accept the best split (short for “Extremely Randomized Trees”).
min.node.size: the minimum number of observations in a terminal node (held constant at 5 by default).
mod_rf
Random Forest
13932 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1393, 1393, 1393, 1393, 1394, 1393, ...
Resampling results across tuning parameters:
mtry splitrule RMSE Rsquared MAE
2 variance 1.081815 0.2902953 0.8334833
2 extratrees 1.091484 0.2931923 0.8514838
5 variance 1.113119 0.2601100 0.8485382
5 extratrees 1.093944 0.2777063 0.8331818
9 variance 1.122639 0.2519127 0.8557224
9 extratrees 1.105312 0.2678914 0.8398917
Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 2, splitrule = variance
and min.node.size = 5.
plot(mod_rf)
mod_rf$bestTune
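As with KNN, we could supply our own grid rather than relying on caret’s defaults. A sketch with illustrative values (a grid for ranger must include all three parameters, and the object names below are just for this example):
# Hypothetical tuning grid for ranger: vary mtry and the minimum node size
rf_tune <- expand.grid(mtry = c(2, 3, 5),
                       splitrule = "variance",
                       min.node.size = c(5, 10))
mod_rf_tuned <-
  train(cnt ~ ., # Equation (outcome and everything else)
        data = train_data2, # Training data
        method = "ranger", # Random forest
        metric = "RMSE", # root mean squared error
        tuneGrid = rf_tune, # Our custom grid
        trControl = control_conditions)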
How did the different methods perform? Which one did the best?
# Organize all of the model objects in a list.
mod_list <-
list(
lm = mod_lm,
knn = mod_knn,
cart = mod_cart,
cart_deep = mod_cart2,
rf = mod_rf
)
# Resamples allows us to compare model output
resamples(mod_list)
Call:
resamples.default(x = mod_list)
Models: lm, knn, cart, cart_deep, rf
Number of resamples: 10
Performance metrics: MAE, RMSE, Rsquared
Time estimates for: everything, final model fit
Examine the error across each of the models.
dotplot(resamples(mod_list),metric = "RMSE")
Examine the fit across each of the models.
dotplot(resamples(mod_list),metric = "Rsquared")
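For a numeric view to complement the dot plots, summary() on the resamples object reports how each metric is distributed across the 10 folds.
# Numeric summary of MAE, RMSE, and R-squared across the folds
summary(resamples(mod_list))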
Of the models we ran, it looks like the Random Forest model did the best. Let’s now see how well it does on predicting the test data.
pred <- predict(mod_rf,newdata = test_data2)
mse = sum((test_data2$cnt-pred)^2)/nrow(test_data2)
rmse_score = sqrt(mse)
rmse_score
[1] 1.024161
We can also use the yardstick
package to calculate performance metrics.
# Generate a prediction on our test data.
pred <- predict(mod_rf,newdata = test_data2)
# Organize as a data frame
performance = tibble(truth=test_data2$cnt,estimate = pred)
# Calculate performance metrics
bind_rows(
performance %>% rmse(truth,estimate), # Root Mean Squared Error
performance %>% rsq(truth,estimate) # R Squared
)
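One caveat: because the recipe log-transformed cnt (with an offset of 1), these predictions are on the log scale. A minimal sketch of converting them back to rider counts for reporting (pred_counts is just an illustrative name):
# Back-transform predictions from log(cnt + 1) to the original count scale
pred_counts <- exp(pred) - 1
summary(pred_counts)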