caret
In this notebook, we’ll apply the machine learning concepts covered in the supervised learning lecture on regression methods. Many libraries implement the methods we reviewed in class, but here we’ll focus on using the caret package to perform these operations.
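As a quick orientation, everything below centers on caret’s train() function, which pairs a model specification with a resampling scheme from trainControl(). A minimal sketch of that pattern, shown for orientation only (we load the packages and create train_data below):
ctrl = trainControl(method = "cv", number = 5) # 5-fold cross-validation
fit = train(cnt ~ ., # model cnt as a function of all other predictors
            data = train_data,
            method = "lm", # swap in any caret-supported model type
            trControl = ctrl)
fit # prints cross-validated RMSE, R-squared, and MAE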
require(tidyverse) # for data wrangling and plotting
require(caret) # for machine learning
require(recipes) # For preprocessing your data
require(rsample) # for train test splits
require(rattle) # For nice tree plots
require(yardstick) # for performance metrics
# For parallelization (to run the models across many cores) -- speeds up computation!
# install.packages("doMC")
doMC::registerDoMC()
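To confirm that the parallel backend registered (registerDoMC() with no arguments typically uses about half the detected cores), foreach can report the worker count:
# doMC::registerDoMC(cores = 4) # optionally request an explicit core count
foreach::getDoParWorkers() # number of parallel workers caret will use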
For this exercise, let’s examine some historical data of bike sharing in London downloaded from Kaggle. The data contains information on rider usage given a common set of climate and time predictors.
Assume we work for the company that runs the bike sharing venture: our aim is to build a model that best predicts the number of riders (cnt) at any given moment so that we can distribute resources appropriately. Please see the Kaggle site for a description of the variables: https://www.kaggle.com/hmavrodiev/london-bike-sharing-dataset#london_merged.csv
dat = read_csv("london_bike_sharing.csv")
Parsed with column specification:
cols(
timestamp = col_datetime(format = ""),
cnt = col_double(),
t1 = col_double(),
t2 = col_double(),
hum = col_double(),
wind_speed = col_double(),
weather_code = col_double(),
is_holiday = col_double(),
is_weekend = col_double(),
season = col_double()
)
glimpse(dat)
Rows: 17,414
Columns: 10
$ timestamp <dttm> 2015-01-04 00:00:00, 2015-01-04 01:00:00, 2015-01-04 …
$ cnt <dbl> 182, 138, 134, 72, 47, 46, 51, 75, 131, 301, 528, 727,…
$ t1 <dbl> 3.0, 3.0, 2.5, 2.0, 2.0, 2.0, 1.0, 1.0, 1.5, 2.0, 3.0,…
$ t2 <dbl> 2.0, 2.5, 2.5, 2.0, 0.0, 2.0, -1.0, -1.0, -1.0, -0.5, …
$ hum <dbl> 93.0, 93.0, 96.5, 100.0, 93.0, 93.0, 100.0, 100.0, 96.…
$ wind_speed <dbl> 6.0, 5.0, 0.0, 0.0, 6.5, 4.0, 7.0, 7.0, 8.0, 9.0, 12.0…
$ weather_code <dbl> 3, 1, 1, 1, 1, 1, 4, 4, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, …
$ is_holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ is_weekend <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ season <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
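Note from the parse spec and glimpse() output that weather_code, is_holiday, is_weekend, and season were all read in as doubles even though they encode categories. We leave them numeric for now (the summaries below reflect that), but if a model warrants it, the conversion is a one-liner; a sketch:
# Not run here -- all output below assumes these columns stay numeric.
dat = dat %>%
  mutate(across(c(weather_code, is_holiday, is_weekend, season), as.factor))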
Before even looking at the data, let’s split the sample into a training and a test dataset. We’ll completely hold off on viewing the test data so as not to bias our development of the learning model.
set.seed(123)
splits = initial_split(dat, prop = .8)
train_data = training(splits) # Use 80% of the data as training data
test_data = testing(splits) # holdout 20% as test data
dim(train_data)
[1] 13932 10
dim(test_data)
[1] 3482 10
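One caveat: these are hourly, time-ordered observations, so a purely random split places training hours immediately adjacent to test hours. We proceed with the random split for simplicity, but rsample also offers a chronological alternative; a sketch:
# initial_time_split() assigns the first prop fraction of rows to training,
# so the test set is strictly later in time (rows must be sorted by timestamp).
time_splits = initial_time_split(dat %>% arrange(timestamp), prop = .8)
train_time = training(time_splits)
test_time = testing(time_splits)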
summary(train_data)
timestamp cnt t1
Min. :2015-01-04 00:00:00 Min. : 0.0 Min. :-1.50
1st Qu.:2015-07-03 14:45:00 1st Qu.: 252.8 1st Qu.: 8.00
Median :2016-01-02 10:30:00 Median : 840.0 Median :12.50
Mean :2016-01-03 02:55:35 Mean :1136.1 Mean :12.45
3rd Qu.:2016-07-04 07:45:00 3rd Qu.:1657.0 3rd Qu.:16.00
Max. :2017-01-03 23:00:00 Max. :7860.0 Max. :34.00
t2 hum wind_speed weather_code
Min. :-6.0 Min. : 20.5 Min. : 0.00 Min. : 1.000
1st Qu.: 6.0 1st Qu.: 63.0 1st Qu.:10.00 1st Qu.: 1.000
Median :12.5 Median : 75.0 Median :15.00 Median : 2.000
Mean :11.5 Mean : 72.5 Mean :15.87 Mean : 2.733
3rd Qu.:16.0 3rd Qu.: 83.0 3rd Qu.:20.50 3rd Qu.: 3.000
Max. :34.0 Max. :100.0 Max. :56.50 Max. :26.000
is_holiday is_weekend season
Min. :0.00000 Min. :0.0000 Min. :0.000
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000
Median :0.00000 Median :0.0000 Median :1.000
Mean :0.02146 Mean :0.2872 Mean :1.495
3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :1.00000 Max. :1.0000 Max. :3.000
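The summary shows no NA entries, but an explicit check is cheap; a quick sketch with dplyr:
train_data %>%
  summarise(across(everything(), ~ sum(is.na(.)))) # missing values per column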
Let’s look at the temporal coverage: it looks like the data covers roughly a two-year span.
train_data %>% # use the training data only, per the holdout discipline above
  summarize(min_date = min(timestamp),
            max_date = max(timestamp))
Let’s visualize the distribution of each variable, starting with the numerical ones.
train_data %>%
  select_if(is.numeric) %>%
  gather(var, val) %>% # equivalent to pivot_longer
  ggplot(aes(val, group = var)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ var, scales = "free", ncol = 2)
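The summary above (mean well above the median, with a long right tail out to 7,860) suggests cnt is right-skewed, which its histogram should confirm. If that proves troublesome for a linear model, a log-type transform is a common remedy; a quick look on a log1p scale:
train_data %>%
  ggplot(aes(log1p(cnt))) + # log(1 + cnt) handles the zero counts in cnt
  geom_histogram(bins = 30)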