Overview

In this notebook, we’ll apply the machine learning concepts covered in the supervised learning lecture on regression methods. Many libraries implement the methods we reviewed in class, but here we’ll focus on using the caret package.

Dependencies

require(tidyverse)
require(caret) # for machine learning
require(recipes) # For preprocessing your data
require(rsample) # for train test splits
require(rattle) # For nice tree plots
require(yardstick) # for performance metrics

# For parallelization (to run the models across many cores) -- speeds up computation!
# install.packages("doMC")
doMC::registerDoMC()
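
If you want to control the number of worker processes rather than rely on the default, registerDoMC() also accepts an explicit core count. A minimal sketch (reserving one core is just an illustrative choice):

# Optional: register an explicit number of cores instead of the default.
# doMC::registerDoMC(cores = parallel::detectCores() - 1)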

Data

For this exercise, let’s examine some historical data of bike sharing in London downloaded from Kaggle. The data contains information on rider usage given a common set of climate and time predictors.

Assume we work for the company that runs the bike sharing venture: our aim is to build a model that best predicts the number of riders (cnt) at any given time so that we can distribute resources appropriately. Please see the Kaggle site for a description of the variables: https://www.kaggle.com/hmavrodiev/london-bike-sharing-dataset#london_merged.csv

dat = read_csv("london_bike_sharing.csv")
Parsed with column specification:
cols(
  timestamp = col_datetime(format = ""),
  cnt = col_double(),
  t1 = col_double(),
  t2 = col_double(),
  hum = col_double(),
  wind_speed = col_double(),
  weather_code = col_double(),
  is_holiday = col_double(),
  is_weekend = col_double(),
  season = col_double()
)
glimpse(dat)
Rows: 17,414
Columns: 10
$ timestamp    <dttm> 2015-01-04 00:00:00, 2015-01-04 01:00:00, 2015-01-04 …
$ cnt          <dbl> 182, 138, 134, 72, 47, 46, 51, 75, 131, 301, 528, 727,…
$ t1           <dbl> 3.0, 3.0, 2.5, 2.0, 2.0, 2.0, 1.0, 1.0, 1.5, 2.0, 3.0,…
$ t2           <dbl> 2.0, 2.5, 2.5, 2.0, 0.0, 2.0, -1.0, -1.0, -1.0, -0.5, …
$ hum          <dbl> 93.0, 93.0, 96.5, 100.0, 93.0, 93.0, 100.0, 100.0, 96.…
$ wind_speed   <dbl> 6.0, 5.0, 0.0, 0.0, 6.5, 4.0, 7.0, 7.0, 8.0, 9.0, 12.0…
$ weather_code <dbl> 3, 1, 1, 1, 1, 1, 4, 4, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, …
$ is_holiday   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ is_weekend   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ season       <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …

Split the Sample: Training and Test Data

Before even looking at the data, let’s split the sample into a training and a test dataset. We’ll completely hold off on viewing the test data, so as not to bias our development of the learning model.

set.seed(123)
splits = initial_split(dat,prop = .8)
train_data = training(splits) # Use 80% of the data as training data 
test_data = testing(splits) # holdout 20% as test data 

dim(train_data)
[1] 13932    10
dim(test_data) 
[1] 3482   10
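
As a side note, initial_split() can also stratify the split on the outcome so that the training and test sets cover a similar spread of ride counts (numeric outcomes are binned into quantiles for stratification). A minimal sketch, not used in the rest of this notebook:

# Optional alternative: a split stratified on the outcome.
set.seed(123)
splits_strat = initial_split(dat, prop = .8, strata = cnt)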

Examine the Data

summary(train_data)
   timestamp                        cnt               t1       
 Min.   :2015-01-04 00:00:00   Min.   :   0.0   Min.   :-1.50  
 1st Qu.:2015-07-03 14:45:00   1st Qu.: 252.8   1st Qu.: 8.00  
 Median :2016-01-02 10:30:00   Median : 840.0   Median :12.50  
 Mean   :2016-01-03 02:55:35   Mean   :1136.1   Mean   :12.45  
 3rd Qu.:2016-07-04 07:45:00   3rd Qu.:1657.0   3rd Qu.:16.00  
 Max.   :2017-01-03 23:00:00   Max.   :7860.0   Max.   :34.00  
       t2            hum          wind_speed     weather_code   
 Min.   :-6.0   Min.   : 20.5   Min.   : 0.00   Min.   : 1.000  
 1st Qu.: 6.0   1st Qu.: 63.0   1st Qu.:10.00   1st Qu.: 1.000  
 Median :12.5   Median : 75.0   Median :15.00   Median : 2.000  
 Mean   :11.5   Mean   : 72.5   Mean   :15.87   Mean   : 2.733  
 3rd Qu.:16.0   3rd Qu.: 83.0   3rd Qu.:20.50   3rd Qu.: 3.000  
 Max.   :34.0   Max.   :100.0   Max.   :56.50   Max.   :26.000  
   is_holiday        is_weekend         season     
 Min.   :0.00000   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000  
 Median :0.00000   Median :0.0000   Median :1.000  
 Mean   :0.02146   Mean   :0.2872   Mean   :1.495  
 3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:3.000  
 Max.   :1.00000   Max.   :1.0000   Max.   :3.000  
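
Note that weather_code, season, is_holiday, and is_weekend are stored as numeric even though they encode categories. One way to handle this during preprocessing is with a recipes step; the recipe below is only an illustrative sketch and is not applied to anything yet:

# Illustrative only: flag the coded columns as factors so that downstream
# models treat them as categorical rather than continuous predictors.
rec = recipe(cnt ~ ., data = train_data) %>% 
  step_mutate(weather_code = factor(weather_code),
              season = factor(season),
              is_holiday = factor(is_holiday),
              is_weekend = factor(is_weekend))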

Let’s look at the temporal coverage: the data spans roughly two years.

train_data %>% 
  summarize(min_date = min(timestamp),
            max_date = max(timestamp))

Visualize the distribution for each variable.

First, let’s look at the numerical variables.

train_data %>% 
  select(where(is.numeric)) %>% 
  pivot_longer(everything(), names_to = "var", values_to = "val") %>% # long format for faceting
  ggplot(aes(val, group = var)) +
  geom_histogram(bins = 30) +
  facet_wrap(~var, scales = "free", ncol = 2)
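
The coded variables (weather_code, season, is_holiday, is_weekend) are easier to read as counts per level than as histograms; a quick sketch along the same lines:

# Bar charts for the categorically coded variables.
train_data %>% 
  select(weather_code, season, is_holiday, is_weekend) %>% 
  pivot_longer(everything(), names_to = "var", values_to = "val") %>% 
  ggplot(aes(factor(val))) +
  geom_bar() +
  facet_wrap(~var, scales = "free", ncol = 2) +
  labs(x = "level")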