Tutorial on tidymodels for Machine Learning

caret is a well-known R package for machine learning that covers almost everything from data pre-processing to cross-validation. Its unofficial successor is tidymodels, which takes a modular approach: specific, smaller packages are designed to work hand in hand. Thus, tidymodels is to modeling what the tidyverse is to data wrangling. Herein, I will walk through a machine learning example from start to finish and explain how to use the appropriate tidymodels packages at each step.

Set Up

Loading the tidymodels package loads a bunch of packages for modeling and also a few others from the tidyverse like ggplot2 and dplyr. Furthermore, we’ll need tune and workflows, which are also part of the tidymodels ecosystem but are not attached by library("tidymodels").

library("conflicted")
library("tidymodels")
#> -- Attaching packages ------------------------------------------------ tidymodels 0.0.3 --
#> v broom     0.5.4     v purrr     0.3.3
#> v dials     0.0.4     v recipes   0.1.9
#> v dplyr     0.8.3     v rsample   0.0.5
#> v ggplot2   3.2.1     v tibble    2.1.3
#> v infer     0.5.1     v yardstick 0.0.5
#> v parsnip   0.0.5
library("tune")
library("workflows")

# Additional packages for dataviz etc.
library("ggrepel")     # for geom_label_repel()
library("corrplot")    # for corrplot()
#> corrplot 0.84 loaded

conflict_prefer("filter", "dplyr")
#> [conflicted] Will prefer dplyr::filter over any other package

ggplot2::theme_set(theme_light())

Data Set: Diamonds

With ggplot2 comes the diamonds data set, which has information on the size and quality of diamonds. Herein, we’ll use these features to predict the price of a diamond.

data("diamonds")
diamonds %>%
    sample_n(2000) %>% 
    mutate_if(is.factor, as.numeric) %>%
    select(price, everything()) %>%
    cor %>%
    {.[order(abs(.[, 1]), decreasing = TRUE), 
       order(abs(.[, 1]), decreasing = TRUE)]} %>%
    corrplot(method = "number", type = "upper", mar = c(0, 0, 1.5, 0),
             title = "Correlations between price and various features of diamonds")

Separating Testing and Training Data: rsample

First of all, we want to set aside a data set for testing the predictions at the very end. We'll use only a small proportion of the data for training (just to speed things up a little). Furthermore, the training data set will be prepared for 3-fold cross-validation (again, using only three folds to keep things fast). All of this is accomplished using the rsample package:

set.seed(1243)

dia_split <- initial_split(diamonds, prop = .1, strata = price)

dia_train <- training(dia_split)
dia_test  <- testing(dia_split)

dim(dia_train)
#> [1] 5395   10
dim(dia_test)
#> [1] 48545    10

dia_vfold <- vfold_cv(dia_train, v = 3, repeats = 1, strata = price)
dia_vfold %>% 
  mutate(df_ana = map(splits, analysis),
         df_ass = map(splits, assessment))
#> #  3-fold cross-validation using stratification 
#> # A tibble: 3 x 4
#>   splits              id    df_ana                df_ass               
#> * <named list>        <chr> <named list>          <named list>         
#> 1 <split [3.6K/1.8K]> Fold1 <tibble [3,596 x 10]> <tibble [1,799 x 10]>
#> 2 <split [3.6K/1.8K]> Fold2 <tibble [3,596 x 10]> <tibble [1,799 x 10]>
#> 3 <split [3.6K/1.8K]> Fold3 <tibble [3,598 x 10]> <tibble [1,797 x 10]>

Data Pre-Processing and Feature Engineering: recipes

The recipes package can be used to prepare a data set (for modeling) using different step_*() functions. For example, the plot below indicates that there may be a nonlinear relationship between price and carat, and I want to address that using higher-order terms.

qplot(carat, price, data = dia_train) +
    scale_y_continuous(trans = log_trans(), labels = function(x) round(x, -2)) +
    geom_smooth(method = "lm", formula = "y ~ poly(x, 4)") +
    labs(title = "Nonlinear relationship between price and carat of diamonds",
         subtitle = "The degree of the polynomial is a potential tuning parameter")

The recipe() takes a formula and a data set, and then the different steps are added using the appropriate step_*() functions. The recipes package comes with a ton of useful step functions (see, e.g., vignette("Simple_Example", package = "recipes")).

Herein, I want to log-transform price (step_log()), center and scale all numeric predictors (step_normalize()), and dummy code the categorical predictors (step_dummy()). Furthermore, a quadratic effect of carat is added using step_poly().

dia_rec <-
    recipe(price ~ ., data = dia_train) %>%
    step_log(all_outcomes()) %>%
    step_normalize(all_predictors(), -all_nominal()) %>%
    step_dummy(all_nominal()) %>%
    step_poly(carat, degree = 2) %>% 
    prep() 

dia_rec
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          9
#> 
#> Training data contained 5395 data points and no missing data.
#> 
#> Operations:
#> 
#> Log transformation on price [trained]
#> Centering and scaling for carat, depth, table, x, y, z [trained]
#> Dummy variables from cut, color, clarity [trained]
#> Orthogonal polynomials on carat [trained]

Calling prep() on a recipe applies all the steps to the training data. You can then call juice() to extract the transformed training data, or call bake() to apply the trained recipe to a new data set.

# Note the linear and quadratic term for carat and the dummies for e.g. color
dia_juiced <- juice(dia_rec)
dim(dia_juiced)
#> [1] 5395   25
names(dia_juiced)
#>  [1] "depth"        "table"        "x"            "y"            "z"           
#>  [6] "price"        "cut_1"        "cut_2"        "cut_3"        "cut_4"       
#> [11] "color_1"      "color_2"      "color_3"      "color_4"      "color_5"     
#> [16] "color_6"      "clarity_1"    "clarity_2"    "clarity_3"    "clarity_4"   
#> [21] "clarity_5"    "clarity_6"    "clarity_7"    "carat_poly_1" "carat_poly_2"

Defining and Fitting Models: parsnip

The parsnip package has wrappers around many[1] popular machine learning algorithms, and you can fit them using a unified interface. This is extremely helpful, since you have to remember only one interface rather than dozens.

The models are separated into two modes/categories, namely, regression and classification (set_mode()). The model is defined using a function specific to each algorithm (e.g., linear_reg(), rand_forest()). Finally, the backend/engine/implementation is selected using set_engine().

Herein, I will start with a basic linear regression model as implemented in stats::lm().

lm_model <-
    linear_reg() %>%
    set_mode("regression") %>%
    set_engine("lm")

Furthermore, take the example of a random forest model, which could be fit using either the ranger or the randomForest package. The two have different interfaces (e.g., argument ntree vs. num.trees), and parsnip removes the hassle of remembering both. Arguments that pertain to the algorithm in general are specified in the algorithm function (e.g., rand_forest()), whereas arguments specific to the engine are specified in set_engine().

rand_forest(mtry = 3, trees = 500, min_n = 5) %>%
    set_mode("regression") %>%
    set_engine("ranger", importance = "impurity_corrected")

Finally, we can fit() the model.

lm_fit1 <- fit(lm_model, price ~ ., dia_juiced)
lm_fit1
#> parsnip model object
#> 
#> Fit time:  20ms 
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#>  (Intercept)         depth         table             x             y  
#>     7.711965      0.010871      0.005889      0.251155      0.054422  
#>            z         cut_1         cut_2         cut_3         cut_4  
#>     0.054196      0.106701     -0.026356      0.024207     -0.006191  
#>      color_1       color_2       color_3       color_4       color_5  
#>    -0.455831     -0.084108     -0.004810      0.009725     -0.005591  
#>      color_6     clarity_1     clarity_2     clarity_3     clarity_4  
#>    -0.009730      0.860961     -0.242698      0.132234     -0.052903  
#>    clarity_5     clarity_6     clarity_7  carat_poly_1  carat_poly_2  
#>     0.028996      0.002403      0.022235     51.663971    -17.508316

You can use fit() either with a formula (e.g., price ~ .) or by specifying x and y directly. In both cases, I recommend keeping only the variables you need when preparing the data set: with price ~ ., a newly added predictor can never be forgotten by accident, whereas a formula like y ~ a + b + c silently ignores a new variable d. Unnecessary variables can easily be dropped in the recipe using step_rm().
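
As a quick sketch of the x/y interface via parsnip's fit_xy() (lm_fit1b is just an illustrative name, and the result should be equivalent to lm_fit1):

# Fit the same model by passing predictors and outcome separately
lm_fit1b <- fit_xy(lm_model,
                   x = select(dia_juiced, -price),
                   y = dia_juiced$price)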

Summarizing Fitted Models: broom

Many models have summary() or coef() methods. However, their output is usually not in a tidy format, and the broom package aims to resolve this issue.

glance() gives us information about the whole model. Here, R squared is pretty high and the RMSE equals 0.154.

glance(lm_fit1$fit)
#> # A tibble: 1 x 11
#>   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC deviance
#>       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl>  <dbl>  <dbl>    <dbl>
#> 1     0.977         0.977 0.154     9607.       0    25  2457. -4863. -4691.     127.
#> # ... with 1 more variable: df.residual <int>

tidy() gives us information about the model parameters, and we see that we have a significant quadratic effect of carat.

tidy(lm_fit1) %>% 
  arrange(desc(abs(statistic)))
#> # A tibble: 25 x 5
#>    term         estimate std.error statistic  p.value
#>    <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#>  1 (Intercept)    7.71     0.00431   1790.   0.      
#>  2 carat_poly_2 -17.5      0.259      -67.7  0.      
#>  3 clarity_1      0.861    0.0135      64.0  0.      
#>  4 color_1       -0.456    0.00738    -61.8  0.      
#>  5 carat_poly_1  51.7      1.10        47.0  0.      
#>  6 clarity_2     -0.243    0.0127     -19.2  3.63e-79
#>  7 color_2       -0.0841   0.00665    -12.7  3.50e-36
#>  8 clarity_3      0.132    0.0107      12.3  2.51e-34
#>  9 cut_1          0.107    0.00997     10.7  1.88e-26
#> 10 clarity_4     -0.0529   0.00839     -6.30 3.15e-10
#> # ... with 15 more rows

Finally, augment() can be used to get model predictions, residuals, etc.

lm_predicted <- augment(lm_fit1$fit, data = dia_juiced) %>% 
    rowid_to_column()
select(lm_predicted, rowid, price, .fitted:.std.resid)
#> # A tibble: 5,395 x 9
#>    rowid price .fitted .se.fit  .resid    .hat .sigma   .cooksd .std.resid
#>    <int> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
#>  1     1  5.83    5.90 0.0133  -0.0769 0.00744  0.154 0.0000755     -0.502
#>  2     2  5.84    5.80 0.0110   0.0472 0.00507  0.154 0.0000193      0.307
#>  3     3  5.86    5.89 0.0138  -0.0286 0.00801  0.154 0.0000112     -0.187
#>  4     4  5.88    6.24 0.00988 -0.360  0.00413  0.154 0.000910      -2.34 
#>  5     5  6.00    6.24 0.0104  -0.244  0.00458  0.154 0.000467      -1.59 
#>  6     6  6.00    6.06 0.0100  -0.0615 0.00427  0.154 0.0000275     -0.401
#>  7     7  6.00    6.10 0.0111  -0.0999 0.00520  0.154 0.0000887     -0.651
#>  8     8  6.00    6.06 0.0104  -0.0566 0.00459  0.154 0.0000251     -0.369
#>  9     9  6.00    6.22 0.00895 -0.216  0.00339  0.154 0.000270      -1.41 
#> 10    10  6.32    6.55 0.00872 -0.233  0.00321  0.154 0.000297      -1.52 
#> # ... with 5,385 more rows

A plot of the predicted vs. actual prices shows small residuals with a few outliers, which are not well explained by the model.

ggplot(lm_predicted, aes(.fitted, price)) +
    geom_point(alpha = .2) +
    ggrepel::geom_label_repel(aes(label = rowid), 
                              data = filter(lm_predicted, abs(.resid) > 2)) +
    labs(title = "Actual vs. Predicted Price of Diamonds")

Evaluating Model Performance: yardstick

We already saw the performance measures RMSE and R squared in the output of glance() above. The yardstick package is specifically designed for such measures, for both numeric and categorical outcomes, and it plays well with grouped data frames, which makes it easy to compute metrics across multiple resamples at once.
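
As a quick illustration before turning to cross-validation, the default metrics can be computed directly from the in-sample predictions stored in lm_predicted above (a minimal sketch; the RMSE should be close to the 0.154 reported by glance()):

# In-sample metrics for the linear model fitted above
metrics(lm_predicted, truth = price, estimate = .fitted)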

Let’s use rsample, parsnip, and yardstick for cross-validation to get a more accurate estimation of RMSE.

In the following pipeline, analysis() and assessment() are first used to extract the respective folds, and bake() applies the recipe steps to these data sets. Then, the model is fit() separately to the three analysis data sets, and the fitted models are used to predict() on the three corresponding assessment data sets (i.e., 3-fold cross-validation). (Note that one could have created dia_vfold from the prepped rather than the raw data in the first place, making these two bake() calls unnecessary.)

Herein, I use list columns to store all information about the three folds in one data frame and a combination of dplyr::mutate() and purrr::map() to “loop” across the rows of the data frame.

lm_fit2 <- dia_vfold %>% 
    mutate(
        df_ana = map(splits,  analysis),
        df_ana = map(df_ana, ~bake(dia_rec, .x)),
        df_ass = map(splits,  assessment),
        df_ass = map(df_ass, ~bake(dia_rec, .x)),
        model_fit  = map(df_ana, ~fit(lm_model, price ~ ., data = .x)),
        model_pred = map2(model_fit, df_ass, ~predict(.x, new_data = .y)))
lm_fit2
#> #  3-fold cross-validation using stratification 
#> # A tibble: 3 x 6
#>   splits         id    df_ana          df_ass         model_fit  model_pred     
#> * <named list>   <chr> <named list>    <named list>   <named li> <named list>   
#> 1 <split [3.6K/~ Fold1 <tibble [3,596~ <tibble [1,79~ <fit[+]>   <tibble [1,799~
#> 2 <split [3.6K/~ Fold2 <tibble [3,596~ <tibble [1,79~ <fit[+]>   <tibble [1,799~
#> 3 <split [3.6K/~ Fold3 <tibble [3,598~ <tibble [1,79~ <fit[+]>   <tibble [1,797~

Now, we can extract the actual prices from the assessment data and compare them to the predicted prices. Across the three folds, we see that the RMSE is a little higher and R squared a little smaller compared to above (see output of glance(lm_fit1$fit)). This is expected, since out-of-sample prediction is harder but also way more useful.

lm_fit2 %>% 
    mutate(res = map2(df_ass, model_pred, ~data.frame(price = .x$price,
                                                      .pred = .y$.pred))) %>% 
    select(id, res) %>% 
    unnest(res) %>% 
    group_by(id) %>% 
    metrics(truth = price, estimate = .pred)
#> # A tibble: 9 x 4
#>   id    .metric .estimator .estimate
#>   <chr> <chr>   <chr>          <dbl>
#> 1 Fold1 rmse    standard       0.168
#> 2 Fold2 rmse    standard       0.147
#> 3 Fold3 rmse    standard       0.298
#> 4 Fold1 rsq     standard       0.973
#> 5 Fold2 rsq     standard       0.979
#> 6 Fold3 rsq     standard       0.918
#> 7 Fold1 mae     standard       0.116
#> 8 Fold2 mae     standard       0.115
#> 9 Fold3 mae     standard       0.110

Note that yardstick::metrics() has default measures for numeric and categorical outcomes; here, RMSE, R squared, and the mean absolute error (MAE) are returned. You could also use a single metric directly, like rmse(), or define a custom set of metrics via metric_set().
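
For example, a custom metric set containing only RMSE and MAE could be defined and applied like this (a sketch using the in-sample predictions from above; my_metrics is just an illustrative name):

# Define and apply a custom metric set
my_metrics <- metric_set(rmse, mae)
my_metrics(lm_predicted, truth = price, estimate = .fitted)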

Tuning Model Parameters: tune and dials

Let’s get a little bit more involved and do some hyperparameter tuning. We turn to a different model, namely, a random forest model.

The tune package has functions for doing the actual tuning (e.g., via grid search), while all the parameters and their defaults (e.g., mtry(), neighbors()) are implemented in dials. Thus, the two packages are almost always used in combination.

Preparing a parsnip Model for Tuning

First, I want to tune the mtry parameter of a random forest model. Thus, the model is defined using parsnip as above. However, rather than using a default value (i.e., mtry = NULL) or one specific value (i.e., mtry = 3), we use tune() as a placeholder and let cross-validation decide on the best value for mtry later on.

As the output indicates, the default minimum of mtry is 1 and the maximum depends on the data.

rf_model <- 
    rand_forest(mtry = tune()) %>%
    set_mode("regression") %>%
    set_engine("ranger")

parameters(rf_model)
#> Collection of 1 parameters for tuning
#> 
#>    id parameter type object class
#>  mtry           mtry    nparam[?]
#> 
#> Model parameters needing finalization:
#>    # Randomly Selected Predictors ('mtry')
#> 
#> See `?dials::finalize` or `?dials::update.parameters` for more information.
mtry()
#> # Randomly Selected Predictors  (quantitative)
#> Range: [1, ?]

Thus, this model is not yet ready for fitting. You can either specify the maximum for mtry yourself using update(), or you can use finalize() to let the data decide on the maximum.

rf_model %>% 
    parameters %>% 
    update(mtry = mtry(c(1L, 5L)))
#> Collection of 1 parameters for tuning
#> 
#>    id parameter type object class
#>  mtry           mtry    nparam[+]

rf_model %>% 
    parameters %>% 
    # Here, the maximum of mtry equals the number of predictors, i.e., 24.
    finalize(x = select(juice(dia_rec), -price)) %>% 
    magrittr::extract2("object")
#> [[1]]
#> # Randomly Selected Predictors  (quantitative)
#> Range: [1, 24]

Preparing Data for Tuning: recipes

The second thing I want to tune is the degree of the polynomial for the variable carat. As you saw in the plot above, polynomials up to a degree of four seemed well suited for the data. However, a simpler model might do equally well, and we want to use cross-validation to decide on the degree that works best.

Similar to tuning parameters in a model, certain aspects of a recipe can be tuned. Let’s define a second recipe and use tune() inside step_poly().

# Note that this recipe cannot be prepped (and juiced), since "degree" is unknown
dia_rec2 <-
    recipe(price ~ ., data = dia_train) %>%
    step_log(all_outcomes()) %>%
    step_normalize(all_predictors(), -all_nominal()) %>%
    step_dummy(all_nominal()) %>%
    step_poly(carat, degree = tune())

dia_rec2 %>% 
    parameters() %>% 
    magrittr::extract2("object")
#> [[1]]
#> Polynomial Degree  (quantitative)
#> Range: [1, 3]

Combine Everything: workflows

The workflows package is designed to bundle together different parts of a machine learning pipeline like a recipe or a model.

First, let’s create an initial workflow and add the recipe and the random forest model, both of which have a tuning parameter.

rf_wflow <-
    workflow() %>%
    add_model(rf_model) %>%
    add_recipe(dia_rec2)
rf_wflow
#> == Workflow ==============================================================================
#> Preprocessor: Recipe
#> Model: rand_forest()
#> 
#> -- Preprocessor --------------------------------------------------------------------------
#> 4 Recipe Steps
#> 
#> * step_log()
#> * step_normalize()
#> * step_dummy()
#> * step_poly()
#> 
#> -- Model ---------------------------------------------------------------------------------
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = tune()
#> 
#> Computational engine: ranger

Second, we need to update the parameters in rf_wflow, because the maximum of mtry is not yet known and the maximum of degree should be four (while three is the default).

rf_param <-
    rf_wflow %>%
    parameters() %>%
    update(mtry = mtry(range = c(3L, 5L)),
           degree = degree_int(range = c(2L, 4L)))
rf_param$object
#> [[1]]
#> # Randomly Selected Predictors  (quantitative)
#> Range: [3, 5]
#> 
#> [[2]]
#> Polynomial Degree  (quantitative)
#> Range: [2, 4]

Third, we want to use cross-validation for tuning, that is, to select the best combination of the hyperparameters. Bayesian optimization (see vignette("svm_classification", package = "tune")) is recommended for complex tuning problems, and this can be done using tune_bayes().
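For reference, here is a minimal sketch of what such a call could look like with the objects defined above (not run here; rf_search_bayes and the values for initial and iter are just illustrative):

# Bayesian optimization over the same parameter set (not run)
rf_search_bayes <- tune_bayes(rf_wflow, resamples = dia_vfold,
                              param_info = rf_param, initial = 5, iter = 10)
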

Herein, however, grid search will suffice. To this end, let’s create a grid of all necessary parameter combinations.

rf_grid <- grid_regular(rf_param, levels = 3)
rf_grid
#> # A tibble: 9 x 2
#>    mtry degree
#>   <int>  <int>
#> 1     3      2
#> 2     4      2
#> 3     5      2
#> 4     3      3
#> 5     4      3
#> 6     5      3
#> 7     3      4
#> 8     4      4
#> 9     5      4
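
As an aside, dials can also generate non-regular grids, e.g., random grids via grid_random() or space-filling designs via grid_max_entropy(). A random grid of the same size could be requested as follows (a sketch, not used below):

# A random grid as an alternative to the regular grid
set.seed(1)
grid_random(rf_param, size = 9)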

Cross-validation and hyperparameter tuning can involve fitting many models. Herein, for example, we have to fit 3 x 9 = 27 models (three folds times nine parameter combinations). To speed this up, we can fit the models in parallel. This is directly supported by the tune package (see vignette("optimizations", package = "tune")).

library("doFuture")
all_cores <- parallel::detectCores(logical = FALSE)

registerDoFuture()
cl <- parallel::makeCluster(all_cores)
plan(future::cluster, workers = cl)

Then, we can finally start tuning.

rf_search <- tune_grid(rf_wflow, grid = rf_grid, resamples = dia_vfold,
                       param_info = rf_param)
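
When the tuning is finished, the parallel workers can be shut down again and we can return to sequential processing (a minimal sketch):

# Stop the workers and switch back to sequential processing
parallel::stopCluster(cl)
plan(sequential)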

The results can be examined using autoplot() and show_best():

autoplot(rf_search, metric = "rmse") +
    labs(title = "Results of Grid Search for Two Tuning Parameters of a Random Forest")

show_best(rf_search, "rmse", maximize = FALSE, n = 9)
#> # A tibble: 9 x 7
#>    mtry degree .metric .estimator  mean     n std_err
#>   <int>  <int> <chr>   <chr>      <dbl> <int>   <dbl>
#> 1     5      2 rmse    standard   0.121     3 0.00498
#> 2     5      3 rmse    standard   0.121     3 0.00454
#> 3     4      2 rmse    standard   0.122     3 0.00463
#> 4     5      4 rmse    standard   0.122     3 0.00471
#> 5     4      3 rmse    standard   0.123     3 0.00469
#> 6     4      4 rmse    standard   0.124     3 0.00496
#> 7     3      3 rmse    standard   0.128     3 0.00502
#> 8     3      2 rmse    standard   0.128     3 0.00569
#> 9     3      4 rmse    standard   0.128     3 0.00501

select_best(rf_search, metric = "rmse", maximize = FALSE)
#> # A tibble: 1 x 2
#>    mtry degree
#>   <int>  <int>
#> 1     5      2

select_by_one_std_err(rf_search, mtry, degree,
                      metric = "rmse", maximize = FALSE)
#> # A tibble: 1 x 9
#>    mtry degree .metric .estimator  mean     n std_err .best .bound
#>   <int>  <int> <chr>   <chr>      <dbl> <int>   <dbl> <dbl>  <dbl>
#> 1     4      2 rmse    standard   0.122     3 0.00463 0.121  0.126

With a cross-validation RMSE of ca. 0.12, the random forest model seems to outperform the linear regression from above. Furthermore, 0.12 is (hopefully) a realistic estimate of the out-of-sample error.

Selecting the Best Model to Make the Final Predictions

We saw above that a quadratic trend was enough to get a good model. Furthermore, cross-validation revealed that mtry = 4 seems to perform well.

To use this combination of hyperparameters, we fit() the corresponding model (or workflow, more precisely) on the whole training data set dia_train. This model (or workflow) can then be used to predict() on data never seen before, namely, dia_test in the present case.

rf_param_final <- 
    select_by_one_std_err(rf_search, mtry, degree,
                          metric = "rmse", maximize = FALSE)

rf_wflow_final <- 
    finalize_workflow(rf_wflow, rf_param_final)

rf_wflow_final_fit <- 
    fit(rf_wflow_final, data = dia_train)

dia_test$.pred <- predict(rf_wflow_final_fit, new_data = dia_test)$.pred
dia_test$logprice <- log(dia_test$price)

metrics(dia_test, truth = logprice, estimate = .pred)
#> # A tibble: 3 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard      0.113 
#> 2 rsq     standard      0.988 
#> 3 mae     standard      0.0846

As you can see, we get an RMSE of 0.11, which is even slightly better than the cross-validation RMSE.
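
Keep in mind that, because of step_log() in the recipe, both the predictions and these metrics refer to the log scale of price. If predictions in the original units are needed, they can be back-transformed with exp(); a short sketch (.pred_usd is just an illustrative name):

# Back-transform the log-scale predictions to USD
dia_test %>% 
    mutate(.pred_usd = exp(.pred)) %>% 
    select(price, .pred, .pred_usd) %>% 
    head(3)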

Summary

The tidymodels ecosystem bundles together a set of packages that work hand in hand to solve machine-learning problems from start to finish. Together with the data-wrangling facilities in the tidyverse and the plotting tools from ggplot2, this makes for a rich toolbox for every data scientist working with R.

The only thing that is definitely missing in tidymodels is a package for combining different machine learning models (i.e., ensembling/stacking/super learners). We have caretEnsemble for caret, and I am sure the team at RStudio is working on something similar for tidymodels. Alex Hayes has a related blog post focusing on tidymodels, for those who can't wait.

Further Resources


  1. For a list of models available via parsnip, see https://tidymodels.github.io/parsnip/articles/articles/Models.html.

