Tutorial on tidymodels for Machine Learning

caret is a well-known R package for machine learning that covers almost everything from data pre-processing to cross-validation. Its unofficial successor is tidymodels, which takes a modular approach: specific, smaller packages are designed to work hand in hand. Thus, tidymodels is to modeling what the tidyverse is to data wrangling. Herein, I will walk through a machine learning example from start to finish and explain how to use the appropriate tidymodels packages at each step.

Set Up

Loading the tidymodels package loads a bunch of packages for modeling and also a few others from the tidyverse like ggplot2 and dplyr. Furthermore, we’ll need tune and workflows, which are also part of the tidymodels ecosystem but are not attached by library("tidymodels").

library("conflicted")
library("tidymodels")
#> -- Attaching packages ------------------------------------------------ tidymodels 0.0.3 --
#> v broom     0.5.4     v purrr     0.3.3
#> v dials     0.0.4     v recipes   0.1.9
#> v dplyr     0.8.3     v rsample   0.0.5
#> v ggplot2   3.2.1     v tibble    2.1.3
#> v infer     0.5.1     v yardstick 0.0.5
#> v parsnip   0.0.5
library("tune")
library("workflows")

# Additional packages for dataviz etc.
library("ggrepel")     # for geom_label_repel()
library("corrplot")    # for corrplot()
#> corrplot 0.84 loaded

conflict_prefer("filter", "dplyr")
#> [conflicted] Will prefer dplyr::filter over any other package

ggplot2::theme_set(theme_light())

Data Set: Diamonds

With ggplot2 comes the diamonds data set, which has information on the size and quality of diamonds. Herein, we’ll use these features to predict the price of a diamond.

data("diamonds")
diamonds %>%
    sample_n(2000) %>% 
    mutate_if(is.factor, as.numeric) %>%
    select(price, everything()) %>%
    cor %>%
    {.[order(abs(.[, 1]), decreasing = TRUE), 
       order(abs(.[, 1]), decreasing = TRUE)]} %>%
    corrplot(method = "number", type = "upper", mar = c(0, 0, 1.5, 0),
             title = "Correlations between price and various features of diamonds")

Separating Testing and Training Data: rsample

First of all, we want to set aside a data set for testing the predictions at the very end. We'll use only a small proportion of the data for training (just to speed things up a little). Furthermore, the training data set will be prepared for 3-fold cross-validation (again, using only three folds to keep things fast). All of this is accomplished using the rsample package:

set.seed(1243)

dia_split <- initial_split(diamonds, prop = .1, strata = price)

dia_train <- training(dia_split)
dia_test  <- testing(dia_split)

dim(dia_train)
#> [1] 5395   10
dim(dia_test)
#> [1] 48545    10

dia_vfold <- vfold_cv(dia_train, v = 3, repeats = 1, strata = price)
dia_vfold %>% 
  mutate(df_ana = map(splits, analysis),
         df_ass = map(splits, assessment))
#> #  3-fold cross-validation using stratification 
#> # A tibble: 3 x 4
#>   splits              id    df_ana                df_ass               
#> * <named list>        <chr> <named list>          <named list>         
#> 1 <split [3.6K/1.8K]> Fold1 <tibble [3,596 x 10]> <tibble [1,799 x 10]>
#> 2 <split [3.6K/1.8K]> Fold2 <tibble [3,596 x 10]> <tibble [1,799 x 10]>
#> 3 <split [3.6K/1.8K]> Fold3 <tibble [3,598 x 10]> <tibble [1,797 x 10]>

Data Pre-Processing and Feature Engineering: recipes

The recipes package can be used to prepare a data set (for modeling) using different step_*() functions. For example, the plot below indicates that there may be a nonlinear relationship between price and carat, and I want to address that using higher-order terms.

qplot(carat, price, data = dia_train) +
    scale_y_continuous(trans = log_trans(), labels = function(x) round(x, -2)) +
    geom_smooth(method = "lm", formula = "y ~ poly(x, 4)") +
    labs(title = "Nonlinear relationship between price and carat of diamonds",
         subtitle = "The degree of the polynomial is a potential tuning parameter")

The recipe() takes a formula and a data set, and then the different steps are added using the appropriate step_*() functions. The recipes package comes with a ton of useful step functions (see, e.g., vignette("Simple_Example", package = "recipes")).

Herein, I want to log-transform price (step_log()), center and scale all numeric predictors (step_normalize()), and dummy code the categorical predictors (step_dummy()). Furthermore, a quadratic effect of carat is added using step_poly().

dia_rec <-
    recipe(price ~ ., data = dia_train) %>%
    step_log(all_outcomes()) %>%
    step_normalize(all_predictors(), -all_nominal()) %>%
    step_dummy(all_nominal()) %>%
    step_poly(carat, degree = 2) %>% 
    prep() 

dia_rec
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          9
#> 
#> Training data contained 5395 data points and no missing data.
#> 
#> Operations:
#> 
#> Log transformation on price [trained]
#> Centering and scaling for carat, depth, table, x, y, z [trained]
#> Dummy variables from cut, color, clarity [trained]
#> Orthogonal polynomials on carat [trained]

Calling prep() on a recipe applies all the steps to the training data. You can then call juice() to extract the transformed training data, or call bake() to apply the trained recipe to a new data set.

# Note the linear and quadratic term for carat and the dummies for e.g. color
dia_juiced <- juice(dia_rec)
dim(dia_juiced)
#> [1] 5395   25
names(dia_juiced)
#>  [1] "depth"        "table"        "x"            "y"            "z"           
#>  [6] "price"        "cut_1"        "cut_2"        "cut_3"        "cut_4"       
#> [11] "color_1"      "color_2"      "color_3"      "color_4"      "color_5"     
#> [16] "color_6"      "clarity_1"    "clarity_2"    "clarity_3"    "clarity_4"   
#> [21] "clarity_5"    "clarity_6"    "clarity_7"    "carat_poly_1" "carat_poly_2"

Defining and Fitting Models: parsnip

The parsnip package has wrappers around many[1] popular machine learning algorithms, and you can fit them using a unified interface. This is extremely helpful, since you have to remember only one interface rather than dozens.

The models are separated into two modes/categories, namely, regression and classification (set_mode()). The model is defined using a function specific to each algorithm (e.g., linear_reg(), rand_forest()). Finally, the backend/engine/implementation is selected using set_engine().

Herein, I will start with a basic linear regression model as implemented in stats::lm().

lm_model <-
    linear_reg() %>%
    set_mode("regression") %>%
    set_engine("lm")

Furthermore, take the example of a random forest model, which could be fit using either the ranger or the randomForest package. The two have different interfaces (e.g., argument ntree vs. num.trees), and parsnip removes the hassle of remembering both. Arguments that pertain to the algorithm in general are specified in the algorithm function (e.g., rand_forest()), whereas arguments specific to the engine are specified in set_engine().

rand_forest(mtry = 3, trees = 500, min_n = 5) %>%
    set_mode("regression") %>%
    set_engine("ranger", importance = "impurity_corrected")

Finally, we can fit() the model.

lm_fit1 <- fit(lm_model, price ~ ., dia_juiced)
lm_fit1
#> parsnip model object
#> 
#> Fit time:  20ms 
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#>  (Intercept)         depth         table             x             y  
#>     7.711965      0.010871      0.005889      0.251155      0.054422  
#>            z         cut_1         cut_2         cut_3         cut_4  
#>     0.054196      0.106701     -0.026356      0.024207     -0.006191  
#>      color_1       color_2       color_3       color_4       color_5  
#>    -0.455831     -0.084108     -0.004810      0.009725     -0.005591  
#>      color_6     clarity_1     clarity_2     clarity_3     clarity_4  
#>    -0.009730      0.860961     -0.242698      0.132234     -0.052903  
#>    clarity_5     clarity_6     clarity_7  carat_poly_1  carat_poly_2  
#>     0.028996      0.002403      0.022235     51.663971    -17.508316

You can use fit() either with a formula (e.g., price ~ .) or by specifying x and y directly. In both cases, I recommend keeping only the variables you need when preparing the data set: with price ~ ., a newly added predictor can never be forgotten by accident, whereas a formula like y ~ a + b + c silently ignores a new variable d. Unnecessary variables can easily be dropped in the recipe using step_rm().
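
As a quick sketch of the x/y interface via parsnip's fit_xy() (lm_fit1b is just an illustrative name, and the result should be equivalent to lm_fit1):

# Fit the same model by passing predictors and outcome separately
lm_fit1b <- fit_xy(lm_model,
                   x = select(dia_juiced, -price),
                   y = dia_juiced$price)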

Summarizing Fitted Models: broom

Many models have summary() or coef() methods. However, their output is usually not in a tidy format, and the broom package aims to resolve this issue.

glance() gives us information about the whole model. Here, R squared is pretty high and the RMSE equals 0.154.

glance(lm_fit1$fit)
#> # A tibble: 1 x 11
#>   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC deviance
#>       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl>  <dbl>  <dbl>    <dbl>
#> 1     0.977         0.977 0.154     9607.       0    25  2457. -4863. -4691.     127.
#> # ... with 1 more variable: df.residual <int>

tidy() gives us information about the model parameters, and we see that we have a significant quadratic effect of carat.

tidy(lm_fit1) %>% 
  arrange(desc(abs(statistic)))
#> # A tibble: 25 x 5
#>    term         estimate std.error statistic  p.value
#>    <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#>  1 (Intercept)    7.71     0.00431   1790.   0.      
#>  2 carat_poly_2 -17.5      0.259      -67.7  0.      
#>  3 clarity_1      0.861    0.0135      64.0  0.      
#>  4 color_1       -0.456    0.00738    -61.8  0.      
#>  5 carat_poly_1  51.7      1.10        47.0  0.      
#>  6 clarity_2     -0.243    0.0127     -19.2  3.63e-79
#>  7 color_2       -0.0841   0.00665    -12.7  3.50e-36
#>  8 clarity_3      0.132    0.0107      12.3  2.51e-34
#>  9 cut_1          0.107    0.00997     10.7  1.88e-26
#> 10 clarity_4     -0.0529   0.00839     -6.30 3.15e-10
#> # ... with 15 more rows

Finally, augment() can be used to get model predictions, residuals, etc.

lm_predicted <- augment(lm_fit1$fit, data = dia_juiced) %>% 
    rowid_to_column()
select(lm_predicted, rowid, price, .fitted:.std.resid)
#> # A tibble: 5,395 x 9
#>    rowid price .fitted .se.fit  .resid    .hat .sigma   .cooksd .std.resid
#>    <int> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
#>  1     1  5.83    5.90 0.0133  -0.0769 0.00744  0.154 0.0000755     -0.502
#>  2     2  5.84    5.80 0.0110   0.0472 0.00507  0.154 0.0000193      0.307
#>  3     3  5.86    5.89 0.0138  -0.0286 0.00801  0.154 0.0000112     -0.187
#>  4     4  5.88    6.24 0.00988 -0.360  0.00413  0.154 0.000910      -2.34 
#>  5     5  6.00    6.24 0.0104  -0.244  0.00458  0.154 0.000467      -1.59 
#>  6     6  6.00    6.06 0.0100  -0.0615 0.00427  0.154 0.0000275     -0.401
#>  7     7  6.00    6.10 0.0111  -0.0999 0.00520  0.154 0.0000887     -0.651
#>  8     8  6.00    6.06 0.0104  -0.0566 0.00459  0.154 0.0000251     -0.369
#>  9     9  6.00    6.22 0.00895 -0.216  0.00339  0.154 0.000270      -1.41 
#> 10    10  6.32    6.55 0.00872 -0.233  0.00321  0.154 0.000297      -1.52 
#> # ... with 5,385 more rows

A plot of the predicted vs. actual prices shows small residuals with a few outliers, which are not well explained by the model.

ggplot(lm_predicted, aes(.fitted, price)) +
    geom_point(alpha = .2) +
    ggrepel::geom_label_repel(aes(label = rowid), 
                              data = filter(lm_predicted, abs(.resid) > 2)) +
    labs(title = "Actual vs. Predicted Price of Diamonds")

Evaluating Model Performance: yardstick

We already saw the performance measures RMSE and R squared in the output of glance() above. The yardstick package is specifically designed for such measures, for both numeric and categorical outcomes, and it plays well with grouped data frames, which makes it easy to compute metrics across multiple resamples at once.
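
As a quick illustration before turning to cross-validation, the default metrics can be computed directly from the in-sample predictions stored in lm_predicted above (a minimal sketch; the RMSE should be close to the 0.154 reported by glance()):

# In-sample metrics for the linear model fitted above
metrics(lm_predicted, truth = price, estimate = .fitted)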

Let’s use rsample, parsnip, and yardstick for cross-validation to get a more accurate estimation of RMSE.

In the following pipeline, analysis() and assessment() are first used to extract the respective folds, and bake() applies the recipe steps to these data sets. Then, the model is fit() separately to the three analysis data sets, and the fitted models are used to predict() on the three corresponding assessment data sets (i.e., 3-fold cross-validation). (Note that one could have created dia_vfold from the prepped rather than the raw data in the first place, making these two bake() calls unnecessary.)

Herein, I use list columns to store all information about the three folds in one data frame and a combination of dplyr::mutate() and purrr::map() to “loop” across the rows of the data frame.

lm_fit2 <- dia_vfold %>% 
    mutate(
        df_ana = map(splits,  analysis),
        df_ana = map(df_ana, ~bake(dia_rec, .x)),
        df_ass = map(splits,  assessment),
        df_ass = map(df_ass, ~bake(dia_rec, .x)),
        model_fit  = map(df_ana, ~fit(lm_model, price ~ ., data = .x)),
        model_pred = map2(model_fit, df_ass, ~predict(.x, new_data = .y)))
lm_fit2
#> #  3-fold cross-validation using stratification 
#> # A tibble: 3 x 6
#>   splits         id    df_ana          df_ass         model_fit  model_pred     
#> * <named list>   <chr> <named list>    <named list>   <named li> <named list>   
#> 1 <split [3.6K/~ Fold1 <tibble [3,596~ <tibble [1,79~ <fit[+]>   <tibble [1,799~
#> 2 <split [3.6K/~ Fold2 <tibble [3,596~ <tibble [1,79~ <fit[+]>   <tibble [1,799~
#> 3 <split [3.6K/~ Fold3 <tibble [3,598~ <tibble [1,79~ <fit[+]>   <tibble [1,797~

Now, we can extract the actual prices from the assessment data and compare them to the predicted prices. Across the three folds, we see that the RMSE is a little higher and R squared a little smaller compared to above (see output of glance(lm_fit1$fit)). This is expected, since out-of-sample prediction is harder but also way more useful.

lm_fit2 %>% 
    mutate(res = map2(df_ass, model_pred, ~data.frame(price = .x$price,
                                                      .pred = .y$.pred))) %>% 
    select(id, res) %>% 
    unnest(res) %>% 
    group_by(id) %>% 
    metrics(truth = price, estimate = .pred)
#> # A tibble: 9 x 4
#>   id    .metric .estimator .estimate
#>   <chr> <chr>   <chr>          <dbl>
#> 1 Fold1 rmse    standard       0.168
#> 2 Fold2 rmse    standard       0.147
#> 3 Fold3 rmse    standard       0.298
#> 4 Fold1 rsq     standard       0.973
#> 5 Fold2 rsq     standard       0.979
#> 6 Fold3 rsq     standard       0.918
#> 7 Fold1 mae     standard       0.116
#> 8 Fold2 mae     standard       0.115
#> 9 Fold3 mae     standard       0.110

Note that yardstick::metrics() has default measures for numeric and categorical outcomes; here, RMSE, R squared, and the mean absolute error (MAE) are returned. You could also use a single metric directly, like rmse(), or define a custom set of metrics via metric_set().
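
For example, a custom metric set containing only RMSE and MAE could be defined and applied like this (a sketch using the in-sample predictions from above; my_metrics is just an illustrative name):

# Define and apply a custom metric set
my_metrics <- metric_set(rmse, mae)
my_metrics(lm_predicted, truth = price, estimate = .fitted)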

Tuning Model Parameters: tune and dials

Let’s get a little bit more involved and do some hyperparameter tuning. We turn to a different model, namely, a random forest model.

The tune package has functions for doing the actual tuning (e.g., via grid search), while all the parameters and their defaults (e.g., mtry(), neighbors()) are implemented in dials. Thus, the two packages are almost always used in combination.

Preparing a parsnip Model for Tuning

First, I want to tune the mtry parameter of a random forest model. Thus, the model is defined using parsnip as above. However, rather than using a default value (i.e., mtry = NULL) or one specific value (i.e., mtry = 3), we use tune() as a placeholder and let cross-validation decide on the best value for mtry later on.

As the output indicates, the default minimum of mtry is 1 and the maximum depends on the data.

rf_model <- 
    rand_forest(mtry = tune()) %>%
    set_mode("regression") %>%
    set_engine("ranger")

parameters(rf_model)
#> Collection of 1 parameters for tuning
#> 
#>    id parameter type object class
#>  mtry           mtry    nparam[?]
#> 
#> Model parameters needing finalization:
#>    # Randomly Selected Predictors ('mtry')
#> 
#> See `?dials::finalize` or `?dials::update.parameters` for more information.
mtry()
#> # Randomly Selected Predictors  (quantitative)
#> Range: [1, ?]

Thus, this model is not yet ready for fitting. You can either specify the maximum for mtry yourself using update(), or you can use finalize() to let the data decide on the maximum.

rf_model %>% 
    parameters %>% 
    update(mtry = mtry(c(1L, 5L)))
#> Collection of 1 parameters for tuning
#> 
#>    id parameter type object class
#>  mtry           mtry    nparam[+]

rf_model %>% 
    parameters %>% 
    # Here, the maximum of mtry equals the number of predictors, i.e., 24.
    finalize(x = select(juice(dia_rec), -price)) %>% 
    magrittr::extract2("object")
#> [[1]]
#> # Randomly Selected Predictors  (quantitative)
#> Range: [1, 24]

Preparing Data for Tuning: recipes

The second thing I want to tune is the degree of the polynomial for the variable carat. As you saw in the plot above, polynomials up to a degree of four seemed well suited for the data. However, a simpler model might do equally well, and we want to use cross-validation to decide on the degree that works best.

Similar to tuning parameters in a model, certain aspects of a recipe can be tuned. Let’s define a second recipe and use tune() inside step_poly().

# Note that this recipe cannot be prepped (and juiced), since "degree" is unknown
dia_rec2 <-
    recipe(price ~ ., data = dia_train) %>%
    step_log(all_outcomes()) %>%
    step_normalize(all_predictors(), -all_nominal()) %>%
    step_dummy(all_nominal()) %>%
    step_poly(carat, degree = tune())

dia_rec2 %>% 
    parameters() %>% 
    magrittr::extract2("object")
#> [[1]]
#> Polynomial Degree  (quantitative)
#> Range: [1, 3]

Combine Everything: workflows

The workflows package is designed to bundle together different parts of a machine learning pipeline like a recipe or a model.

First, let’s create an initial workflow and add the recipe and the random forest model, both of which have a tuning parameter.

rf_wflow <-
    workflow() %>%
    add_model(rf_model) %>%
    add_recipe(dia_rec2)
rf_wflow
#> == Workflow ==============================================================================
#> Preprocessor: Recipe
#> Model: rand_forest()
#> 
#> -- Preprocessor --------------------------------------------------------------------------
#> 4 Recipe Steps
#> 
#> * step_log()
#> * step_normalize()
#> * step_dummy()
#> * step_poly()
#> 
#> -- Model ---------------------------------------------------------------------------------
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = tune()
#> 
#> Computational engine: ranger

Second, we need to update the parameters in rf_wflow, because the maximum of mtry is not yet known and the maximum of degree should be four (while three is the default).

rf_param <-
    rf_wflow %>%
    parameters() %>%
    update(mtry = mtry(range = c(3L, 5L)),
           degree = degree_int(range = c(2L, 4L)))
rf_param$object
#> [[1]]
#> # Randomly Selected Predictors  (quantitative)
#> Range: [3, 5]
#> 
#> [[2]]
#> Polynomial Degree  (quantitative)
#> Range: [2, 4]

Third, we want to use cross-validation for tuning, that is, to select the best combination of the hyperparameters. Bayesian optimization (see vignette("svm_classification", package = "tune")) is recommended for complex tuning problems, and this can be done using tune_bayes().
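For reference, here is a minimal sketch of what such a call could look like with the objects defined above (not run here; rf_search_bayes and the values for initial and iter are just illustrative):

# Bayesian optimization over the same parameter set (not run)
rf_search_bayes <- tune_bayes(rf_wflow, resamples = dia_vfold,
                              param_info = rf_param, initial = 5, iter = 10)
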

Herein, however, grid search will suffice. To this end, let’s create a grid of all necessary parameter combinations.

rf_grid <- grid_regular(rf_param, levels = 3)
rf_grid
#> # A tibble: 9 x 2
#>    mtry degree
#>   <int>  <int>
#> 1     3      2
#> 2     4      2
#> 3     5      2
#> 4     3      3
#> 5     4      3
#> 6     5      3
#> 7     3      4
#> 8     4      4
#> 9     5      4
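
As an aside, dials can also generate non-regular grids, e.g., random grids via grid_random() or space-filling designs via grid_max_entropy(). A random grid of the same size could be requested as follows (a sketch, not used below):

# A random grid as an alternative to the regular grid
set.seed(1)
grid_random(rf_param, size = 9)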

Cross-validation and hyperparameter tuning can involve fitting many models. Herein, for example, we have to fit 3 x 9 = 27 models (three folds times nine parameter combinations). To speed this up, we can fit the models in parallel. This is directly supported by the tune package (see vignette("optimizations", package = "tune")).

library("doFuture")
all_cores <- parallel::detectCores(logical = FALSE)

registerDoFuture()
cl <- parallel::makeCluster(all_cores)
plan(future::cluster, workers = cl)

Then, we can finally start tuning.

rf_search <- tune_grid(rf_wflow, grid = rf_grid, resamples = dia_vfold,
                       param_info = rf_param)
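
When the tuning is finished, the parallel workers can be shut down again and we can return to sequential processing (a minimal sketch):

# Stop the workers and switch back to sequential processing
parallel::stopCluster(cl)
plan(sequential)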

The results can be examined using autoplot() and show_best():

autoplot(rf_search, metric = "rmse") +
    labs(title = "Results of Grid Search for Two Tuning Parameters of a Random Forest")

show_best(rf_search, "rmse", maximize = FALSE, n = 9)
#> # A tibble: 9 x 7
#>    mtry degree .metric .estimator  mean     n std_err
#>   <int>  <int> <chr>   <chr>      <dbl> <int>   <dbl>
#> 1     5      2 rmse    standard   0.121     3 0.00498
#> 2     5      3 rmse    standard   0.121     3 0.00454
#> 3     4      2 rmse    standard   0.122     3 0.00463
#> 4     5      4 rmse    standard   0.122     3 0.00471
#> 5     4      3 rmse    standard   0.123     3 0.00469
#> 6     4      4 rmse    standard   0.124     3 0.00496
#> 7     3      3 rmse    standard   0.128     3 0.00502
#> 8     3      2 rmse    standard   0.128     3 0.00569
#> 9     3      4 rmse    standard   0.128     3 0.00501

select_best(rf_search, metric = "rmse", maximize = FALSE)
#> # A tibble: 1 x 2
#>    mtry degree
#>   <int>  <int>
#> 1     5      2

select_by_one_std_err(rf_search, mtry, degree,
                      metric = "rmse", maximize = FALSE)
#> # A tibble: 1 x 9
#>    mtry degree .metric .estimator  mean     n std_err .best .bound
#>   <int>  <int> <chr>   <chr>      <dbl> <int>   <dbl> <dbl>  <dbl>
#> 1     4      2 rmse    standard   0.122     3 0.00463 0.121  0.126

With a cross-validation RMSE of ca. 0.12, the random forest model seems to outperform the linear regression from above. Furthermore, 0.12 is (hopefully) a realistic estimate of the out-of-sample error.

Selecting the Best Model to Make the Final Predictions

We saw above that a quadratic trend was enough to get a good model. Furthermore, cross-validation revealed that mtry = 4 seems to perform well.

To use this combination of hyperparameters, we fit() the corresponding model (or workflow, more precisely) on the whole training data set dia_train. This model (or workflow) can then be used to predict() on data never seen before, namely, dia_test in the present case.

rf_param_final <- 
    select_by_one_std_err(rf_search, mtry, degree,
                          metric = "rmse", maximize = FALSE)

rf_wflow_final <- 
    finalize_workflow(rf_wflow, rf_param_final)

rf_wflow_final_fit <- 
    fit(rf_wflow_final, data = dia_train)

dia_test$.pred <- predict(rf_wflow_final_fit, new_data = dia_test)$.pred
dia_test$logprice <- log(dia_test$price)

metrics(dia_test, truth = logprice, estimate = .pred)
#> # A tibble: 3 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard      0.113 
#> 2 rsq     standard      0.988 
#> 3 mae     standard      0.0846

As you can see, we get an RMSE of 0.11, which is even slightly better than the cross-validation RMSE.
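
Keep in mind that, because of step_log() in the recipe, both the predictions and these metrics refer to the log scale of price. If predictions in the original units are needed, they can be back-transformed with exp(); a short sketch (.pred_usd is just an illustrative name):

# Back-transform the log-scale predictions to USD
dia_test %>% 
    mutate(.pred_usd = exp(.pred)) %>% 
    select(price, .pred, .pred_usd) %>% 
    head(3)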

Summary

The tidymodels ecosystem bundles together a set of packages that work hand in hand to solve machine-learning problems from start to finish. Together with the data-wrangling facilities in the tidyverse and the plotting tools from ggplot2, this makes for a rich toolbox for every data scientist working with R.

The only thing that is definitely missing in tidymodels is a package for combining different machine learning models (i.e., ensembling/stacking/super learners). We have caretEnsemble for caret, and I am sure the team at RStudio is working on something similar for tidymodels. Alex Hayes has a related blog post focusing on tidymodels, for those who can't wait.

Further Resources


  1. For a list of models available via parsnip, see https://tidymodels.github.io/parsnip/articles/articles/Models.html.

