Creates a list of row indices for k-fold cross-validation (basic, stratified, grouped, or blocked). Repeated fold creation is also supported. By default, in-sample indices are returned.

create_folds(
  y,
  k = 5L,
  type = c("stratified", "basic", "grouped", "blocked"),
  n_bins = 10L,
  m_rep = 1L,
  use_names = TRUE,
  invert = FALSE,
  shuffle = FALSE,
  seed = NULL
)

Arguments

y

The variable used for stratified or grouped splits. For other split types, any vector with the same length as the data to be split.

k

Number of folds.

type

Split type. One of "stratified" (default), "basic", "grouped", "blocked".

n_bins

Approximate number of bins for numeric y (only for type = "stratified").

m_rep

How many times should the data be split into k folds? Default is 1, i.e., no repetitions.

use_names

Should folds be named? Default is TRUE.

invert

Set to TRUE in order to receive out-of-sample indices. Default is FALSE, i.e., in-sample indices are returned.

shuffle

Should row indices be randomly shuffled within folds? Default is FALSE.

seed

Integer random seed.

Value

If invert = FALSE (the default), a list with in-sample row indices. If invert = TRUE, a list with out-of-sample indices.

Details

By default, the function uses stratified splitting, which balances the folds with respect to the distribution of the input vector y. (Numeric input is first binned into approximately n_bins quantile groups.) If type = "grouped", all rows belonging to the same group in y are kept together when splitting. This is relevant for clustered or panel data. In contrast to basic splitting, type = "blocked" does not sample indices at random but keeps them in sequential blocks.
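To make the blocked scheme and the in-sample/out-of-sample distinction concrete, here is a minimal base R sketch. The helper `blocked_folds()` is hypothetical and is not the package's implementation; it only assumes that blocks are consecutive and of nearly equal size.

```r
# Hypothetical helper, NOT part of the package: blocked k-fold indices in base R
blocked_folds <- function(n, k, invert = FALSE) {
  stopifnot(n >= k, k >= 2)
  idx <- seq_len(n)

  # Assign consecutive rows to k blocks of nearly equal size
  block <- sort(rep_len(seq_len(k), n))

  # Each block serves once as the held-out (out-of-sample) part
  out_of_sample <- split(idx, block)
  names(out_of_sample) <- paste0("Fold", seq_len(k))
  if (invert) {
    return(out_of_sample)
  }

  # In-sample indices are all rows outside the held-out block
  lapply(out_of_sample, function(z) setdiff(idx, z))
}

folds <- blocked_folds(20, k = 3)
folds$Fold1  # rows 8 to 20, since rows 1 to 7 form the first held-out block
```

For n = 20 and k = 3, this sketch yields the same index sets as the type = "blocked" example in the Examples section, but that agreement should not be relied on for other inputs.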

Examples

# Stratified folds are sampled at random; set seed for reproducible results
y <- rep(letters[1:4], each = 5)
create_folds(y)
#> $Fold1
#>  [1]  1  2  3  5  6  7  9 10 11 12 13 15 16 18 19 20
#> 
#> $Fold2
#>  [1]  1  2  3  4  6  8  9 10 11 13 14 15 16 17 18 20
#> 
#> $Fold3
#>  [1]  1  3  4  5  6  7  8 10 11 12 13 14 17 18 19 20
#> 
#> $Fold4
#>  [1]  1  2  4  5  7  8  9 10 11 12 14 15 16 17 19 20
#> 
#> $Fold5
#>  [1]  2  3  4  5  6  7  8  9 12 13 14 15 16 17 18 19
#> 
create_folds(y, k = 2)
#> $Fold1
#>  [1]  2  3  4  8  9 13 14 15 17 18 20
#> 
#> $Fold2
#> [1]  1  5  6  7 10 11 12 16 19
#> 
create_folds(y, k = 2, m_rep = 2)
#> $Fold1.Rep1
#>  [1]  4  5  8  9 10 11 13 14 18 19
#> 
#> $Fold2.Rep1
#>  [1]  1  2  3  6  7 12 15 16 17 20
#> 
#> $Fold1.Rep2
#> [1]  1  2  6 10 12 13 16 19
#> 
#> $Fold2.Rep2
#>  [1]  3  4  5  7  8  9 11 14 15 17 18 20
#> 
create_folds(y, k = 3, type = "blocked")
#> $Fold1
#>  [1]  8  9 10 11 12 13 14 15 16 17 18 19 20
#> 
#> $Fold2
#>  [1]  1  2  3  4  5  6  7 15 16 17 18 19 20
#> 
#> $Fold3
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14
#>
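The following sketch illustrates grouped splitting together with invert and seed. It assumes the package is installed as splitTools; the `subject` vector is made-up toy data. The two checks at the end are expectations that follow from the descriptions above (groups stay together, and in-sample and out-of-sample indices should complement each other when the same seed is used), not documented guarantees.

```r
library(splitTools)  # assumption: create_folds() comes from this package

# Hypothetical panel data: 6 subjects with 4 rows each
subject <- rep(paste0("s", 1:6), each = 4)

out_sample <- create_folds(subject, k = 3, type = "grouped", invert = TRUE, seed = 1)
in_sample  <- create_folds(subject, k = 3, type = "grouped", seed = 1)

# Record for every row the fold in which it is held out
fold_of_row <- integer(length(subject))
for (i in seq_along(out_sample)) {
  fold_of_row[out_sample[[i]]] <- i
}

# Number of distinct held-out folds per subject (expected: 1 for each subject)
folds_per_subject <- tapply(fold_of_row, subject, function(z) length(unique(z)))

# With the same seed, in-sample and out-of-sample indices should partition the rows
complementary <- all(mapply(
  function(a, b) setequal(c(a, b), seq_along(subject)) && length(intersect(a, b)) == 0,
  in_sample, out_sample
))
```

Using the same seed in both calls is what makes the two results comparable; without it, each call would draw a different random assignment of groups to folds.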