Creates a list of row indices for k-fold cross-validation (basic, stratified, grouped, or blocked). Repeated fold creation is also supported. By default, in-sample indices are returned.
create_folds(
y,
k = 5L,
type = c("stratified", "basic", "grouped", "blocked"),
n_bins = 10L,
m_rep = 1L,
use_names = TRUE,
invert = FALSE,
shuffle = FALSE,
seed = NULL
)
y: The variable used for stratified or grouped splits. For other split types, any vector of the same length as the data to be split.
k: Number of folds.
type: Split type. One of "stratified" (default), "basic", "grouped", "blocked".
n_bins: Approximate number of bins for numeric y (only for type = "stratified").
m_rep: How many times should the data be split into k folds? Default is 1, i.e., no repetitions.
use_names: Should folds be named? Default is TRUE.
invert: Set to TRUE in order to receive out-of-sample indices. Default is FALSE, i.e., in-sample indices are returned.
shuffle: Should row indices be randomly shuffled within folds? Default is FALSE.
seed: Integer random seed.
If invert = FALSE (the default), a list with in-sample row indices.
If invert = TRUE, a list with out-of-sample indices.
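The relationship between the two return modes can be sketched in base R. This assumes that the out-of-sample rows of a fold are exactly the rows missing from its in-sample set; the index vectors below are illustrative values for a two-fold split of 20 rows, not output computed here.

```r
n <- 20

# In-sample (training) indices of a two-fold split of 20 rows (illustrative values)
in_sample <- list(
  Fold1 = c(2, 3, 4, 8, 9, 13, 14, 15, 17, 18, 20),
  Fold2 = c(1, 5, 6, 7, 10, 11, 12, 16, 19)
)

# Out-of-sample (validation) indices: every row not in the fold's in-sample set
out_of_sample <- lapply(in_sample, function(idx) setdiff(seq_len(n), idx))

out_of_sample$Fold1
```

Under that assumption, each row appears in the out-of-sample set of exactly one fold, which is what makes the out-of-sample form convenient for collecting cross-validated predictions.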
By default, the function uses stratified splitting. This balances the folds with respect to the distribution of the input vector y. (Numeric input is first binned into n_bins quantile groups.)
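To make the idea of stratified fold assignment concrete, here is a minimal base R sketch. The helper `stratified_fold_id()` is hypothetical and is not the function's actual implementation; it assumes `y` has no missing values and, for numeric `y`, more than one distinct value.

```r
# Hypothetical helper (illustration only): returns a fold number for each row
stratified_fold_id <- function(y, k = 5L, n_bins = 10L) {
  if (is.numeric(y)) {
    # Bin numeric values into (at most) n_bins quantile groups
    probs <- seq(0, 1, length.out = n_bins + 1L)
    breaks <- unique(quantile(y, probs = probs, names = FALSE))
    y <- cut(y, breaks = breaks, include.lowest = TRUE)
  }
  fold_id <- integer(length(y))
  for (idx in split(seq_along(y), y)) {
    # Within each stratum, shuffle the rows and deal them to folds in turn
    idx <- idx[sample.int(length(idx))]
    fold_id[idx] <- rep_len(seq_len(k), length(idx))
  }
  fold_id
}

set.seed(1)
y <- rep(letters[1:4], each = 5)
fold_id <- stratified_fold_id(y, k = 5L)
table(fold_id, y)  # each fold receives one row from every level of y
```

In this sketch, `which(fold_id == j)` gives the out-of-sample rows of fold `j` and `which(fold_id != j)` the in-sample rows.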
If type = "grouped", groups specified by y are kept together when splitting. This is relevant for clustered or panel data.
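Grouped splitting can be pictured as assigning folds to distinct groups rather than to individual rows. The following base R sketch uses a hypothetical helper, `grouped_fold_id()`, and only illustrates the principle; the function itself may balance groups differently.

```r
# Hypothetical helper (illustration only): all rows of a group share one fold
grouped_fold_id <- function(g, k = 5L) {
  groups <- unique(g)
  # Assign each distinct group (not each row) to a fold, in random order
  group_fold <- sample(rep_len(seq_len(k), length(groups)))
  # Map the group-level assignment back to the rows
  group_fold[match(g, groups)]
}

set.seed(1)
g <- rep(1:6, each = 3)  # six groups (e.g. subjects) with three rows each
fold_id <- grouped_fold_id(g, k = 3L)
table(fold_id, g)  # every column has a single non-zero cell
```

Because no group is spread over several folds, rows from the same cluster never appear on both the training and the validation side of a fold.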
In contrast to basic splitting, type = "blocked" does not sample indices at random, but rather keeps them in sequential groups.
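A rough base R sketch of blocked assignment follows. The helper `blocked_fold_id()` is hypothetical, and how the function handles block sizes that do not divide evenly is an assumption here.

```r
# Hypothetical helper (illustration only): consecutive rows share a fold
blocked_fold_id <- function(n, k) {
  # Fold sizes differ by at most one row; earlier folds get the extra rows
  sort(rep_len(seq_len(k), n))
}

fold_id <- blocked_fold_id(20, 3)
which(fold_id == 1)  # rows 1 to 7 form the first block
which(fold_id != 1)  # rows 8 to 20 are the in-sample rows for that fold
```

No random numbers are involved, so the row order of the data (for instance time order) fully determines the folds.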
y <- rep(c(letters[1:4]), each = 5)
create_folds(y)
#> $Fold1
#> [1] 1 2 3 5 6 7 9 10 11 12 13 15 16 18 19 20
#>
#> $Fold2
#> [1] 1 2 3 4 6 8 9 10 11 13 14 15 16 17 18 20
#>
#> $Fold3
#> [1] 1 3 4 5 6 7 8 10 11 12 13 14 17 18 19 20
#>
#> $Fold4
#> [1] 1 2 4 5 7 8 9 10 11 12 14 15 16 17 19 20
#>
#> $Fold5
#> [1] 2 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19
#>
create_folds(y, k = 2)
#> $Fold1
#> [1] 2 3 4 8 9 13 14 15 17 18 20
#>
#> $Fold2
#> [1] 1 5 6 7 10 11 12 16 19
#>
create_folds(y, k = 2, m_rep = 2)
#> $Fold1.Rep1
#> [1] 4 5 8 9 10 11 13 14 18 19
#>
#> $Fold2.Rep1
#> [1] 1 2 3 6 7 12 15 16 17 20
#>
#> $Fold1.Rep2
#> [1] 1 2 6 10 12 13 16 19
#>
#> $Fold2.Rep2
#> [1] 3 4 5 7 8 9 11 14 15 17 18 20
#>
create_folds(y, k = 3, type = "blocked")
#> $Fold1
#> [1] 8 9 10 11 12 13 14 15 16 17 18 19 20
#>
#> $Fold2
#> [1] 1 2 3 4 5 6 7 15 16 17 18 19 20
#>
#> $Fold3
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14
#>
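The returned index lists are typically consumed in a cross-validation loop. The sketch below assumes that `create_folds()` is provided by the splitTools package and uses the built-in `iris` data with a simple linear model; both choices are illustrative only.

```r
library(splitTools)  # assumption: the package that provides create_folds()

# In-sample (training) indices per fold, stratified by Species
folds <- create_folds(iris$Species, k = 5, seed = 1)

# Fit on the in-sample rows of each fold, evaluate on the remaining rows
rmse <- vapply(folds, function(train_idx) {
  fit <- lm(Sepal.Length ~ Petal.Length, data = iris[train_idx, ])
  test <- iris[-train_idx, ]
  sqrt(mean((test$Sepal.Length - predict(fit, newdata = test))^2))
}, numeric(1))

mean(rmse)  # cross-validated RMSE
```

With `invert = TRUE` the loop would instead receive the validation rows directly, and the training rows would be obtained by negative indexing.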