This function provides row indices for data splitting, e.g., to split data into training, validation, and test sets. Different split strategies are supported, see Details. The partition indices are returned either as a list with one element per partition (the default) or as a vector of partition IDs.
partition(
y,
p,
type = c("stratified", "basic", "grouped", "blocked"),
n_bins = 10L,
split_into_list = TRUE,
use_names = TRUE,
shuffle = FALSE,
seed = NULL
)
y: Either the variable used for "stratified" or "grouped" splits. For other split types, any vector of the same length as the data to be split.
p: A vector of split probabilities per partition, e.g., c(train = 0.7, valid = 0.3). Names are passed to the output.
type: Split type. One of "stratified" (default), "basic", "grouped", "blocked".
n_bins: Approximate number of bins for numeric y (only for type = "stratified").
split_into_list: Should the resulting partition vector be split into a list? Default is TRUE.
use_names: Should names of p be used as partition names? Default is TRUE.
shuffle: Should row indices be randomly shuffled within each partition? Default is FALSE. Shuffling is only possible when split_into_list = TRUE.
seed: Integer random seed.
A list with row indices per partition (if split_into_list = TRUE) or a vector of partition IDs.
By default, the function uses stratified splitting. This balances the partitions as well as possible with respect to the distribution of the input vector y. (Numeric input is first binned into n_bins quantile groups.)
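
The idea of quantile binning followed by sampling within each bin can be sketched in base R. This is an illustration under stated assumptions, not necessarily how partition() works internally: the simulated y, the 70% training fraction, and the helper names breaks, bins, idx_by_bin, and train are all hypothetical.

```r
# Illustrative sketch only; not the package's internal implementation.
set.seed(1)
y <- rnorm(100)   # hypothetical numeric response
n_bins <- 10L

# Bin edges at equally spaced quantiles; unique() guards against tied edges
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = n_bins + 1L)))
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

# Draw about 70% of the row indices within each bin, so that the
# training set mirrors the distribution of y
idx_by_bin <- split(seq_along(y), bins)
train <- unlist(
  lapply(idx_by_bin, function(i) i[sample.int(length(i), round(0.7 * length(i)))]),
  use.names = FALSE
)

table(bins)      # observations per bin
length(train)    # size of the training partition
```

Sampling via sample.int() on positions avoids the well-known pitfall of sample() when a bin contains a single index.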
If type = "grouped", groups specified by y are kept together when splitting. This is relevant for clustered or panel data.
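
A minimal base R sketch of what keeping groups together means: whole groups, not single rows, are assigned to a partition. The choice of two training groups and the names groups, train_groups, train, and test are assumptions made for illustration; partition() may select groups differently.

```r
# Illustrative sketch only; not the package's internal implementation.
y <- rep(letters[1:4], each = 5)   # hypothetical grouping: 4 groups, 5 rows each

set.seed(1)
groups <- unique(y)
train_groups <- sample(groups, size = 2)   # assumption: half of the groups go to training

# All rows of a group land in the same partition
train <- which(y %in% train_groups)
test <- which(!(y %in% train_groups))

# No group appears in both partitions
intersect(y[train], y[test])
```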
In contrast to basic splitting, type = "blocked" does not sample indices at random but keeps them in contiguous blocks: e.g., the first 80% of observations form the training set and the remaining 20% are used for testing.
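
The 80/20 case described above can be reproduced in base R by cutting the row sequence at the cumulative split fractions. This is a sketch under the assumption that block boundaries are rounded to whole rows; the names n, block, and res are hypothetical and the package may handle rounding differently.

```r
# Illustrative sketch only; not the package's internal implementation.
n <- 20L
p <- c(train = 0.8, test = 0.2)

# Cumulative fractions mark the upper boundary of each block
upper <- round(cumsum(p) * n)
block <- cut(seq_len(n), breaks = c(0, upper), labels = names(p))

# Row indices per block, in their original order
res <- split(seq_len(n), block)
res
```

Because the order of rows is preserved, this kind of split is typically what one wants for time-ordered data.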
y <- rep(c(letters[1:4]), each = 5)
partition(y, p = c(0.7, 0.3), seed = 1)
#> $`1`
#> [1] 1 2 3 5 7 8 9 10 11 12 14 15 17 18 19 20
#>
#> $`2`
#> [1] 4 6 13 16
#>
partition(y, p = c(0.7, 0.3), split_into_list = FALSE, seed = 1)
#> [1] 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1
p <- c(train = 0.8, valid = 0.1, test = 0.1)
partition(y, p, seed = 1)
#> $train
#> [1] 1 2 3 5 7 8 9 10 11 12 14 15 17 18 19 20
#>
#> $valid
#> [1] 6 13
#>
#> $test
#> [1] 4 16
#>
partition(y, p, split_into_list = FALSE, seed = 1)
#> [1] train train train test train valid train train train train train train
#> [13] valid train train test train train train train
#> Levels: train valid test
partition(y, p, split_into_list = FALSE, use_names = FALSE, seed = 1)
#> [1] 1 1 1 3 1 2 1 1 1 1 1 1 2 1 1 3 1 1 1 1
partition(y, p = c(0.7, 0.3), type = "grouped")
#> $`1`
#> [1] 6 7 8 9 10 16 17 18 19 20
#>
#> $`2`
#> [1] 1 2 3 4 5 11 12 13 14 15
#>
partition(y, p = c(0.7, 0.3), type = "blocked")
#> $`1`
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14
#>
#> $`2`
#> [1] 15 16 17 18 19 20
#>
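
The returned list of row indices is typically used to subset the data. The sketch below uses a hypothetical data frame df and hard-codes the indices from the train/valid/test example output above, so that it runs without the package.

```r
# Hypothetical data set with 20 rows, matching the length of y in the examples
df <- data.frame(id = 1:20, y = rep(letters[1:4], each = 5))

# Indices copied from the train/valid/test example output above
inds <- list(
  train = c(1, 2, 3, 5, 7, 8, 9, 10, 11, 12, 14, 15, 17, 18, 19, 20),
  valid = c(6, 13),
  test = c(4, 16)
)

# One data frame per partition
parts <- lapply(inds, function(i) df[i, , drop = FALSE])
sapply(parts, nrow)
```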