Split Data into Partitions

This function provides row indices for data splitting, e.g., to split data into training, validation, and test sets. Different split strategies are supported, see Details. The partition indices are returned either as a list with one element per partition (the default) or as a vector of partition IDs.

partition(
  y,
  p,
  type = c("stratified", "basic", "grouped", "blocked"),
  n_bins = 10L,
  split_into_list = TRUE,
  use_names = TRUE,
  shuffle = FALSE,
  seed = NULL
)

Arguments

y: For type = "stratified" or type = "grouped", the variable used to stratify or to group the split. For other split types, any vector of the same length as the data to be split.
p: A vector with split probabilities per partition, e.g., c(train = 0.7, valid = 0.3). Names are passed to the output.
type: Split type. One of "stratified" (default), "basic", "grouped", "blocked".
n_bins: Approximate number of bins for numeric y (only for type = "stratified").
split_into_list: Should the resulting partition vector be split into a list? Default is TRUE.
use_names: Should names of p be used as partition names? Default is TRUE.
shuffle: Should row indices be randomly shuffled within each partition? Default is FALSE. Shuffling is only possible when split_into_list = TRUE.
seed: Integer random seed.

Value

A list with row indices per partition (if split_into_list = TRUE) or a vector of partition IDs.

Details

By default, the function uses stratified splitting, which balances the partitions as well as possible with respect to the distribution of the input vector y. (Numeric input is first binned into approximately n_bins quantile groups.) If type = "grouped", groups specified by y are kept together when splitting, which is relevant for clustered or panel data. In contrast to basic splitting, type = "blocked" does not sample indices at random but keeps them in consecutive blocks: e.g., with p = c(0.8, 0.2), the first 80% of observations form the training set and the remaining 20% are used for testing.
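
The grouped and blocked strategies described above can be illustrated with a minimal base R sketch. The helpers grouped_sketch() and blocked_sketch() are hypothetical names introduced here for illustration only; they are not part of the package and make no claim to reproduce how partition() is implemented or which indices it returns. In particular, the sketch assumes that a grouped split may assign whole groups to partitions at random and that a blocked split may round partition sizes, with the last partition taking the remainder.

```r
# Hypothetical helper: assign whole groups (not single rows) to partitions.
grouped_sketch <- function(g, p, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  groups <- unique(g)
  # Draw one partition ID per distinct group, with probabilities p.
  part_of_group <- sample(seq_along(p), length(groups), replace = TRUE, prob = p)
  # Every row inherits the partition ID of its group.
  ids <- part_of_group[match(g, groups)]
  split(seq_along(g), factor(ids, levels = seq_along(p)))
}

# Hypothetical helper: cut the row sequence into consecutive blocks.
blocked_sketch <- function(n, p) {
  k <- length(p)
  # Rounded partition sizes; the last partition absorbs the remainder.
  sizes <- round(n * p / sum(p))
  sizes[k] <- n - sum(sizes[-k])
  labels <- if (is.null(names(p))) as.character(seq_len(k)) else names(p)
  ids <- factor(rep(seq_len(k), times = sizes), levels = seq_len(k), labels = labels)
  split(seq_len(n), ids)
}

y <- rep(letters[1:4], each = 5)
grouped <- grouped_sketch(y, p = c(0.7, 0.3), seed = 1)
blocked <- blocked_sketch(length(y), p = c(train = 0.7, test = 0.3))
```

In the grouped sketch, no value of y appears in more than one partition, at the cost of partition sizes that can deviate strongly from p when there are only a few groups. The blocked sketch involves no randomness at all, so it preserves the original row order, which matters, for example, when rows are sorted by time.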

Examples

y <- rep(c(letters[1:4]), each = 5)
partition(y, p = c(0.7, 0.3), seed = 1)
#> $`1`
#>  [1]  1  2  3  5  7  8  9 10 11 12 14 15 17 18 19 20
#> 
#> $`2`
#> [1]  4  6 13 16
#> 
partition(y, p = c(0.7, 0.3), split_into_list = FALSE, seed = 1)
#>  [1] 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1
p <- c(train = 0.8, valid = 0.1, test = 0.1)
partition(y, p, seed = 1)
#> $train
#>  [1]  1  2  3  5  7  8  9 10 11 12 14 15 17 18 19 20
#> 
#> $valid
#> [1]  6 13
#> 
#> $test
#> [1]  4 16
#> 
partition(y, p, split_into_list = FALSE, seed = 1)
#>  [1] train train train test  train valid train train train train train train
#> [13] valid train train test  train train train train
#> Levels: train valid test
partition(y, p, split_into_list = FALSE, use_names = FALSE, seed = 1)
#>  [1] 1 1 1 3 1 2 1 1 1 1 1 1 2 1 1 3 1 1 1 1
partition(y, p = c(0.7, 0.3), type = "grouped")
#> $`1`
#>  [1]  6  7  8  9 10 16 17 18 19 20
#> 
#> $`2`
#>  [1]  1  2  3  4  5 11 12 13 14 15
#> 
partition(y, p = c(0.7, 0.3), type = "blocked")
#> $`1`
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14
#> 
#> $`2`
#> [1] 15 16 17 18 19 20
#>
