Danib90 - 1 year ago
R Question

Random stratified sampling with different proportions

I am trying to split a dataset 80/20 into training and testing sets. I want to split by location, which is a factor with 4 levels, but the levels were not sampled equally. Out of 1892 samples -

Location1: 172

Location2: 615

Location3: 603

Location4: 502

I want to split the whole dataset 80/20, as mentioned above, but I also want each location to be split 80/20, so that the training and testing sets each contain the same proportion of every location. I've seen one post about this using the stratified function from a package, but it doesn't seem to want to split my factors up.

Here is a simplified reproducible example -

x <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)

xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D")

df <- data.frame(x, xx)

validIndex <- stratified(df, "xx", size=16/nrow(df))

valid <- df[-validIndex,]

train <- df[validIndex,]

The levels A, B, C, and D of xx correspond to the four locations, in approximately the same proportions as the actual dataset (~ 10, 32, 32, and 26%, respectively).

Answer

Using bothSets=TRUE should return a list containing the split of the original data frame into training and validation sets (whose union should be the original data frame):

splt <- stratified(df, "xx", size=16/nrow(df), replace=FALSE, bothSets=TRUE)
# size is 0.8 here, so the first element is the 80% sample and the second is the remaining 20%
train <- splt[[1]]
valid <- splt[[2]]
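
For comparison, here is a minimal base R sketch of the same idea that works on row indices, which is what the question's validIndex code was attempting. It is not how stratified is implemented. The names rows_by_level, train_idx, train_base and valid_base, the seed, and the choice to round 80% of each group to the nearest whole row are all assumptions made for illustration.

```r
# Toy data from the question
x  <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)

set.seed(1)  # arbitrary seed, only so the split is reproducible

# Row numbers of df, grouped by level of xx
rows_by_level <- split(seq_len(nrow(df)), df$xx)

# Within each level, draw about 80% of that level's rows without replacement.
# Indexing via sample.int() avoids the sample() surprise when a level has one row.
train_idx <- unlist(lapply(rows_by_level, function(rows) {
  n_take <- round(0.8 * length(rows))  # assumption: round to the nearest whole row
  rows[sample.int(length(rows), n_take)]
}), use.names = FALSE)

train_base <- df[train_idx, ]
valid_base <- df[-train_idx, ]

# Per-level counts in each set
table(train_base$xx)
table(valid_base$xx)
```

With groups this small the rounding matters: level A has only two rows and round(1.6) is 2, so both of them land in the training set. On the full dataset of 1892 rows the per-location split should come out much closer to 80/20.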

## check
df2 <- as.data.frame(do.call("rbind",splt))
all.equal(df[with(df, order(xx, x)), ],
          df2[with(df2, order(xx, x)), ],