Danib90 - 1 year ago 142
R Question

# Random stratified sampling with different proportions

I am trying to split a dataset 80/20 into training and testing sets. I want to split by location, which is a factor with 4 levels, but the levels have not been sampled equally. Out of 1892 samples:

Location1: 172

Location2: 615

Location3: 603

Location4: 502

As mentioned above, I want to split the whole dataset 80/20, but I also want each location to be split 80/20 so that the training and testing sets each get the same proportion of every location. I've seen one post that does this with the `stratified` function from the `splitstackshape` package, but it doesn't seem to want to split my factor levels up.

Here is a simplified reproducible example -

`x <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)`

`xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D")`

`df <- data.frame(x, xx)`

`validIndex <- stratified(df, "xx", size=16/nrow(df))`

`valid <- df[-validIndex,]`

`train <- df[validIndex,]`

where `A`, `B`, `C`, `D` correspond to the factor levels, in approximately the same proportions as in the actual dataset (~10, 32, 32, and 26%, respectively).

Setting `bothSets = TRUE` should make `stratified` return a list containing both parts of the split, the sampled rows and the remaining rows, whose union should be the original data frame:

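```r
library(splitstackshape)

# size = 16/nrow(df) = 0.8, so the first element is the ~80% sample
# and the second element holds the rows that were not sampled
splt <- stratified(df, "xx", size = 16/nrow(df), replace = FALSE, bothSets = TRUE)
train <- splt[[1]]  # sampled rows (~80% of each level of xx)
valid <- splt[[2]]  # remaining rows

## check: the two sets stacked together should match the original data
df2 <- as.data.frame(do.call("rbind", splt))
all.equal(df[with(df, order(xx, x)), ],
          df2[with(df2, order(xx, x)), ],
          check.names = FALSE)
```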
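If you would rather work with row indices, as the code in the question tries to do, here is a minimal base R sketch of the same idea. It is not what `stratified` does internally, only an illustration; the names `idx_by_group` and `train_idx`, the seed, and the use of `round()` for the group sizes are my own assumptions.

```r
# Same toy data as in the question
x  <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C",
        "C", "C", "C", "C", "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)

set.seed(42)  # arbitrary seed, only so the split is reproducible

# Row indices grouped by level of xx
idx_by_group <- split(seq_len(nrow(df)), df$xx)

# Within each group, draw round(0.8 * group size) indices without replacement.
# Indexing via sample.int() avoids the sample() surprise for length-1 groups.
train_idx <- unlist(lapply(idx_by_group, function(i) {
  i[sample.int(length(i), size = round(0.8 * length(i)))]
}), use.names = FALSE)

train <- df[train_idx, ]
valid <- df[-train_idx, ]

# Per-group counts in each set
table(train$xx)
table(valid$xx)
```

With groups this small the rounding matters: `A` has only two rows, and `round(0.8 * 2)` is 2, so both land in `train` and none in `valid`. On the real data (172 to 615 rows per location) each location should come out very close to 80/20.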