user5750238, 4 years ago
R Question

R take ten unique samples and break into training/test sets?

So my task is to break a data frame of 506 observations into ten different samples of training and test sets (with replacement).
I'm doing this so I can fit a model on each sample and look at the average MSE over the ten samples.
Thus far, I've got the following idiotically complicated for loop:

temp_train <- setNames(lapply(1:10, function(x) {
    x <- homeprices[sample(1:nrow(homeprices), .8*n, replace = FALSE), ]; x
}), paste0("tr_sample.", 1:10))
for (i in 1:length(temp_train)) {
    assign(paste0("df_train_", i),[i]))
    name <- assign(paste('df_train_', i, sep=''), x[i])
    temp_test <- setNames(homeprices[-name], paste0("te_sample.", 1:10))
    alpha <- assign(paste0("df_test_", i),[i]))
}

This for loop produces, say, df_test_2, which is a data frame of 506 observations of one variable. It SHOULD be a data frame of 102 observations of 13 variables, namely the 102 observations that are NOT in df_train_2.
My question, therefore, is: what's a better way to do this that actually works? I would prefer not to install any packages if possible, since I want to get a grasp of base R.

Answer

A common (and efficient) strategy for handling this type of task in base R is not to create each individual data frame, but to simply create a set of indices that define the partition.

For example,

x <- replicate(n = 10, expr = sample(506, 404))

creates a matrix where each of the ten columns is filled with the row indices of a random selection of 404 rows (80% or so of 506). Then you'd loop through your model fitting and use the columns of x to select the training subset of your data that you pass to your model. Negative indexing of the same indices would yield the corresponding 20% for testing.

This way you don't have tons of copies of data frames lying about.
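To make the index-matrix idea concrete, here is a minimal base R sketch of the whole loop. The asker's homeprices data isn't available here, so the data frame below is simulated, its column names (x1, x2, price) are made up, and lm() is only a placeholder for whatever model you actually want to fit.

```r
set.seed(1)  # only so the sketch is reproducible

# ASSUMPTION: simulated stand-in for the asker's `homeprices` (506 rows).
# The columns x1, x2 and price are hypothetical.
n <- 506
homeprices <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
homeprices$price <- 3 + 2 * homeprices$x1 - homeprices$x2 + rnorm(n)

# Each column of `idx` holds the training row indices for one split.
n_train <- floor(0.8 * n)                  # 404 rows for training
idx <- replicate(10, sample(n, n_train))   # 404 x 10 matrix of row indices

mse <- numeric(ncol(idx))
for (i in seq_len(ncol(idx))) {
  train <- homeprices[idx[, i], ]
  test  <- homeprices[-idx[, i], ]         # negative indexing: rows NOT in training
  fit   <- lm(price ~ ., data = train)     # ASSUMPTION: placeholder model
  pred  <- predict(fit, newdata = test)
  mse[i] <- mean((test$price - pred)^2)    # test MSE for this split
}

mean(mse)  # average test MSE over the ten splits
```

Note the trailing comma in homeprices[idx[, i], ]: it selects rows and keeps every column. Without it, a data frame is indexed by column, which is one way to end up with all 506 rows and the wrong set of variables.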
