user5750238 - 6 months ago 36

R Question

So my task is to break a dataframe of 506 observations into ten different samples of training and test sets (with replacement).

I'm doing this so I can put it through a model and see the average MSE over ten samples.

Thus far, I've got the following idiotically complicated for loop:

```
temp_train <- setNames(lapply(1:10, function(x) {x <- homeprices[sample(1:nrow(homeprices), .8*n, replace = FALSE), ]; x }), paste0("tr_sample.", 1:10))

for (i in 1:length(temp_train)) {
  assign(paste0("df_train_", i), as.data.frame(temp_train[i]))
  name <- assign(paste('df_train_', i, sep=''), x[i])
  temp_test <- setNames(homeprices[-name], paste0("te_sample.", 1:10))
  alpha <- assign(paste0("df_test_", i), as.data.frame(temp_test[i]))
}
```

This loop produces, say, df_test_2, which is a data frame of 506 observations of one variable. It SHOULD be a data frame of 102 observations of 13 variables, namely the 102 observations that are NOT in df_train_2.

My question, therefore, is: what's a better way to do this that actually works? I would prefer not to install any packages if possible, since I want to get a grasp of base R.

Answer

A common (and efficient) strategy for handling this type of task in base R is not to create each individual data frame, but to simply create a set of indices that define the partition.

For example,

```
x <- replicate(n = 10,expr = {sample(506,404)})
```

creates a matrix where each of the ten columns holds the row indices of a random selection of 404 rows (about 80% of 506). You would then loop over the columns of `x`, using each one to select the training subset of your data that you pass to your model. Negative indexing with the same indices yields the remaining 20% (102 rows) for testing.

This way you don't have tons of copies of data frames lying about.
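To make the loop concrete, here is a minimal base R sketch of the index-matrix approach. Since the real `homeprices` data isn't shown, it builds a hypothetical stand-in with invented column names (`rooms`, `age`, `price`), and it assumes a plain `lm()` fit as the model; substitute your own data and model.

```r
set.seed(1)  # only so this sketch is reproducible

# Hypothetical stand-in for `homeprices` (506 rows); column names are invented
homeprices <- data.frame(rooms = rnorm(506, mean = 6),
                         age   = runif(506, 0, 100))
homeprices$price <- 5 * homeprices$rooms - 0.05 * homeprices$age + rnorm(506)

n       <- nrow(homeprices)
n_train <- floor(0.8 * n)  # 404 when n is 506

# Each column holds the training row indices for one of the ten splits
idx <- replicate(10, sample(n, n_train))

mse <- numeric(ncol(idx))
for (i in seq_len(ncol(idx))) {
  train <- homeprices[idx[, i], ]
  test  <- homeprices[-idx[, i], ]       # negative indexing: the remaining rows
  fit   <- lm(price ~ ., data = train)   # assumed model; swap in your own
  pred  <- predict(fit, newdata = test)
  mse[i] <- mean((test$price - pred)^2)
}

mean(mse)  # average test MSE over the ten splits
```

Only the index matrix and one train/test pair exist at any time, so nothing needs to be created with `assign()` or looked up by a pasted-together name.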