Jonathan Dunne Jonathan Dunne - 22 days ago 5
R Question

How can I split large data.frame into smaller ones without using a loop?

I have a very large dataframe (20k rows) The dataframe basically contains a date / timestamp some text and a delta between the first timestamp and subsequent time stamps.

date text time.diff
1 2016-03-09 15:50:07 Text 1 0.000
2 2016-03-09 15:50:10 Text 2 2.808
3 2016-03-09 15:50:17 Text 3 10.128
4 2016-03-09 15:50:53 Text 4 45.952
5 2016-03-09 21:26:15 Text 5 65.053


I'd like to be able to split this dataframe into smaller chunks based on values contained in time.diff (say into chunks of 60 seconds). So for example, splitting into two using subset can be done like so, but if I have a much larger frame, I will end up writing 1000's lines of code!

I could also create a loop to iterate through a much larger dataframe and accomplish this task, but I know that using loops in R is rather slow.

So I'm wondering what approach I can take to split the larger frame into many smaller frames in a way that doesn't use a loop and can also increment smaller dataframe names e.g. df.sub.1, df.sub.2 ... df.sub.3

# Split into two frames based on matched criteria
df.split1 <- subset(df.tosplit, time.diff <= 60)
df.split2 <- subset(df.tosplit, time.diff > 60)

> df.split1
date text time.diff
1 2016-03-09 15:50:07 Text 1 0.000
2 2016-03-09 15:50:10 Text 2 2.808
3 2016-03-09 15:50:17 Text 3 10.128
4 2016-03-09 15:50:53 Text 4 45.952
> df.split2
date text time.diff
5 2016-03-09 21:26:15 Text 5 65.053
6 2016-03-09 21:26:20 Text 6 85.110


I've included some sample code to create the first six lines which hopefully should be enough for folks to suggest a way forward here.

# Create Data
date <- c("2016-03-09 15:50:07", "2016-03-09 15:50:10", "2016-03-09 15:50:17" ,
"2016-03-09 15:50:53", "2016-03-09 21:26:15", "2016-03-09 21:26:20")
text <- c("Text 1", "Text 2", "Text 3", "Text 4", "Text 5", "Text 6")
time.diff <- c(0, 2.808, 10.128, 45.952, 65.053, 85.110)
df.tosplit <- data.frame(date, text, time.diff)

Answer

Using split():

split(df, paste0("df.split",df$time.diff%/%60))

$df.split0
                  dat   text time.diff
1 2016-03-09 15:50:07 Text 1     0.000
2 2016-03-09 15:50:10 Text 2     2.808
3 2016-03-09 15:50:17 Text 3    10.128
4 2016-03-09 15:50:53 Text 4    45.952

$df.split1
                  dat   text time.diff
5 2016-03-09 21:26:15 Text 5    65.053
6 2016-03-09 21:26:20 Text 6    85.110

Exotic way (see the explanation here):

list2env(split(df, paste0("df.split",df$time.diff%/%60)), .GlobalEnv)