Question

split, export and read dataset in R

I have a huge dataset, approximately 1 million rows, in which several measurements are recorded repeatedly for many objects. My dataset looks a bit like this:

ID  time  measurement1  measurement2  measurement3
 1     5            12           324           123
 1     6           123           654            45
 1     3           346           556           548
 2     2           234           345           253
 2     8            35           998           316
 2    17           515          1005           323
 2    50           156           155           616
 3    10           555           985           575
 3   219           515           358           741
 3   300           548           555           953


and so on...

I read the file into R using read.csv.


What I want to do is split the dataset and export it into individual csv files according to object ID; in other words, I want one csv file per object. Ideally I would also like to re-read these files into R, because I want to perform calculations on them.
I want to do this as an iterative process, typing only a few lines in R, so it will save me time.
Does anyone know how to do this?

Thank you

Answer

Assumptions:

  1. Initially all your data is stored in a single, large file, "mydata.csv", which you read into R as a large data frame dat;
  2. dat has been sorted by ID.

I have already checked with you that these conditions are satisfied.
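
For completeness, if dat were not sorted by ID, a minimal sketch of the preparation step could look like the following (sorting dat, then rewriting "mydata.csv" so that the row order in the file matches dat):

## only needed if `dat` is NOT already sorted by `ID`
dat <- dat[order(dat$ID), ]
## rewrite the csv so the file order matches `dat`
write.csv(dat, file = "mydata.csv", row.names = FALSE)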

I won't suggest using split, due to the size of your data frame. Instead, I suggest an iterative "read + write" procedure over all ID subsets. We can use the skip and nrows arguments of read.csv to read in only a subset at a time.


We first determine the number of cases for each ID:

x <- rle(dat$ID)  ## run-length encoding
k <- x$lengths  ## number of cases for each ID
ID <- x$values  ## ID
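
If you have not used rle before, here is a quick illustration on a toy vector:

rle(c(1, 1, 1, 2, 2, 3))
## Run Length Encoding
##   lengths: int [1:3] 3 2 1
##   values : num [1:3] 1 2 3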

Then we determine the rows where each ID chunk starts:

s <- c(0L, cumsum(k)) + 1L
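
For example, if the chunk sizes were k = c(3, 4, 3), the chunks would start at rows 1, 4 and 8 (the final element, 11, is one past the last row and is left unused):

k <- c(3, 4, 3)
c(0L, cumsum(k)) + 1L
## [1]  1  4  8 11

Note that skip = s[i] in the loop below also accounts for the header line of "mydata.csv": skipping s[i] lines skips the header plus all data rows of earlier chunks.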

We also store the column names of this data frame:

NAMES <- colnames(dat)

Then we remove the large data frame dat from the R session and perform the "read + write" iteration over all ID chunks.

rm(dat); gc()  ## remove `dat` and release memory
PATH <- getwd()  ## customize this if you want the files elsewhere

for (i in seq_along(ID)) {
  ## read in the data subset for the current `ID`
  tmp <- read.csv("mydata.csv", skip = s[i], nrows = k[i], header = FALSE)
  ## `write.csv` does not allow `col.names`, so we use `write.table`
  write.table(tmp, file = paste0(PATH,"/",ID[i],".csv"), row.names = FALSE,
              sep = ",", col.names = NAMES)
}

In each iteration, only a subset is read into the R session, so the memory footprint is greatly reduced. You might want to customize the file path PATH; at the moment all files are stored in your working directory, getwd().
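
As for re-reading an individual file later for your calculations, a minimal sketch could be the following (the colMeans call is just a placeholder for whatever computation you actually need):

for (i in seq_along(ID)) {
  ## read back the file for one `ID` (the header was written by `write.table` above)
  tmp <- read.csv(paste0(PATH, "/", ID[i], ".csv"))
  ## perform your calculations on `tmp`, e.g.
  print(colMeans(tmp[-1]))  ## placeholder: column means of all but the `ID` column
}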


Let's do a small test:

## example data frame
dat <- data.frame(ID = rep(1:3, each = 5), val = sample(15))
NAMES <- colnames(dat)
x <- rle(dat$ID)
k <- x$lengths
ID <- x$values
s <- c(0L, cumsum(k)) + 1L

## store it into a csv file
write.csv(dat, file = "mydata.csv", row.names = FALSE)

## iterative "read + write"
PATH <- getwd()
for (i in seq_along(ID)) {
  tmp <- read.csv("mydata.csv", skip = s[i], nrows = k[i], header = FALSE)
  write.table(tmp, file = paste0(PATH,"/",ID[i],".csv"), row.names = FALSE,
              sep = ",", col.names = NAMES)
}

After execution, I see three files, "1.csv", "2.csv" and "3.csv", in my working directory, each with the desired data. So everything works as intended.