lolibility lolibility - 11 days ago 6
R Question

Speeding up the performance of write.table

I have a

data.frame
and I want to write it out. The dimensions of my
data.frame
are 256 rows by 65536 columns. What are faster alternatives to
write.csv
?

Answer

If all of your columns are of the same class, convert to a matrix before writing out, provides a nearly 6x speed up. Also, you can look into using write.matrix() from package MASS, though it did not prove faster for this example. Maybe I didn't set something up properly:

#Fake data
m <- matrix(runif(256*65536), nrow = 256)
#AS a data.frame
system.time(write.csv(as.data.frame(m), "dataframe.csv"))
#----------
#   user  system elapsed 
# 319.53   13.65  333.76 

#As a matrix
system.time(write.csv(m, "matrix.csv"))
#----------
#   user  system elapsed 
#  52.43    0.88   53.59 

#Using write.matrix()
require(MASS)
system.time(write.matrix(m, "writematrix.csv"))
#----------
#   user  system elapsed 
# 113.58   59.12  172.75 

EDIT

To address the concern raised below that the results above are not fair to data.frame, here are some more results and timing to show that the overall message is still "convert your data object to a matrix if possible. If not possible, deal with it. Alternatively, reconsider why you need to write out a 200MB+ file in CSV format if the timing is of the utmost importance":

#This is a data.frame
m2 <- as.data.frame(matrix(runif(256*65536), nrow = 256))
#This is still 6x slower
system.time(write.csv(m2, "dataframe.csv"))
#   user  system elapsed 
# 317.85   13.95  332.44
#This even includes the overhead in converting to as.matrix in the timing 
system.time(write.csv(as.matrix(m2), "asmatrix.csv"))
#   user  system elapsed 
#  53.67    0.92   54.67 

So, nothing really changes. To confirm this is reasonable, consider the relative time costs of as.data.frame():

m3 <- as.matrix(m2)
system.time(as.data.frame(m3))
#   user  system elapsed 
#   0.77    0.00    0.77 

So, not really a big deal or skewing information as much as the comment below would believe. If you're still not convinced that using write.csv() on large data.frames is a bad idea performance wise, consult the manual under the Note:

write.table can be slow for data frames with large numbers (hundreds or more) of
columns: this is inevitable as each column could be of a different class and so must be
handled separately. If they are all of the same class, consider using a matrix instead.

Finally, consider moving to a native RData object if you're still losing sleep over saving things faster

system.time(save(m2, file = "thisisfast.RData"))
#   user  system elapsed 
#  21.67    0.12   21.81
Comments