lolibility lolibility - 9 months ago 45
R Question

Speeding up the performance of write.table

I have a data.frame and I want to write it out. The dimensions of my data.frame are 256 rows by 65536 columns. What are faster alternatives to write.table?

Answer Source

If all of your columns are of the same class, convert to a matrix before writing out; this provides a nearly 6x speed-up. You can also look into write.matrix() from package MASS, though it did not prove faster for this example. Maybe I didn't set something up properly:

# Fake data
m <- matrix(runif(256*65536), nrow = 256)
# As a data.frame (the conversion is included in the timing)
system.time(write.csv(as.data.frame(m), "dataframe.csv"))
#   user  system elapsed 
# 319.53   13.65  333.76 

#As a matrix
system.time(write.csv(m, "matrix.csv"))
#   user  system elapsed 
#  52.43    0.88   53.59 

# Using write.matrix() from MASS
# (its default separator is a space; pass sep = "," for real CSV output)
library(MASS)
system.time(write.matrix(m, "writematrix.csv"))
#   user  system elapsed 
# 113.58   59.12  172.75 
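
The full-size run above takes minutes, so here is a scaled-down sketch of the same comparison for a quick check. The dimensions (64 x 2048) and the names m_small, df_small, f_df and f_mat are assumptions chosen for illustration, not part of the original benchmark. Timings will vary by machine, and at this size the gap may be smaller than 6x.

```r
# Scaled-down version of the benchmark (hypothetical sizes, chosen to run quickly)
set.seed(1)
m_small  <- matrix(runif(64 * 2048), nrow = 64)
df_small <- as.data.frame(m_small)

# Write to temporary files so nothing is left in the working directory
f_df  <- tempfile(fileext = ".csv")
f_mat <- tempfile(fileext = ".csv")

t_df  <- system.time(write.csv(df_small, f_df))
t_mat <- system.time(write.csv(m_small, f_mat))

# Show user, system and elapsed time side by side
print(rbind(data.frame = t_df[1:3], matrix = t_mat[1:3]))

# Read both files back to check that they hold the same values
back_df  <- as.matrix(read.csv(f_df, row.names = 1))
back_mat <- as.matrix(read.csv(f_mat, row.names = 1))
```

Comparing back_df and back_mat with all.equal() is a cheap way to confirm that going through a matrix did not change what was written.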


To address the concern raised below that the results above are not fair to data.frame, here are some more results and timings. The overall message is unchanged: convert your data object to a matrix if possible. If that is not possible, deal with it. Alternatively, reconsider why you need to write out a 200MB+ file in CSV format if timing is of the utmost importance:

#This is a data.frame
m2 <- as.data.frame(matrix(runif(256*65536), nrow = 256))
#This is still 6x slower
system.time(write.csv(m2, "dataframe.csv"))
#   user  system elapsed 
# 317.85   13.95  332.44
# This even includes the overhead of the as.matrix() conversion in the timing
system.time(write.csv(as.matrix(m2), "asmatrix.csv"))
#   user  system elapsed 
#  53.67    0.92   54.67 

So, nothing really changes. To confirm this is reasonable, consider the relative time cost of as.matrix():

system.time(m3 <- as.matrix(m2))
#   user  system elapsed 
#   0.77    0.00    0.77 

So, the conversion is not a big deal, and it does not skew the results as much as the comment below suggests. If you're still not convinced that using write.csv() on large data.frames is a bad idea performance-wise, consult the Note in the manual:

write.table can be slow for data frames with large numbers (hundreds or more) of
columns: this is inevitable as each column could be of a different class and so must be
handled separately. If they are all of the same class, consider using a matrix instead.
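
The condition in that Note ("all of the same class") can be checked before converting. The following is a minimal sketch, assuming two hypothetical helpers, all_same_class() and write_csv_fast(), that are not part of base R or of the original answer. It goes through a matrix only when every column has the same class and otherwise falls back to the plain data.frame path.

```r
# Hypothetical helper: TRUE when every column of df has the same class vector
all_same_class <- function(df) {
  classes <- lapply(df, class)
  length(unique(classes)) <= 1L
}

# Hypothetical wrapper: write via a matrix when that is safe,
# otherwise write the data.frame directly
write_csv_fast <- function(df, file, ...) {
  if (all_same_class(df)) {
    write.csv(as.matrix(df), file, ...)
  } else {
    write.csv(df, file, ...)
  }
}

numeric_df <- data.frame(a = c(1.5, 2.5), b = c(3, 4))
mixed_df   <- data.frame(a = c(1.5, 2.5), b = c("x", "y"))

all_same_class(numeric_df)  # both columns are numeric
all_same_class(mixed_df)    # numeric plus a non-numeric column
```

The check is deliberately conservative: a mix of integer and double columns counts as different classes and takes the slower path. For columns such as factors or dates, as.matrix() produces a character matrix, so it is worth confirming the output is what you expect before relying on the shortcut.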

Finally, consider moving to a native RData object if you're still losing sleep over saving things faster:

system.time(save(m2, file = "thisisfast.RData"))
#   user  system elapsed 
#  21.67    0.12   21.81
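
If you go the RData route, it helps to confirm that the object comes back intact. Below is a small round-trip sketch using save() and load() on a deliberately tiny data.frame; the names df_small, rdata_file and restored are assumptions for illustration. Keep in mind that the result is a binary file meant for R, not something other tools can read the way they read CSV.

```r
# Small stand-in for the large data.frame (hypothetical size)
df_small <- as.data.frame(matrix(runif(16 * 32), nrow = 16))

# Save to a temporary .RData file
rdata_file <- tempfile(fileext = ".RData")
save(df_small, file = rdata_file)

# Load into a separate environment so the original object is not overwritten
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)

# load() returns the names of the objects it restored
print(loaded_names)

# Compare the restored copy with the original
print(all.equal(restored$df_small, df_small))
```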