R Question

ddply memory requirements, solutions

I have a column containing roughly 5-10 instances each of a few hundred thousand different strings. I want to count the occurrences of each string and put the counts into the corresponding rows. So I do:

library(plyr)
newdf <- ddply(oldDF, ~BigVariable, transform, counts = length(BigVariable))


That works fine until I approach 1 million rows / a 1 GB file. With a dataset of that size or larger, my R session crashes with a fatal error every time. With 28 GB of free memory I don't see why that should be a problem, but my understanding from this thread is that ddply can sometimes be a memory hog.

I'm fairly sure it is a memory issue, though: just before the crash my system monitor shows modified and in-use memory fighting over the free memory until the green (in-use) bar takes the last of it, and R crashes at that same moment.


  • I have done the obvious and switched to data.table for more memory-efficient handling of the data.

  • The grouping variable has a massive number of factor levels. I tried converting it to character in the hope that data.table would handle it better, but no dice (a sketch of that attempt is below).
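
For reference, the conversion I tried looked roughly like this (a minimal sketch; oldDF and BigVariable are the same placeholder names as above):

library(data.table)
setDT(oldDF)                                        # convert to data.table by reference
oldDF[, BigVariable := as.character(BigVariable)]   # factor -> character, in place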



What else should I try? Is there a more memory-efficient way to get these counts and put them in all the appropriate rows?

Answer

dplyr is a better alternative to ddply, as it is generally faster and more memory-efficient:

library(dplyr)
newdf <- oldDF %>%
     group_by(BigVariable) %>%
     mutate(counts = n())   # n() is the number of rows in the current group
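
The reason this helps: ddply splits the data frame into a separate copy for every group before recombining, and with a few hundred thousand groups those intermediate copies add up, whereas dplyr computes the grouped count without materialising them. If your dplyr version is recent enough, add_count() is a one-line shorthand for the same result (note the new column is named n rather than counts by default):

newdf <- add_count(oldDF, BigVariable)   # adds a column n with the per-group count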

Or with data.table:

library(data.table)
setDT(oldDF)[, counts := .N, by = BigVariable]   # .N is the row count of each group
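
setDT() converts oldDF to a data.table by reference and := adds the counts column in place, so no copy of the data is made; at this scale that is usually the most memory-friendly option.

# oldDF itself now carries the counts column; no separate newdf copy is needed
head(oldDF)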