I have a column containing about 5-10 instances each of a few hundred thousand distinct strings. I want to count the occurrences of each string and write that count into every row where the string appears. So I do:
newdf <- ddply(oldDF, ~BigVariable, transform, counts = length(BigVariable))
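A minimal reproducible version of that call, on toy data I've made up here (the real data has a few hundred thousand groups):

```r
library(plyr)

# Toy stand-in for the real data frame
oldDF <- data.frame(BigVariable = c("x", "x", "y"),
                    stringsAsFactors = FALSE)

# For each group of BigVariable, attach the group size to every row in it
newdf <- ddply(oldDF, ~BigVariable, transform, counts = length(BigVariable))
# newdf$counts is c(2, 2, 1)
```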
That works fine until the data approaches 1 million rows / a 1 GB file.
My R session crashes with a fatal error every time on a dataset of this size or larger. With 28 GB of free memory I don't see why that should be a problem, but my understanding from this thread
is that ddply can sometimes be a memory hog.
I'm fairly sure it is a memory issue, though: just before the crash, my system monitor shows modified memory and in-use memory fighting over the remaining free memory until the green bar takes the last bit, and R crashes at that same moment.
- I have done the obvious and switched to data.table for more efficient memory use.
- The offending variable has a massive number of factor levels. I tried converting it to character in the hope that data.table would handle it better, but no dice.
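For reference, the data.table version I tried looks roughly like this (a sketch on made-up toy data; the real table is far larger). Grouped assignment with `:=` and `.N` adds the count by reference, without copying the table or reordering rows:

```r
library(data.table)

# Toy stand-in for the real table
dt <- data.table(BigVariable = c("a", "a", "b"))

# .N is the number of rows in each by-group; := assigns in place
dt[, counts := .N, by = BigVariable]
# dt$counts is c(2, 2, 1), with the original row order preserved
```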
What else should I try? Is there a more memory efficient way to get these counts and put them in all the appropriate rows?