dfamaral dfamaral - 28 days ago 8
R Question

Create column in R in a large database

My apologies if this question has already been answered, but I haven't found it. I'll post all my ideas to solve it. The problem is that the database is large and my PC cannot perform this calculation (core i7 and 8 GB RAM). I'm using Microsoft R Open 3.3.2 and RStudio 1.0.136.

I've trying to create a new column on a large database in R called tcm.RData (471 MB). My need is a column which divides Shape_Area by the sum of Shape_Area by COD (which I called ShapeSum). I first tried to do it in a single formula but, as it failed, I tried again in two steps with 1) summing up Shape_Area by COD and, if that succeed, to divide Shape_Area by ShapeSum.

> str(tcm)
Classes ‘data.table’ and 'data.frame': 26835293 obs. of 15 variables:
$ OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...
$ LAT : num -15.7 -15.7 -15.7 -15.7 -15.7 ...
$ LONG : num -58.1 -58.1 -58.1 -58.1 -58.1 ...
$ UF : chr "MT" "MT" "MT" "MT" ...
$ COD : num 510562 510562 510562 510562 510562 ...
$ AREA_97 : num 1130 1130 1130 1130 1130 ...
$ Shape_Area: num 255266.7 14875 25182.2 5503.9 95.5 ...
$ TYPE : chr "2" "2" "2" "2" ...
$ Nomes : chr NA NA NA NA ...
$ NEAR_DIST : num 376104 371332 371410 371592 371330 ...
$ tc_2004 : chr "AREA_URBANA" "DESFLORESTAMENTO_2004" "DESFLORESTAMENTO_2004" "DESFLORESTAMENTO_2004" ...
$ tc_2008 : chr "AREA_URBANA" "AREA_NAO_OBSERVADA" "AREA_NAO_OBSERVADA" "AREA_NAO_OBSERVADA" ...
$ tc_2010 : chr "AREA_URBANA" "PASTO_LIMPO" "PASTO_LIMPO" "PASTO_LIMPO" ...
$ tc_2012 : chr "AREA_URBANA" "PASTO_SUJO" "PASTO_SUJO" "PASTO_SUJO" ...
$ tc_2014 : chr "AREA_URBANA" "PASTO_LIMPO" "PASTO_LIMPO" "PASTO_SUJO" ...
- attr(*, ".internal.selfref")=<externalptr>

> tcm$ShapeSum <- tcm[, Shape_Area := sum(tcm$Shape_Area), by="COD"]
Error: cannot allocate vector of size 204.7 Mb
Error during wrapup: cannot allocate vector of size 542.3 Mb


I also tried the following codes, but all of them failed:

> tcm$ShapeSum <- apply(tcm[, c(Shape_Area)], 1, function(x) sum(x), by="COD")



Error in apply(tcm[, c(Shape_Area)], 1, function(x) sum(x), by = "COD") :
dim(X) must have a positive lenght


> tcm$ShapeSum <- mutate(tcm, ShapeSum = sum(Shape_Area), by="COD", package = "dplyr")



Error: cannot allocate vector of size 204.7 Mb
Error during wrapup: cannot allocate vector of size 542.3 Mb


> tcm$ShapeSum <- tcm[, transform(tcm, ShapeSum = sum(Shape_Area)), by="COD"]

> tcm$ShapeSum <- transform(tcm, aggregate(tcm$AreaShape, by=list(Category=tcm$COD), FUN=sum))



Error in aggregate.data.frame(as.data.frame(x), ...): no rows to aggregate


I thank very much for attention and for any suggestions to solve this problem.

Answer Source

We can use the data.table methods for creating the column as it is more efficient with the assignment (:=) which happens in place

library(data.table)
tcm[, ShapeSum := sum(Shape_Area), by = COD]

Or as @user20650 suggested it could be (based on the OP's description)

tcm[, ShapeSum := Shape_Area/sum(Shape_Area), by = COD]