Null-Hypothesis Null-Hypothesis - 1 month ago 11
R Question

R aggregate columns based on a factor column's value and create a new data frame

I have following R data table:

> head(dt)
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 grp
1: 33653 2325 916 720 867 187 31 0 6 3 42 56 92 15 69 0 a-4
2: 18895 414 1116 570 1190 55 92 0 122 23 78 6 4 2 11 0 a-3
3: 1383 70 27 17 17 1 0 0 0 0 1 0 0 0 3 0 a-6
4: 396 72 34 5 18 0 0 0 0 0 0 0 0 0 0 0 a-5
5: 3915 1170 402 832 2791 316 12 5 118 51 32 9 62 27 1 0 a-3
6: 554 33 138 13 415 4 5 0 0 0 0 0 0 0 0 0 a-5


I want to create a new data frame that has column wise aggregation respect to value in the
grp
column. Based on above 6 records
row2
, and
row5
should be sum together as well as
row4
and
row6
, and new data table now have 4 rows instead of 6.

I tried to use
ddply
as follows:

> ddply(dt, numcolwise(sum))


but ended up getting following error:

Error in UseMethod("as.quoted") :
no applicable method for 'as.quoted' applied to an object of class "function"

Answer

You can do this easily with data.table:

library(data.table)
options(stringsAsFactors=F)
##
dt <- data.table(
  matrix(rep(1,96),ncol=16))
dt[,grp:=c(
  "a-4","a-3","a-6",
  "a-5","a-3","a-5")]
##
> dt[,lapply(.SD,sum),by=grp]
   grp V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1: a-4  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1
2: a-3  2  2  2  2  2  2  2  2  2   2   2   2   2   2   2   2
3: a-6  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1
4: a-5  2  2  2  2  2  2  2  2  2   2   2   2   2   2   2   2

Edit
Here's how I would try to visualize the data. I'm going to use a slightly different set of data - same structure, different numbers:

library(data.table)
library(ggplot2)
options(stringsAsFactors=F)
##
dt <- data.table(
  matrix(1:96,ncol=16))
dt[,grp:=c(
  "a-4","a-3","a-6",
  "a-5","a-3","a-5")]
##
gt <- dt[,lapply(.SD,sum),by=grp]
> gt
   grp V1 V2 V3 V4 V5 V6 V7 V8  V9 V10 V11 V12 V13 V14 V15 V16
1: a-4  1  7 13 19 25 31 37 43  49  55  61  67  73  79  85  91
2: a-3  7 19 31 43 55 67 79 91 103 115 127 139 151 163 175 187
3: a-6  3  9 15 21 27 33 39 45  51  57  63  69  75  81  87  93
4: a-5 10 22 34 46 58 70 82 94 106 118 130 142 154 166 178 190

Start by reshaping the data.table from wide format to long format:

gt_long <- reshape(
  gt,
  direction="long",
  varying=list(names(gt)[-1]),
  v.names="Value",
  idvar="grp",
  timevar="V_ID",
  times=paste0("V",1:16))
> head(gt_long)
   grp V_ID Value
1: a-4   V1     1
2: a-3   V1     7
3: a-6   V1     3
4: a-5   V1    10
5: a-4   V2     7
6: a-3   V2    19

Then you can treat the Vis as factor variables, and you have some options with ggplot2:

ggplot(
  data=gt_long,
  aes(x=V_ID,y=Value,color=grp))+
  geom_point(size=5,alpha=.75)+
  scale_colour_brewer(type="div",palette=4)

enter image description here

Or, if that is too cluttered for you:

ggplot(
  data=gt_long,
  aes(x=V_ID,y=Value,color=grp))+
  geom_point(size=4)+
  facet_grid(grp ~ .)

enter image description here

Edit 2 There might be a slightly more concise way to order the levels correctly, but this works. I made a copy of the gt_long object just so I could check that it worked without modifying the original, but you can just use the original object.

gt_long2 <- copy(gt_long)
v_levels <- unique(gt_long2$V_ID)
gt_long2[,V_ID:=factor(
  V_ID,
  levels=v_levels,
  labels=v_levels)]

I'm not going to add the plots, but I reran them with gt_long2 and it looked good.