Alexey Ferapontov - 25 days ago
R Question

R: aggregate in data.table and reuse variable

I have a data.table that I want to summarize. It looks like this:

> DF
          new_src action
 1: cdn.adnxs.com      1
 2: cdn.adnxs.com      1
 3: cdn.adnxs.com      1
 4: cdn.adnxs.com      3
 5:   s1.2mdn.net      1
 6: cdn.adnxs.com      3
 7: cdn.adnxs.com      3
 8: cdn.adnxs.com      3
 9: cdn.adnxs.com      3
10: cdn.adnxs.com      3


I want to aggregate by new_src, find the most frequent value of action within each group, calculate its frequency, print that action, and print the group total.
I can do this with ddply using table, and I can reuse the variable within ddply so I don't need to run table multiple times.
I need to do this in data.table, but there I cannot reuse the table result; I have to run table twice.

Example. This works:

library(data.table)
library(plyr)  # for ddply

DF = structure(list(
  new_src = c("cdn.adnxs.com", "cdn.adnxs.com", "cdn.adnxs.com", "cdn.adnxs.com",
              "s1.2mdn.net", "cdn.adnxs.com", "cdn.adnxs.com", "cdn.adnxs.com",
              "cdn.adnxs.com", "cdn.adnxs.com"),
  action = c("1", "1", "1", "3", "1", "3", "3", "3", "3", "3")),
  .Names = c("new_src", "action"), class = c("data.table", "data.frame"),
  row.names = c(NA, -10L))

dt = DF[1:10,by=list(new_src),list(tb = sort(table(action),decreasing=T)[1], nm = names(sort(table(action),decreasing=T)[1]),tot = .N)]
View(dt)

ddpl = ddply(DF,.(new_src),summarize,tb = sort(table(action),decreasing=T)[1], nm = names(tb), tot = length(new_src))
View(ddpl)


This doesn't.

dt = DF[1:10,by=list(new_src),list(tb = sort(table(action),decreasing=T)[1], nm = names(tb),tot = .N)]


Is it possible with data.table? Thanks

Answer

I guess you want .N here:

DF[, .N, by=.(new_src, action)][
  order(-N), .(topv = action[1], topn = N[1], n = sum(N)), by=new_src]

         new_src topv topn n
1: cdn.adnxs.com    3    6 9
2:   s1.2mdn.net    1    1 1

To handle ties, add more arguments to order(-N, ...).
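
As a rough sketch of the tie-breaking idea, the example below uses a made-up toy table (DT, with sources "a" and "b", not the data from the question) in which two actions are tied within one group. Adding action as a second key to order() is just one possible tie-break rule, assumed here for illustration:

```r
library(data.table)

# Hypothetical toy data: actions "1" and "3" are tied (2 rows each) within source "a"
DT <- data.table(
  new_src = c("a", "a", "a", "a", "b"),
  action  = c("3", "3", "1", "1", "1")
)

# Count rows per (new_src, action), then sort by descending count;
# the extra `action` key breaks ties in favour of the smaller action value
res_ties <- DT[, .N, by = .(new_src, action)][
  order(-N, action),
  .(topv = action[1], topn = N[1], n = sum(N)),
  by = new_src
]
print(res_ties)
```

With order(-N) alone, the tied action that happens to appear first in the data would be reported instead.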


Instead of chaining the by=, nesting is another option:

DF[, .SD[, .N, by=action][order(-N), c(.SD[1], .(totn = sum(N)))], by=new_src]

         new_src action N totn
1: cdn.adnxs.com      3 6    9
2:   s1.2mdn.net      1 1    1

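I find the nested version harder to follow, though, and it may be slower: in the chained version the grouped j = .N is optimized by data.table, whereas the nested version subsets .SD separately for every group.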
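
If the goal is literally to reuse an intermediate result inside j, as in the ddply call from the question, one option is to wrap j in braces, assign the intermediate to a local variable, and return a list as the last expression. This is only a sketch of that idea, not part of the approaches above; DF is rebuilt here with data.table() so the snippet runs on its own:

```r
library(data.table)

# Same ten rows as in the question, rebuilt so the example is self-contained
DF <- data.table(
  new_src = c(rep("cdn.adnxs.com", 4), "s1.2mdn.net", rep("cdn.adnxs.com", 5)),
  action  = c("1", "1", "1", "3", "1", "3", "3", "3", "3", "3")
)

res <- DF[, {
  # table() runs once per group; the result is kept in a local variable
  tb <- sort(table(action), decreasing = TRUE)[1]
  # the last expression in the braces is what j returns for the group
  list(tb = as.integer(tb), nm = names(tb), tot = .N)
}, by = new_src]
print(res)
```

This keeps the table-based logic from the question, so it does not benefit from the grouped .N optimization mentioned above.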