Johnny Strings Johnny Strings - 2 months ago 11
R Question

R Sum a Column Grouped By Other Columns With Rollups

I've already written the following, which will summarize a target column from the input dataset, and includes partial sums (or rollups or whatever the preferred vernacular may be) for each of the other columns present.

This works fine but has an undesirable nested

for
loop, which I would like to remove in favor of more "functional" approach. I already attempted this, but despite more than a little reading and practice, I remain in a state of non-grokkery when it comes to the various
apply
and/or
dplyr
functions.

It may well be that everything I'm doing is wrong; e.g. the setup to prep for the loops may be unnecessary if the final solution doesn't need it, etc... basically I just want the generate the expected ouput when given the provided input...

Anyway, here's the code:

# dummy data -- assume this is given
#######################################################################
df1 <- c("AA","B","AA","B","AA","B","AA","B","AA","B","AA","B",
"M","M","N","N","M","M","N","N","M","M","N","N",
"X","X","X","X","Y","Y","Y","Y","Z","Z","Z","Z",
2,3,4,4,2,3,5,4,3,2,5,4)
dim(df1) <- c(12,4)
colnames(df1) <- c("f1","f2","f3","cnt")
df1 <- as.data.frame(df1,stringsAsFactors=F)
df1$cnt <- as.integer(df1$cnt)
#######################################################################
library(data.table)

# some hard-coded variables...
anyStr <- "(any)" # this string cannot appear in df1
targetColName <- "cnt" # name of the column being summed from df1
outputColName <- "sum" # name of our output column

# grab names of only the columns we're going after... (just do everything but the target)
colsToSummarize = (colnames(df1)[!colnames(df1) %in% list(targetColName)])

# create a data table of just the unique values for each of those columns...
df2 <- lapply(colsToSummarize, function(x) { unique(df1[,x])})
df2 <- as.data.table(df2)

# add a dummy row that basically means "any value" ...
# this string cannot otherwise be present in the data...
df2 <- rbind(df2,as.data.table(t(rep(anyStr,length(df2)))))
colnames(df2) <- c(colsToSummarize)

# expand df2 to generate all possible settings found in df1...
df2 <- unique(expand.grid(df2))
rownames(df2)<-NULL

# do all the sums... there's probably a clever way to do this using "apply" functions...
df2[,eval(outputColName)] <- 0
for (i2 in 1:nrow(df2)) {
for (i1 in 1:nrow(df1)) {
isMatch = T
for (j in colsToSummarize) {
if ((df2[i2,eval(j)]!=anyStr) & (df1[i1,eval(j)]!=df2[i2,eval(j)])) {
isMatch = F
break
}
}
if (isMatch) {
df2[i2,eval(outputColName)] = df2[i2,eval(outputColName)] + df1[i1,eval(targetColName)]
}
}
}


So, the sample dummy data looks like:

> df1
f1 f2 f3 cnt
1 AA M X 2
2 B M X 3
3 AA N X 4
4 B N X 4
5 AA M Y 2
6 B M Y 3
7 AA N Y 5
8 B N Y 4
9 AA M Z 3
10 B M Z 2
11 AA N Z 5
12 B N Z 4


... and the expected output:

> df2
f1 f2 f3 sum
1 AA M X 2
2 B M X 3
3 (any) M X 5
4 AA N X 4
5 B N X 4
6 (any) N X 8
7 AA (any) X 6
8 B (any) X 7
9 (any) (any) X 13
10 AA M Y 2
11 B M Y 3
12 (any) M Y 5
13 AA N Y 5
14 B N Y 4
15 (any) N Y 9
16 AA (any) Y 7
17 B (any) Y 7
18 (any) (any) Y 14
19 AA M Z 3
20 B M Z 2
21 (any) M Z 5
22 AA N Z 5
23 B N Z 4
24 (any) N Z 9
25 AA (any) Z 8
26 B (any) Z 6
27 (any) (any) Z 14
28 AA M (any) 7
29 B M (any) 8
30 (any) M (any) 15
31 AA N (any) 14
32 B N (any) 12
33 (any) N (any) 26
34 AA (any) (any) 21
35 B (any) (any) 20
36 (any) (any) (any) 41


Naturally, I'm OK with output that is essentially the same; (e.g. NA or blanks or whatever instead of "(any)", order of rows/columns is not important, etc...)

Aside: this is not identical to a SQL
group by with rollup
since this provides all permutations rather than a subset based on the order of variables in your
group by
clause... if someone reading this wants that subset, they would simply need to remove rows that contain unexpected "(any)" values.

Answer

HI as I do not have enough reputation I write it here and if it if useless, I will delete it.

My question is, do you know about addmargins() and have you tried it or why not using it? So using first stabs to sum up everything:

    table1 <- xtabs(cnt ~f1 + f2 + f3, data= df1)
> table1
, , f3 = X

    f2
f1   M N
  AA 2 4
  B  3 4

, , f3 = Y

    f2
f1   M N
  AA 2 5
  B  3 4

, , f3 = Z

    f2
f1   M N
  AA 3 5
  B  2 4

Then use addmargins() to calculate sums

tablle2 <- addmargins(table1)
> tablle2
, , f3 = X

     f2
f1     M  N Sum
  AA   2  4   6
  B    3  4   7
  Sum  5  8  13

, , f3 = Y

     f2
f1     M  N Sum
  AA   2  5   7
  B    3  4   7
  Sum  5  9  14

, , f3 = Z

     f2
f1     M  N Sum
  AA   3  5   8
  B    2  4   6
  Sum  5  9  14

, , f3 = Sum

     f2
f1     M  N Sum
  AA   7 14  21
  B    8 12  20
  Sum 15 26  41

finally ftable() to bring it in a nice form:

table3 <- ftable(tablle2)
> table3
        f3  X  Y  Z Sum
f1  f2                 
AA  M       2  2  3   7
    N       4  5  5  14
    Sum     6  7  8  21
B   M       3  3  2   8
    N       4  4  4  12
    Sum     7  7  6  20
Sum M       5  5  5  15
    N       8  9  9  26
    Sum    13 14 14  41

Maybe is this the output you want?

as.data.frame(table3)
    f1  f2  f3 Freq
1   AA   M   X    2
2    B   M   X    3
3  Sum   M   X    5
4   AA   N   X    4
5    B   N   X    4
6  Sum   N   X    8
7   AA Sum   X    6
8    B Sum   X    7
9  Sum Sum   X   13
10  AA   M   Y    2
11   B   M   Y    3
12 Sum   M   Y    5
13  AA   N   Y    5
14   B   N   Y    4
15 Sum   N   Y    9
16  AA Sum   Y    7
17   B Sum   Y    7
18 Sum Sum   Y   14
19  AA   M   Z    3
20   B   M   Z    2
21 Sum   M   Z    5
22  AA   N   Z    5
23   B   N   Z    4
24 Sum   N   Z    9
25  AA Sum   Z    8
26   B Sum   Z    6
27 Sum Sum   Z   14
28  AA   M Sum    7
29   B   M Sum    8
30 Sum   M Sum   15
31  AA   N Sum   14
32   B   N Sum   12
33 Sum   N Sum   26
34  AA Sum Sum   21
35   B Sum Sum   20
36 Sum Sum Sum   41
Comments