Johnny Strings - 4 months ago 20

R Question

I've already written the following, which will summarize a target column from the input dataset, and includes partial sums (or rollups or whatever the preferred vernacular may be) for each of the other columns present.

This works fine but has an undesirable nested

`for`

`apply`

`dplyr`

It may well be that everything I'm doing is wrong; e.g. the setup to prep for the loops may be unnecessary if the final solution doesn't need it, etc... basically I just want the generate the expected ouput when given the provided input...

Anyway, here's the code:

`# dummy data -- assume this is given`

#######################################################################

df1 <- c("AA","B","AA","B","AA","B","AA","B","AA","B","AA","B",

"M","M","N","N","M","M","N","N","M","M","N","N",

"X","X","X","X","Y","Y","Y","Y","Z","Z","Z","Z",

2,3,4,4,2,3,5,4,3,2,5,4)

dim(df1) <- c(12,4)

colnames(df1) <- c("f1","f2","f3","cnt")

df1 <- as.data.frame(df1,stringsAsFactors=F)

df1$cnt <- as.integer(df1$cnt)

#######################################################################

library(data.table)

# some hard-coded variables...

anyStr <- "(any)" # this string cannot appear in df1

targetColName <- "cnt" # name of the column being summed from df1

outputColName <- "sum" # name of our output column

# grab names of only the columns we're going after... (just do everything but the target)

colsToSummarize = (colnames(df1)[!colnames(df1) %in% list(targetColName)])

# create a data table of just the unique values for each of those columns...

df2 <- lapply(colsToSummarize, function(x) { unique(df1[,x])})

df2 <- as.data.table(df2)

# add a dummy row that basically means "any value" ...

# this string cannot otherwise be present in the data...

df2 <- rbind(df2,as.data.table(t(rep(anyStr,length(df2)))))

colnames(df2) <- c(colsToSummarize)

# expand df2 to generate all possible settings found in df1...

df2 <- unique(expand.grid(df2))

rownames(df2)<-NULL

# do all the sums... there's probably a clever way to do this using "apply" functions...

df2[,eval(outputColName)] <- 0

for (i2 in 1:nrow(df2)) {

for (i1 in 1:nrow(df1)) {

isMatch = T

for (j in colsToSummarize) {

if ((df2[i2,eval(j)]!=anyStr) & (df1[i1,eval(j)]!=df2[i2,eval(j)])) {

isMatch = F

break

}

}

if (isMatch) {

df2[i2,eval(outputColName)] = df2[i2,eval(outputColName)] + df1[i1,eval(targetColName)]

}

}

}

So, the sample dummy data looks like:

`> df1`

f1 f2 f3 cnt

1 AA M X 2

2 B M X 3

3 AA N X 4

4 B N X 4

5 AA M Y 2

6 B M Y 3

7 AA N Y 5

8 B N Y 4

9 AA M Z 3

10 B M Z 2

11 AA N Z 5

12 B N Z 4

... and the expected output:

`> df2`

f1 f2 f3 sum

1 AA M X 2

2 B M X 3

3 (any) M X 5

4 AA N X 4

5 B N X 4

6 (any) N X 8

7 AA (any) X 6

8 B (any) X 7

9 (any) (any) X 13

10 AA M Y 2

11 B M Y 3

12 (any) M Y 5

13 AA N Y 5

14 B N Y 4

15 (any) N Y 9

16 AA (any) Y 7

17 B (any) Y 7

18 (any) (any) Y 14

19 AA M Z 3

20 B M Z 2

21 (any) M Z 5

22 AA N Z 5

23 B N Z 4

24 (any) N Z 9

25 AA (any) Z 8

26 B (any) Z 6

27 (any) (any) Z 14

28 AA M (any) 7

29 B M (any) 8

30 (any) M (any) 15

31 AA N (any) 14

32 B N (any) 12

33 (any) N (any) 26

34 AA (any) (any) 21

35 B (any) (any) 20

36 (any) (any) (any) 41

Naturally, I'm OK with output that is essentially the same; (e.g. NA or blanks or whatever instead of "(any)", order of rows/columns is not important, etc...)

Aside: this is not identical to a SQL

`group by with rollup`

`group by`

Answer

HI as I do not have enough reputation I write it here and if it if useless, I will delete it.

My question is, do you know about addmargins() and have you tried it or why not using it? So using first stabs to sum up everything:

```
table1 <- xtabs(cnt ~f1 + f2 + f3, data= df1)
> table1
, , f3 = X
f2
f1 M N
AA 2 4
B 3 4
, , f3 = Y
f2
f1 M N
AA 2 5
B 3 4
, , f3 = Z
f2
f1 M N
AA 3 5
B 2 4
```

Then use addmargins() to calculate sums

```
tablle2 <- addmargins(table1)
> tablle2
, , f3 = X
f2
f1 M N Sum
AA 2 4 6
B 3 4 7
Sum 5 8 13
, , f3 = Y
f2
f1 M N Sum
AA 2 5 7
B 3 4 7
Sum 5 9 14
, , f3 = Z
f2
f1 M N Sum
AA 3 5 8
B 2 4 6
Sum 5 9 14
, , f3 = Sum
f2
f1 M N Sum
AA 7 14 21
B 8 12 20
Sum 15 26 41
```

finally ftable() to bring it in a nice form:

```
table3 <- ftable(tablle2)
> table3
f3 X Y Z Sum
f1 f2
AA M 2 2 3 7
N 4 5 5 14
Sum 6 7 8 21
B M 3 3 2 8
N 4 4 4 12
Sum 7 7 6 20
Sum M 5 5 5 15
N 8 9 9 26
Sum 13 14 14 41
```

Maybe is this the output you want?

```
as.data.frame(table3)
f1 f2 f3 Freq
1 AA M X 2
2 B M X 3
3 Sum M X 5
4 AA N X 4
5 B N X 4
6 Sum N X 8
7 AA Sum X 6
8 B Sum X 7
9 Sum Sum X 13
10 AA M Y 2
11 B M Y 3
12 Sum M Y 5
13 AA N Y 5
14 B N Y 4
15 Sum N Y 9
16 AA Sum Y 7
17 B Sum Y 7
18 Sum Sum Y 14
19 AA M Z 3
20 B M Z 2
21 Sum M Z 5
22 AA N Z 5
23 B N Z 4
24 Sum N Z 9
25 AA Sum Z 8
26 B Sum Z 6
27 Sum Sum Z 14
28 AA M Sum 7
29 B M Sum 8
30 Sum M Sum 15
31 AA N Sum 14
32 B N Sum 12
33 Sum N Sum 26
34 AA Sum Sum 21
35 B Sum Sum 20
36 Sum Sum Sum 41
```