user2575429 - 2 months ago 6

R Question

I have a data frame which looks like this

`a b c d`

1 1 1 0

1 1 1 200

1 1 1 300

1 1 2 0

1 1 2 600

1 2 3 0

1 2 3 100

1 2 3 200

1 3 1 0

I have a data frame which looks like this

`a b c d`

1 1 1 250

1 1 2 600

1 2 3 150

1 3 1 0

I am currently doing it

{

`n=nrow(subset(Wallmart, a==i & b==j & c==k ))`

sum=subset(Wallmart, a==i & b==j & c==k )

#sum

sum1=append(sum1,sum(sum$d)/(n-1))

}

I would like to add the 'd' coloumn and take the average by counting the number of rows without counting 0. For example the first row is (200+300)/2 = 250.

Currently I am building a list that stores the 'd' coloumn but ideally I want it in the format above. For example first row would look like

`a b c d`

1 1 1 250

This is a very inefficient way to do this work. The code takes a long time to run in a loop.

so any help is appreciated that makes it run faster. The original data frame has about a million rows.

Answer

You may try `aggregate`

:

```
aggregate(d ~ a + b + c, data = df, sum)
# a b c d
# 1 1 1 1 500
# 2 1 3 1 0
# 3 1 1 2 600
# 4 1 2 3 300
```

As noted by @Roland, for bigger data sets, you may try `data.table`

or `dplyr`

instead, e.g.:

```
library(dplyr)
df %>%
group_by(a, b, c) %>%
summarise(
sum_d = sum(d))
# Source: local data frame [4 x 4]
# Groups: a, b
#
# a b c sum_d
# 1 1 1 1 500
# 2 1 1 2 600
# 3 1 2 3 300
# 4 1 3 1 0
```

**Edit** following updated question.
If you want to calculate group-wise mean, excluding rows that are zero, you may try this:

```
aggregate(d ~ a + b + c, data = df, function(x) mean(x[x > 0]))
# a b c d
# 1 1 1 1 250
# 2 1 3 1 NaN
# 3 1 1 2 600
# 4 1 2 3 150
df %>%
filter(d != 0) %>%
group_by(a, b, c) %>%
summarise(
mean_d = mean(d))
# a b c mean_d
# 1 1 1 1 250
# 2 1 1 2 600
# 3 1 2 3 150
```

However, because it seems that you wish to treat your zeros as missing values rather than numeric zeros, I think it would be better to convert them to `NA`

when preparing your data set, *before* the calculations.

```
df$d[df$d == 0] <- NA
df %>%
group_by(a, b, c) %>%
summarise(
mean_d = mean(d, na.rm = TRUE))
# a b c mean_d
# 1 1 1 1 250
# 2 1 1 2 600
# 3 1 2 3 150
# 4 1 3 1 NaN
```