user2575429 - 28 days ago 4x
R Question

# Efficient method to filter and add based on certain conditions (3 conditions in this case)

I have a data frame which looks like this

``````     a    b    c   d
1    1    1   0
1    1    1   200
1    1    1   300
1    1    2   0
1    1    2   600
1    2    3   0
1    2    3   100
1    2    3   200
1    3    1   0
``````

I have a data frame which looks like this

``````     a    b    c   d
1    1    1   250
1    1    2   600
1    2    3   150
1    3    1   0
``````

I am currently doing it
{

``````  n=nrow(subset(Wallmart, a==i &    b==j & c==k  ))
sum=subset(Wallmart, a==i &    b==j & c==k  )
#sum
sum1=append(sum1,sum(sum\$d)/(n-1))
``````

}

I would like to add the 'd' coloumn and take the average by counting the number of rows without counting 0. For example the first row is (200+300)/2 = 250.
Currently I am building a list that stores the 'd' coloumn but ideally I want it in the format above. For example first row would look like

``````     a    b    c   d
1    1    1   250
``````

This is a very inefficient way to do this work. The code takes a long time to run in a loop.
so any help is appreciated that makes it run faster. The original data frame has about a million rows.

You may try `aggregate`:

``````aggregate(d ~ a + b + c, data = df, sum)
#   a b c   d
# 1 1 1 1 500
# 2 1 3 1   0
# 3 1 1 2 600
# 4 1 2 3 300
``````

As noted by @Roland, for bigger data sets, you may try `data.table` or `dplyr` instead, e.g.:

``````library(dplyr)
df %>%
group_by(a, b, c) %>%
summarise(
sum_d = sum(d))

# Source: local data frame [4 x 4]
# Groups: a, b
#
#   a b c sum_d
# 1 1 1 1   500
# 2 1 1 2   600
# 3 1 2 3   300
# 4 1 3 1     0
``````

Edit following updated question. If you want to calculate group-wise mean, excluding rows that are zero, you may try this:

``````aggregate(d ~ a + b + c, data = df, function(x) mean(x[x > 0]))
#   a b c   d
# 1 1 1 1 250
# 2 1 3 1 NaN
# 3 1 1 2 600
# 4 1 2 3 150

df %>%
filter(d != 0) %>%
group_by(a, b, c) %>%
summarise(
mean_d = mean(d))

#   a b c mean_d
# 1 1 1 1    250
# 2 1 1 2    600
# 3 1 2 3    150
``````

However, because it seems that you wish to treat your zeros as missing values rather than numeric zeros, I think it would be better to convert them to `NA` when preparing your data set, before the calculations.

``````df\$d[df\$d == 0] <- NA
df %>%
group_by(a, b, c) %>%
summarise(
mean_d = mean(d, na.rm = TRUE))

#   a b c mean_d
# 1 1 1 1    250
# 2 1 1 2    600
# 3 1 2 3    150
# 4 1 3 1    NaN
``````