Alexander Alexander - 25 days ago 9
R Question

Summing rows after first occurance of a certain number

I would like to to get the sum of the rows after first occurrence of a certain number. In this case it is

'10'
for instance.

I though If we can know the row number after first occurrence and the ending row number of that group and we can sum in between them.

I can get the first occurrence of '10' each group but I don't know how can get the sum of the rows.

df <- data.frame(gr=rep(c(1,2),c(7,9)),
y_value=c(c(0,0,10,8,8,6,0),c(0,0,10,10,5,4,2,0,0)))

> df
gr y_value
1 1 0
2 1 0
3 1 10
4 1 8
5 1 8
6 1 6
7 1 0
8 2 0
9 2 0
10 2 10
11 2 10
12 2 5
13 2 4
14 2 2
15 2 0
16 2 0


My initial attempt is below which is not working for some reason even for grouping part:(!

library(dplyr)
df%>%
group_by(gr)%>%
mutate(check1=any(y_value==10),row_sum=which(y_value == 10)[1])


Expected output

> df
gr y_value sum_rows_range
1 1 0 22/4
2 1 0 22/4
3 1 10 22/4
4 1 8 22/4
5 1 8 22/4
6 1 6 22/4
7 1 0 22/4
8 2 0 21/6
9 2 0 21/6
10 2 10 21/6
11 2 10 21/6
12 2 5 21/6
13 2 4 21/6
14 2 2 21/6
15 2 0 21/6
16 2 0 21/6

Answer Source

A dplyr solution:

library(dplyr)
df %>%
  group_by(gr) %>%
  slice(if(any(y_value == 10)) (which.max(y_value == 10)+1):n() else row_number()) %>%
  summarize(sum = sum(y_value),
            rows = n()) %>%
  inner_join(df)

Notes:

The main idea is to slice on the rows after the first 10 occurs. any(y_value == 10)) and else row_number() are just to take care of the case where there are no 10's in y_value.

Reading the documentation for ?which.max, you will notice that when it is applied to a logical vector, in this case y_value == 10, "with both FALSE and TRUE values, which.min(x) and which.max(x) return the index of the first FALSE or TRUE, respectively, as FALSE < TRUE."

In other words, which.max(y_value == 10) will give the index of the first occurrence of 10. By adding 1 to it, I can start sliceing from the value right after the first occurrence of 10.

Result:

# A tibble: 16 × 4
      gr   sum  rows y_value
   <dbl> <dbl> <int>   <dbl>
1      1    22     4       0
2      1    22     4       0
3      1    22     4      10
4      1    22     4       8
5      1    22     4       8
6      1    22     4       6
7      1    22     4       0
8      2    21     6       0
9      2    21     6       0
10     2    21     6      10
11     2    21     6      10
12     2    21     6       5
13     2    21     6       4
14     2    21     6       2
15     2    21     6       0
16     2    21     6       0