sharp - 1 year ago 50
R Question

Find a cumulative sum of one column until a conditional sum on another column is met

I would like to find the preceding cumsum (i.e. cumsum minus the present row) for those rows of column B until the sum of the previous rows of column A including present row is <= 7.

I was able to find an answer using a traditional for loop. A vectorized implementation would be very helpful as I need to run it on a large dataset. Sharing my simple code in case it helps.

``````dt <- data.frame(A = c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1, 2),
B = c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3, 1),
Ans = c(0, 1, 1, 4, 0, 0, 0, 2, 3, 5, 6),
new=rep(0,11))

dt3 <- dt
for (i in 2:nrow(dt3)){
set<-0
count<-0
k=i-1
for (j in k:1){
count=count+dt3\$A[j+1]
if(count<=7){
set<-set+dt3\$B[j]
if(j==1){
dt3\$new[i]=set
}
}
else{
dt3\$new[i]=set
}
}
}
``````

Here are the 3 conditions to be satisfied:

1. If A > 7, then Ans resets to 0

2. If cumsum(A)<=7, then Ans is cumsum() of lagB

3. If cumsum(A) > 7, then Ans is cumsum() of lagB for the range of previous rows of A for which the sum is <=7

Here is a simplified version of the data (Column A and B) and the desired output is the Column Ans:

``````dt <- data.frame(A = c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1, 2),
B = c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3, 1),
Ans = c(0, 1, 1, 4, 0, 0, 0, 2, 3, 5, 6))

dt
A B Ans   Reason for value in Ans:
1   0 1   0       There are no preceeding rows in B so Ans is 0
2   2 0   1       Sum of value of A from row 2 to 1 is 2 <=7. So Ans is the value of B from first row = 1
3   3 4   1       Sum of value of A from row 3,2 and 1 is 5 <=7. So Ans is the sum of value of B in row 1 and 2, which is 1.
4   5 2   4       Value of A from row 4 is 5 which is <=7. So Ans is value of B from row 3, which is 4
5   8 3   0       Value of A in row 5 is 8 which is >7. So Ans is 0 (Value of Ans resets to 0 when A > 7).
6  90 4   0
7   8 2   0
8   2 1   2        Value of A in row 8 is 2 which <=7, so Ans is value of B in row 7 which is 2
9   4 2   3        Sum of value of A from row 9 and 8 is 6<=7, so Ans is sum of value of B in row 8 and 7 = 3
10  1 3   5        Sum of value of A from row 10,9 and 8 is 7<=7, so Ans is sum of value of B in row 9,8 and 7 =5.
11  2 1   6        Sum of value of A from row 11,10 and 9 is 7<=7, so Ans is sum of value of B in row 10,9 and 8 =6.
``````

Any help on how can I code this in R?

Please see the edit below which tries to answer the updated question.

If I have understood OP's intention right, then there are 3 rules:

1. if `A` is greater 7 then `Ans` is zero and grouping is restarted
2. if `cumsum(A)` within the group is less or equal 7 then `Ans` is the `cumsum()` of lagged `B`
3. if `cumsum(A)` within the group is greater 7 then `Ans` is lagged `B`

The code below produces the expected result for the given sample data set:

``````# create sample data set
DF <- data.frame(A = c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1),
B = c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3),
Ans = c(0, 1, 1, 4, 0, 0, 0, 2, 3, 5))
# load data.table, CRAN version 1.10.4 used
library(data.table)
# coerce to data.table
DT <- data.table(DF)
# create helper column with lagged values of
DT[, lagB := shift(B, fill = 0)][]
DT[, new := (A <= 7) * ifelse(cumsum(A) <= 7, cumsum(lagB), lagB), by = rleid(A <= 7)][
, lagB := NULL][]
``````
``````     A B Ans new
1:  0 1   0   0
2:  2 0   1   1
3:  3 4   1   1
4:  5 2   4   4
5:  8 3   0   0
6: 90 4   0   0
7:  8 2   0   0
8:  2 1   2   2
9:  4 2   3   3
10:  1 3   5   5
``````

`rleid(A <= 7)` creates unique group numbers for all consecutive streaks of `A` values not greater or greater of 7, resp. The `ifelse()` clause implements rules 2 and 3 within the grouping. By multiplying the result with `(A <= 7)`, rule 1 is implemented., thereby using the trick that `as.numeric(TRUE)` is 1 and `as.numeric(FALSE)` is 0. Finally, the helper column is removed.

Edit

With the additional information provided by the OP, I believe there is only one rule left:

• for each row find a window extending backwards which contains as many rows as `sum(A)` does not exceed 7. The answer is the sum of lagged `B` in the same window.
• For clarification, if the window has zero length because `A` in the inital row already exceeds 7, then the answer is zero.

The variable length of the sliding window is the tricky part here:

``````# sample data set consists of 11 rows after OP's edit
DF <- data.frame(A = c(0, 2, 3, 5, 8, 90, 8, 2, 4, 1, 2),
B = c(1, 0, 4, 2, 3, 4, 2, 1, 2, 3, 1),
Ans = c(0, 1, 1, 4, 0, 0, 0, 2, 3, 5, 6))
DT <- data.table(DF)
DT[, lagB := shift(B, fill = 0)][]

# find window lengths
DT[, wl := DT[, Reduce(`+`, shift(A, 0:6, fill = 0), accumulate = TRUE)][, rn := .I][
, Position(function(x) x <= 7, right = TRUE, unlist(.SD)), by = rn]\$V1][]

# sum lagged B in respective window
DT[, new := DT[, Reduce(`+`, shift(lagB, 0:6, fill = 0), accumulate = TRUE)][
, rn := .I][, wl := DT\$wl][, ifelse(is.na(wl), 0, unlist(.SD)[wl]), by = rn]\$V1][]
``````
``````     A B Ans lagB wl new
1:  0 1   0    0  7   0
2:  2 0   1    1  7   1
3:  3 4   1    0  7   1
4:  5 2   4    4  1   4
5:  8 3   0    2 NA   0
6: 90 4   0    3 NA   0
7:  8 2   0    4 NA   0
8:  2 1   2    2  1   2
9:  4 2   3    1  2   3
10:  1 3   5    2  3   5
11:  2 1   6    3  3   6
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download