Connor J Connor J - 29 days ago 12
R Question

How to remove rows matching criteria and rows adjacent to them

I have the following sample data:

data <- data.table(ID = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4),
date = c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6),
score = c(4,3,2,2,4,1,5,5,5,2,1,4,2,1,5,5,5,3,5,5,5,2,4,5))

ID date score
1: 1 1 4
2: 1 2 3
3: 1 3 2
4: 1 4 2
5: 1 5 4
6: 1 6 1
7: 2 1 5
8: 2 2 5
9: 2 3 5
10: 2 4 2
11: 2 5 1
12: 2 6 4
13: 3 1 2
14: 3 2 1
15: 3 3 5
16: 3 4 5
17: 3 5 5
18: 3 6 3
19: 4 1 5
20: 4 2 5
21: 4 3 5
22: 4 4 2
23: 4 5 4
24: 4 6 5
ID date score


I want to eliminate certain rows, and change others, based in part on their position in the table. I have two criteria, for each
ID
:


  1. If a row has
    date == 1
    and
    score == 5
    , I want to remove that row AND all subsequent rows that have
    score==5
    that follow immediately after that row, until
    score
    is not 5. (so, for example, for
    I == 4
    , I want to keep the data for dates 4,5,6).

  2. For all other dates where
    score == 5
    , I want to replace their score with the average of their preceding two scores (or just their previous score, if they only have one prior score).



So, the table I want to end up with is:

ID date score
1: 1 1 4.0
2: 1 2 3.0
3: 1 3 2.0
4: 1 4 2.0
5: 1 5 4.0
6: 1 6 1.0
7: 2 4 2.0
8: 2 5 1.0
9: 2 6 4.0
10: 3 1 2.0
11: 3 2 1.0
12: 3 3 1.5
13: 3 4 1.5
14: 3 5 1.5
15: 3 6 3.0
16: 4 4 2.0
17: 4 5 4.0
18: 4 6 3.0


What is the best way to go about this? I imagine it's some combination of
shift
and
.I
but I haven't been able to put it together.

Answer
# find rows satisfying 1st condition
torm = data[, if(score[1] == 5 & date[1] == 1) .I
            , by = .(ID, rleid(score), cumsum(date == 1))]$V1

library(caTools) # for running mean

data[-torm    # remove the extra rows
   # add a running mean
   ][, mn := runmean(score, 2, endrule = 'keep', align = 'right'), by = ID
   # compute the new score - a little care needed here in case we only have 5's in a group
   ][, new.score := ifelse(score == 5, mn[which(score != 5)[1]], score)
     , by = .(ID, cumsum(score != 5))][]
#    ID date score  mn new.score
# 1:  1    1     4 4.0       4.0
# 2:  1    2     3 3.5       3.0
# 3:  1    3     2 2.5       2.0
# 4:  1    4     2 2.0       2.0
# 5:  1    5     4 3.0       4.0
# 6:  1    6     1 2.5       1.0
# 7:  2    4     2 2.0       2.0
# 8:  2    5     1 1.5       1.0
# 9:  2    6     4 2.5       4.0
#10:  3    1     2 2.0       2.0
#11:  3    2     1 1.5       1.0
#12:  3    3     5 3.0       1.5
#13:  3    4     5 5.0       1.5
#14:  3    5     5 5.0       1.5
#15:  3    6     3 4.0       3.0
#16:  4    4     2 2.0       2.0
#17:  4    5     4 3.0       4.0
#18:  4    6     5 4.5       3.0