Leo P. - 22 days ago
R Question

Subtract from column the value of the first column that is not NA while going back through the data.frame

I've got a dataset with variables holding the cumulative time spent filling out a web survey (each variable corresponds to one page of the survey). I need the variables to show not the cumulative time, but the time spent on that page alone. Unfortunately, because the questionnaire contains filters, some variables are NA where a page was filtered out.

Here's some sample data (the first variable can be considered a baseline and does not include any NA):

   rts5032939 rts5032955 rts5032973 rts5032974 rts5032975 rts5032977 rts5032978 rts5032979 rts5033085 rts5033089
1          70         99         NA        104        111        119        132        147        175        196
3          33        144         NA        156        165         NA        199        259        297        357
15         18         57         NA         63         69         80         99        174        190        221
20       2107       2126         NA       2131       2139       2209       2220       2236         NA       2274
24       1088       1111         NA       1117       1124       1133       1152         NA       1208       1228
30         27         61         NA         70         83         90        182        230        298        336


Here's a dput() for you to work with:

test <- structure(list(rts5032939 = c(70, 33, 18, 2107, 1088, 27), rts5032955 = c(99,
144, 57, 2126, 1111, 61), rts5032973 = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), rts5032974 = c(104,
156, 63, 2131, 1117, 70), rts5032975 = c(111, 165, 69, 2139,
1124, 83), rts5032977 = c(119, NA_real_, 80, 2209, 1133, 90), rts5032978 = c(132,
199, 99, 2220, 1152, 182), rts5032979 = c(147, 259, 174, 2236,
NA_real_, 230), rts5033085 = c(175, 297, 190, NA_real_, 1208, 298), rts5033089 = c(196,
357, 221, 2274, 1228, 336)), .Names = c("rts5032939", "rts5032955",
"rts5032973", "rts5032974", "rts5032975", "rts5032977", "rts5032978",
"rts5032979", "rts5033085", "rts5033089"), row.names = c(1L,
3L, 15L, 20L, 24L, 30L), class = "data.frame")


The expected output looks like this. Note that the variables don't have to be replaced, I'm perfectly fine with newly generated variables.

   rts5032939 rts5032955 rts5032973 rts5032974 rts5032975 rts5032977 rts5032978 rts5032979 rts5033085 rts5033089
1          70         29         NA          5          7          8         13         15         28         21
3          33        111         NA         12          9         NA         34         60         38         60
15         18         39         NA          6          6         11         19         75         16         31
20       2107         19         NA          5          8         70         11         16         NA         38
24       1088         23         NA          6          7          9         19         NA         56         20
30         27         34         NA          9         13          7         92         48         68         38


Before I noticed the problem with the NA, I used

for (i in 2:10) {
df1[paste0("t_", i)] <- df1[i]-df1[i-1]
}


to generate new variables with the subtracted time. Obviously, this does not work whenever an NA value is involved in the subtraction.

Considering that the NA occurs case by case, the new approach has to loop through both the columns and the rows. To be honest, I'm not even sure where to begin. Can anyone point me in the right direction on how this can be achieved? I'd be happy with general advice from which I could work out the specifics myself.

Edit: To clarify, I need the output to be only the difference between each value and the first non-NA value that comes before it in that row.

Answer

Hope this is better, though it will definitely be slow!

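abc <- function(x) {
  zz <- as.numeric(x)
  w <- zz  # the first element keeps the baseline value, as in the expected output
  for (i in seq_along(zz)[-1]) {
    # position of the nearest earlier non-NA value in this row
    # (note the parentheses: 1:i-1 would be parsed as (1:i) - 1)
    prev <- max(which(!is.na(zz[1:(i - 1)])))
    w[i] <- zz[i] - zz[prev]  # stays NA when zz[i] itself is NA
  }
  names(w) <- names(x)  # keep the column names in the result
  w
}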

t(apply(test, 1, abc))
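
A possibly faster variant, offered only as a sketch: since the times are cumulative, the time spent on a page is just the difference between consecutive non-NA values within a row, so diff() on the non-NA entries should do the same job without the inner loop. The helper name page_times and the small sample_df (the first seven values of rows 1 and 3 from the question, with shortened column names) are made up for illustration. It assumes the helper receives a plain numeric vector, which is what apply() passes for each row.

```r
# Hypothetical helper, not part of the original answer.
page_times <- function(x) {
  ok <- !is.na(x)
  out <- x                              # NA entries stay NA
  # first non-NA value is kept as the baseline, the rest become differences
  out[ok] <- c(x[ok][1], diff(x[ok]))
  out
}

# Made-up sample: first seven pages of rows 1 and 3 from the question.
sample_df <- data.frame(
  p1 = c(70, 33),
  p2 = c(99, 144),
  p3 = c(NA_real_, NA_real_),
  p4 = c(104, 156),
  p5 = c(111, 165),
  p6 = c(119, NA_real_),
  p7 = c(132, 199),
  row.names = c(1L, 3L)
)

res <- t(apply(sample_df, 1, page_times))
print(res)
# row 1: 70  29 NA  5 7  8 13
# row 3: 33 111 NA 12 9 NA 34
```

The same call should work on the full data as t(apply(test, 1, page_times)). It still goes row by row through apply(), so it is not fully vectorised, but each row is handled with vector operations instead of an element-wise loop.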