wctjerry - 12 days ago
R Question

Calculate a custom deviation between columns in R

I am new to R.

I am trying to calculate the deviation between adjacent columns, row by row, with the following rules:


  1. The deviation is the current value minus the previous value (the value one column to the left in the same row).

  2. If the current value is NA, return NA without calculating anything.

  3. If the previous value is NA, subtract the value before it instead, going further left until a non-NA value is found.

  4. The value in the first column is always valid (never NA).



For example:

start = c(1, 2, 3, 4)
a = c(2, NA, 5, 6)
b = c(4, 5, NA, 8)

test <- data.frame(start, a, b)
test
  start  a  b
1     1  2  4
2     2 NA  5
3     3  5 NA
4     4  6  8


Expected:

result

  a_delta b_delta
1       1       2
2      NA       3
3       2      NA
4       2       2


Note:


  1. Cell (2, 1) in result is NA because cell (2, 2) in test is NA.

  2. Cell (2, 2) in result is 3 because cell (2, 2) in test is NA and is skipped, so cell (2, 3) in test minus cell (2, 1) in test gives 5 - 2 = 3.



Here is my broken code. Any suggestions are welcome:

f <- function(data){
  cn <- colnames(data)
  cl <- ncol(data)
  for (i in 2:cl)){
    if (is.na(data$i)) {a <- NA}
    else if (!is.na(data$(i-1))) {paste(cn[i], "_delta") <- data$cn[i] - data$cn[i-1]}
    else { # check if previous value is NA repeatedly
      t < i - 1
      while (is.na(data$cn[t])) {
        t <- t - 1
      }
      paste(cn[i], "_delta") <- data$cn[i] - data$cn[t]
    }
  }

}

f(test)

Answer

Main function:

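CalcDev <- function(x){
  # x is one row; flatten it to a named numeric vector
  x <- unlist(x)
  if (!any(is.na(x))){
    # no gaps: plain successive differences
    return(diff(x, 1))
  } else {
    # drop the first column; the remaining positions will hold the deltas
    tmp <- x[-1]
    # differences between consecutive non-NA values
    res <- diff(na.omit(x), 1)
    # fill the non-NA positions; positions where the current value is NA stay NA
    tmp[!is.na(tmp)] <- res
    return(tmp)
  }
}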
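
To see why the NA branch gives the right answer, here is a small trace of the same steps on the second row of test, written out by hand in base R. The variable names mirror those inside CalcDev; nothing here is new logic, it is only a sketch of what the function does for one row.

```r
# Row 2 of test: the previous value for b is NA, so b must be compared with start.
x <- c(start = 2, a = NA, b = 5)

# Positions that will hold the deltas (everything except the first column).
tmp <- x[-1]                  # a = NA, b = 5

# Differences between consecutive non-NA values: 5 - 2.
res <- diff(na.omit(x), 1)    # 3

# Only non-NA positions receive a delta; a stays NA.
tmp[!is.na(tmp)] <- res
tmp                           # a = NA, b = 3
```

This relies on rule 4 from the question: because the first value is never NA, the number of differences always equals the number of non-NA positions in tmp.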

And how to use:

start = c(1, 2, 3, 4)
a = c(2, NA, 5, 6)
b = c(4, 5, NA, 8)
test <- data.frame(start, a, b)
plyr::adply(test, 1, CalcDev)[, -1]

Result:

   a  b
1  1  2
2 NA  3
3  2 NA
4  2  2

You just need to rename the columns (to a_delta and b_delta).
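
A minimal sketch of that renaming step follows. To keep it runnable without plyr it uses base R's apply() over rows instead of adply(), which is a substitution on my part rather than the call shown above; CalcDev is copied unchanged so the block stands on its own.

```r
CalcDev <- function(x){
  x <- unlist(x)
  if (!any(is.na(x))){
    return(diff(x, 1))
  } else {
    tmp <- x[-1]
    res <- diff(na.omit(x), 1)
    tmp[!is.na(tmp)] <- res
    return(tmp)
  }
}

test <- data.frame(start = c(1, 2, 3, 4),
                   a     = c(2, NA, 5, 6),
                   b     = c(4, 5, NA, 8))

# apply() hands each row to CalcDev and returns one column per row,
# so transpose to get one row per original row.
result <- as.data.frame(t(apply(test, 1, CalcDev)))

# Append the suffix the question asked for.
names(result) <- paste0(names(result), "_delta")
result
```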

I was unable to run your code, so no benchmark.
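
For reference, here is a repaired sketch of the loop idea from the question, assuming the intent was a row-wise delta against the nearest non-NA value to the left. The name delta_loop is hypothetical, and the function assumes at least two columns with a first column that is never NA.

```r
delta_loop <- function(data) {
  cn  <- colnames(data)
  out <- data.frame(row.names = seq_len(nrow(data)))
  for (i in 2:ncol(data)) {
    cur  <- data[[i]]
    prev <- rep(NA_real_, nrow(data))
    # Walk left from column i - 1; each row keeps the first non-NA value it meets.
    for (t in (i - 1):1) {
      fill <- is.na(prev)
      prev[fill] <- data[[t]][fill]
    }
    # If cur is NA the subtraction yields NA, which is rule 2.
    out[[paste0(cn[i], "_delta")]] <- cur - prev
  }
  out
}

test <- data.frame(start = c(1, 2, 3, 4),
                   a     = c(2, NA, 5, 6),
                   b     = c(4, 5, NA, 8))
res_loop <- delta_loop(test)
res_loop
```

Note that data[[i]] and paste0() replace the data$i and paste(..., "_delta") forms from the question: $ does not evaluate a variable, and paste() would insert a space before the suffix.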

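EDIT: Answering your comment, you can use the CalcDev function in a dplyr chain if you vectorize it. Vectorize() maps over the elements of its argument, which for a data frame are the columns, so transpose first to feed it rows:

library(dplyr)  # provides %>%

CalcDev.Vect <- Vectorize(CalcDev)

test %>%
  t %>%
  as.data.frame %>%
  CalcDev.Vect %>%
  t %>%
  as.data.frame

You will get the same values (the columns come back with default names, so rename them), and it will be much faster, especially for bigger data sets.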

There are two alternatives: using CalcDev inside do({}), or using adply directly in the chain, but both are slower. Benchmarks for a small data set:

                    expr     min      lq   mean  median      uq     max neval  cld
          foo.plyr(test) 2240.34 2392.08 2511.3 2490.13 2577.32 3199.16   100  b  
      foo.do_dplyr(test) 2680.34 2933.70 3104.4 3015.15 3109.48 5771.83   100    d
 foo.plyr_in_dplyr(test) 2471.51 2635.04 2805.7 2702.99 2802.29 9422.46   100   c 
          foo.Vect(test)  441.55  490.58  539.7  539.92  564.74  928.41   100 a   

For bigger data sets the difference in evaluation time will be more pronounced.
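
If you want to check this on your own data, a rough sketch along these lines may help. It builds a larger random data frame (the sizes, the seed and the share of NAs are arbitrary assumptions) and times a base R row-wise call with system.time(); it does not reproduce the foo.* functions from the benchmark above, whose definitions are not shown here.

```r
CalcDev <- function(x){
  x <- unlist(x)
  if (!any(is.na(x))){
    return(diff(x, 1))
  } else {
    tmp <- x[-1]
    res <- diff(na.omit(x), 1)
    tmp[!is.na(tmp)] <- res
    return(tmp)
  }
}

set.seed(1)
n_row <- 2000
n_col <- 10
m <- matrix(rnorm(n_row * n_col), n_row, n_col)

# Put NAs everywhere except the first column, which must stay valid.
holes <- matrix(runif(n_row * n_col) < 0.15, n_row, n_col)
holes[, 1] <- FALSE
m[holes] <- NA
big <- as.data.frame(m)

# Time the row-wise computation; the numbers will differ between machines.
timing <- system.time(res_big <- t(apply(big, 1, CalcDev)))
timing
dim(res_big)
```

A delta is NA exactly where the current value is NA, which gives a cheap sanity check on the result before comparing timings.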