wctjerry - 1 year ago 107
R Question

# Custom calculation of deviation between columns in R

I am a newbie in R.

I am trying to calculate the deviation between adjacent columns, row by row, with the following rules:

1. deviation is calculated by current value minus previous value

2. if current value is NA, then return NA without calculation

3. if previous value is NA, then subtract the value before it instead, stepping back until a valid (non-NA) value is found

4. the value in the first column is always valid

For example:

``````
start = c(1, 2, 3, 4)
a = c(2, NA, 5, 6)
b = c(4, 5, NA, 8)

test <- data.frame(start, a, b)
test
  start  a  b
1     1  2  4
2     2 NA  5
3     3  5 NA
4     4  6  8
``````

Expected:

``````
result

  a_delta b_delta
1       1       2
2      NA       3
3       2      NA
4       2       2
``````

Note:

1. cell (2, 1) in result is NA because cell (2, 2) in test is NA

2. cell (2, 2) in result is 3 because cell (2, 3) minus cell (2, 1) in test gives 3

Here is my broken code. Any suggestions are welcome:

``````
f <- function(data){
  cn <- colnames(data)
  cl <- ncol(data)
  for (i in 2:cl)){
    if (is.na(data$i)) {a <- NA}
    else if (!is.na(data$(i-1))) {paste(cn[i], "_delta") <- data$cn[i] - data$cn[i-1]}
    else { # check if previous value is NA repeatedly
      t < i - 1
      while (is.na(data$cn[t])) {
        t <- t - 1
      }
      paste(cn[i], "_delta") <- data$cn[i] - data$cn[t]
    }
  }

}

f(test)
``````

Main function:

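``````
CalcDev <- function(x){
  # Flatten one row (a 1-row data frame or a vector) into a plain vector
  x <- unlist(x)
  if (!any(is.na(x))){
    # No gaps: plain successive differences
    return(diff(x, 1))
  } else {
    # Drop the first value (always valid), leaving one slot per delta
    tmp <- x[-1]
    # Differences between consecutive non-NA values
    res <- diff(na.omit(x), 1)
    # Fill them into the non-NA slots; NA slots stay NA
    tmp[!is.na(tmp)] <- res
    return(tmp)
  }
}
``````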

And how to use:

``````
start = c(1, 2, 3, 4)
a = c(2, NA, 5, 6)
b = c(4, 5, NA, 8)
test <- data.frame(start, a, b)
``````

Result:

``````
   a  b
1  1  2
2 NA  3
3  2 NA
4  2  2
``````

You just need to rename columns.
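The call that applies `CalcDev` to each row is not shown above. Here is a minimal sketch of one way to do it, assuming base R's `apply` (which may not be what was originally used), followed by the column rename. The name `result` is only illustrative.

```r
# CalcDev as defined above, condensed and repeated so this snippet runs on its own
CalcDev <- function(x) {
  x <- unlist(x)
  if (!any(is.na(x))) return(diff(x, 1))
  tmp <- x[-1]
  tmp[!is.na(tmp)] <- diff(na.omit(x), 1)
  tmp
}

test <- data.frame(start = c(1, 2, 3, 4),
                   a     = c(2, NA, 5, 6),
                   b     = c(4, 5, NA, 8))

# apply() passes each row to CalcDev as a named numeric vector and returns
# one column per input row, so transpose to get one row per input row
result <- as.data.frame(t(apply(test, 1, CalcDev)))

# Rename a, b to a_delta, b_delta
names(result) <- paste0(names(result), "_delta")
result
```

Note that `apply` converts the data frame to a matrix first, so this sketch assumes all columns are numeric.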

I was unable to run your code, so no benchmark.

EDIT: Answering your comment, you can use the `CalcDev` function in a `dplyr` chain if you vectorize it:

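``````
library(dplyr)  # provides %>%

CalcDev.Vect <- Vectorize(CalcDev)

# Vectorize() is built on mapply(), which walks a data frame column by column,
# so transpose first to turn each original row into a column
test %>%
  t %>%
  as.data.frame %>%
  CalcDev.Vect %>%
  t %>%
  as.data.frame
``````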

You will get similar results, and it will be much faster, especially for bigger data sets.

There are two alternatives: using `CalcDev` inside `do({})`, or calling `adply` directly in the chain, but both are slower. Benchmarks for a small data set:

``````
                    expr     min      lq   mean  median      uq     max neval  cld
          foo.plyr(test) 2240.34 2392.08 2511.3 2490.13 2577.32 3199.16   100  b
      foo.do_dplyr(test) 2680.34 2933.70 3104.4 3015.15 3109.48 5771.83   100    d
 foo.plyr_in_dplyr(test) 2471.51 2635.04 2805.7 2702.99 2802.29 9422.46   100   c
          foo.Vect(test)  441.55  490.58  539.7  539.92  564.74  928.41   100 a
``````

For bigger data sets the difference in evaluation time will be more drastic.
