alaj alaj - 2 months ago 6
R Question

How to match the boolean output of function "by" to input vector

I am trying to set data points that fall below the lower quartile minus 3*IQR, or above the upper quartile plus 3*IQR, to NA. The challenge I'm having is how to do this separately for each group in the data.

As an example, the data set below has a split column and a value column. For each split I need to compute the upper and lower quartiles and the IQR of the value column, then set the data points in the value column that meet the condition above to NA.

x <- structure(list(Split = c(1L, 1L, 3L, 2L, 2L, 2L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 2L, 3L), Value = c(0.9, 0.9, 3.5, 2.2, 2.2, 2.2, 2.2, 0.9, 3.5, 3.5, 3.5, 1.1, 3.5, 0.9, 1.9, 3.4, 0.9, 3.5, 2.2, 3.5)), .Names = c("Split", "Value"), class = "data.frame", row.names = c(NA, -20L))


I have used the "by" function to identify the values that need to be set to NA:

out <- by(
  x$Value,
  x$Split,
  function(y)
    y < (quantile(y, probs = c(.25, .75), na.rm = TRUE)[1] - 3 * IQR(y, na.rm = TRUE)) |
    y > (quantile(y, probs = c(.25, .75), na.rm = TRUE)[2] + 3 * IQR(y, na.rm = TRUE)))


Then I used the output with "unlist" to set the data points to NA:

x$Value[unlist(out)] <- NA


This does not work. The reason is that the "by" output is ordered by group, while the x$Value column is in its original row order.
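The mismatch can be seen on a much smaller case. The sketch below uses made-up toy vectors `g` and `v` (not the data above) and a simple "is this the group maximum" flag in place of the IQR rule, purely to show how the ordering of `unlist` differs from the row order:

```r
# Toy data (hypothetical): the groups are deliberately not sorted
g <- c(2L, 1L, 2L, 1L)
v <- c(10, 1, 20, 2)

# One logical vector per group: is the value the maximum of its group?
flags <- by(v, g, function(y) y == max(y))

# unlist() concatenates group 1's results, then group 2's
in_group_order <- unname(unlist(flags))

# The same flag computed directly in the original row order
in_row_order <- v == ave(v, g, FUN = max)

print(in_group_order)
print(in_row_order)
```

The two logical vectors have the same length but flag different positions, so indexing the original vector with the `unlist` result marks the wrong rows whenever the groups are not already sorted.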

Any suggestions on how I can match the "by" output to the original column and set the corresponding values to NA?

Thanks.

Answer

You can use unsplit instead of unlist to reverse the split from by:

x$Value[unsplit(out, x$Split)] <- NA
x
##   Split Value
##1      1   0.9
##2      1   0.9
##3      3   3.5
##4      2   2.2
##5      2   2.2
##6      2   2.2
##7      2   2.2
##8      1   0.9
##9      3   3.5
##10     3   3.5
##11     3   3.5
##12     3    NA
##13     3   3.5
##14     1   0.9
##15     1    NA
##16     3    NA
##17     1   0.9
##18     3   3.5
##19     2   2.2
##20     3   3.5

Pass x$Split to unsplit as well, since it is the factor that determined the split in the first place.
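As a possible alternative to `by` plus `unsplit` (a different technique, not the one used above), base R's `ave` applies a function per group and returns the results already in the original row order. This is only a sketch: the helper name `is_outlier` and its parameter `k` are made up for illustration.

```r
# Data copied from the question
x <- data.frame(
  Split = c(1L, 1L, 3L, 2L, 2L, 2L, 2L, 1L, 3L, 3L,
            3L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 2L, 3L),
  Value = c(0.9, 0.9, 3.5, 2.2, 2.2, 2.2, 2.2, 0.9, 3.5, 3.5,
            3.5, 1.1, 3.5, 0.9, 1.9, 3.4, 0.9, 3.5, 2.2, 3.5)
)

# Hypothetical helper: TRUE where a value lies below Q1 - k*IQR
# or above Q3 + k*IQR, computed within the vector it is given
is_outlier <- function(y, k = 3) {
  q <- quantile(y, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  y < q[1] - k * iqr | y > q[2] + k * iqr
}

# ave() calls is_outlier once per Split group and puts each result
# back at the rows it came from; the result is numeric 0/1, so it
# is converted back to logical
flag <- as.logical(ave(x$Value, x$Split, FUN = is_outlier))

which(flag)            # rows flagged as outliers
x$Value[flag] <- NA
```

On this data the flagged rows should be the same three that the `unsplit` approach sets to NA (12, 15 and 16).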
