MFR MFR - 19 days ago 5
R Question

The average of values when my factor is excluded

I wish to take average of the values when my factor is excluded. This is my data:

dff= data.frame(v1 =c("a,b,d","b,a,g", "f,d") , v2 = c(1,2,3) )


for example for
a
I wish to take average of rows that does not have
a
which is row 3 and average is 3. for
f
I wish to take average of rows without
f
which are rrow 1 and 3 and the average of V2 is (1 + 3) / 2 =1.5 My ideal output will be

factor avg

a 3
b 3
d 2
f 1.5
g 2


I tried two approaches: First one:

dff2 <- cSplit(dff, "v1", " , ", "long")

c <- setDT(dff2)[, .(value = (sum(dff2$v2) - sum(v2))/(nrow(dff2)-.N)) , by = v1]


and the second:

library(tidyverse)

dff %>% separate_rows(v1, convert = TRUE) %>%
group_by(v1) %>%
summarise(avg = mean(.$v2[!.$v1 %in% v1]))


none of them gave me the desirable outcome

Answer

First, we can get the collection of factors using:

fac <- unique(unlist(strsplit(as.character(dff$v1),split=",")))
##[1] "a" "b" "d" "g" "f"

then, we can compute what you want using:

avg <- sapply(fac, function(f) mean(ifelse(grepl(f,dff$v1), NA, dff$v2),na.rm=TRUE))
##  a   b   d   g   f 
##3.0 3.0 2.0 2.0 1.5 

Here, we use grepl to determine if each fac is found in dff$v1. If it is, then the ifelse returns NA for that row, and if it is not, returns the value of dff$v2 for that row. We can then compute the mean, ignoring the NAs for those rows that fac is found in dff$v1, as you wished.

Or, if you prefer, we can put the results into a data.frame:

data.frame(factor=fac,avg=avg)
##  factor avg
##a      a 3.0
##b      b 3.0
##d      d 2.0
##g      g 2.0
##f      f 1.5

In order to have exact match using %in%, consider the data:

## Note use of stringsAsFactors=FALSE to avoid having to coerce to character
## for strsplit
dff <- data.frame(v1 =c("a,b,d","b,a,g", "a.b,f,d"), v2 = c(1,2,3), stringsAsFactors=FALSE)

fac <- unique(unlist(strsplit(dff$v1,split=",")))
##[1] "a"   "b"   "d"   "g"   "a.b" "f"  
avg <- sapply(fac, function(f) mean(ifelse(sapply(strsplit(dff$v1,split=","), function(d) f %in% d), NA, dff$v2),na.rm=TRUE))
##  a   b   d   g a.b   f 
##3.0 3.0 2.0 2.0 1.5 1.5