Jamesm131 Jamesm131 - 9 months ago 37
R Question

Most efficient way to replace NAs in a data frame based on a subset of other row factors (using median as an estimate) in R

I would like to estimate the missing values of a numeric variable in a data frame using the median of that variable within each combination of other factors. I would then like to replace the NAs in the numeric variable with these estimates.

I have a data frame like this:

Fac1 Fac2 Var1
A a 20
A b 30
B a 5
B b 10

I have used the aggregate function to find the median for each combination of factors:

A a = 22
A b = 28
B a = 12
B b = 8
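For illustration, a call of roughly the following shape produces such a table of per-group medians. This is only a sketch: the data frame name df and the sample values are assumptions chosen to reproduce the medians listed above, not the original data.

```r
# Hypothetical sample data: column names follow the question,
# values are made up so that the group medians match the ones listed above.
df <- data.frame(
  Fac1 = rep(c("A", "B"), each = 6),
  Fac2 = rep(rep(c("a", "b"), each = 3), times = 2),
  Var1 = c(20, 24, NA, 30, 26, NA, 5, 19, NA, 10, 6, NA)
)

# The formula interface of aggregate() drops rows with NA by default
# (na.action = na.omit), so the medians use only the observed values.
medians <- aggregate(Var1 ~ Fac1 + Fac2, data = df, FUN = median)
print(medians)
```

The result is a data frame with one row per combination of Fac1 and Fac2, which still has to be matched back to the rows containing NAs.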

So any NAs in Var1 would be replaced with the median for the corresponding combination of factors.

I understand that this could be done by replacing the missing values for each subset of the data individually, but that would quickly become tedious with more than two factors.
Is there a more efficient way to get this result?

Answer Source

You haven't provided sample data, but based on your question I think this should work.

As @Roland mentioned, there is no need to calculate the medians separately.

Assume your data frame is called df. For every group (here, each combination of Fac1 and Fac2) we calculate the median with the NA values removed. We then select only the indices that hold NA values and replace each one with its group's median.

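# ave() returns a vector as long as df$Var1 holding each row's group median;
# the trailing index keeps only the entries for rows where Var1 is NA.
df$Var1[is.na(df$Var1)] <- ave(df$Var1, df$Fac1, df$Fac2,
                               FUN = function(x) median(x, na.rm = TRUE))[is.na(df$Var1)]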
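
Since no sample data was given, here is a small worked example of the same idea. It is a sketch under assumptions: the values in df are made up (picked so that the group medians come out as 22, 28, 12 and 8, as in the question), and the helper names missing and group_median are only for illustration.

```r
# Hypothetical sample data: one NA in each Fac1/Fac2 combination.
df <- data.frame(
  Fac1 = rep(c("A", "B"), each = 6),
  Fac2 = rep(rep(c("a", "b"), each = 3), times = 2),
  Var1 = c(20, 24, NA, 30, 26, NA, 5, 19, NA, 10, 6, NA)
)

# Remember which rows are missing before anything is overwritten.
missing <- is.na(df$Var1)

# ave() returns a vector aligned with df$Var1 in which every element
# is the median of its own group, computed with NAs removed.
group_median <- ave(df$Var1, df$Fac1, df$Fac2,
                    FUN = function(x) median(x, na.rm = TRUE))

# Fill only the missing rows with their group's median.
df$Var1[missing] <- group_median[missing]
print(df)
```

Adding more grouping factors only means passing more columns to ave(). Note that a group whose values are all NA has no median to borrow, so its rows stay NA.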