Lingyu Kong - 1 year ago 108
R Question

# How to replace NA with most recent non-NA by group?

I have a DF of individuals with some incomplete and repeated characteristics as following:

``````    name <- c("A", "A", "B", "B", "B", "C", "D", "D")
age <- c(28,NA,NA,NA,NA,NA,53,NA)
birthplace <- c("city1",NA, "city2",NA,NA,NA,NA,NA)
value <- 100:107
df <- data.frame(name,age,birthplace,value)

name age birthplace value
1    A  28      city1   100
2    A  NA       <NA>   101
3    B  NA      city2   102
4    B  NA       <NA>   103
5    B  NA       <NA>   104
6    C  NA       <NA>   105
7    D  53       <NA>   106
8    D  NA       <NA>   107
``````

Since the value is unique for row. I want complete each row with aviable person's detail like this:

``````       name age birthplace value
1    A  28      city1   100
2    A  28      city1   101
3    B  NA      city2   102
4    B  NA      city2   103
5    B  NA      city2   104
6    C  NA       <NA>   105
7    D  53       <NA>   106
8    D  53       <NA>   107
``````

I tried to use

``````library(zoo)
library(dplyr)
df <- df %>% group_by(name) %>% na.locf(na.rm=F)
``````

But it does't work very well. Any idea for implement function by group?

As another base R solution, here is a poor man's na.locf

``````fill_down <- function(v) {
if (length(v) > 1) {
keep <- c(TRUE, !is.na(v[-1]))
v[keep][cumsum(keep)]
} else v
}
``````

To fill down by group, the approach is to use `tapply()` to split and apply to each group, and `split<-` to combine groups to the original geometry, as

``````fill_down_by_group <- function(v, grp) {
split(v, grp) <- tapply(v, grp, fill_down)
v
}
``````

(Better: `ave(v, grp, FUN=fill_down`). To process multiple columns, one might

``````elts <- c("age", "birthplace")
df[elts] <- lapply(df[elts], fill_down_by_group, df\$name)
``````

### Notes

1. I would be interested in seeing how a dplyr solution handles many columns, without hard-coding each? Answering my own question, I guess this is

``````library(dplyr); library(tidyr)
df %>% group_by(name) %>% fill_(elts)
``````
2. A more efficient base solution when the groups are already 'grouped' (e.g., `identical(grp, sort(grp))`) is

``````fill_down_by_grouped <- function(v, grp) {
if (length(v) > 1) {
keep <- !(duplicated(v) & is.na(v))
v[keep][cumsum(keep)]
} else v
}
``````
3. For me, `fill_down()` on a vector with about 10M elements takes ~225ms; `fill_down_by_grouped()` takes ~300ms independent of the number of groups; `fill_down_by_group()` scales with the number of groups; for 10000 groups ~2s, 10M groups about 36s

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download