Natty_E - 1 year ago 107

R Question

I have some data that looks a little like this

`ID year var1 var2`

1 1 1 NA 0.5632595

2 1 2 0.7546097 0.5609945

3 1 3 -0.4241935 NA

4 1 4 0.4056908 0.5890453

5 2 1 -0.8049815 0.3504281

6 2 2 0.8049250 0.4817798

7 2 3 NA NA

8 2 4 -0.2969572 0.4985812

9 3 1 0.2909882 0.8504004

10 3 2 1.0957994 0.7365867

11 3 3 -0.2884501 0.1454566

12 3 4 0.4999331 0.7978971

tmp <- structure(list(ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), year = c(1L,

2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), var1 = c(NA, 0.754609745086276,

-0.424193528509845, 0.4056908200679, -0.804981499494056, 0.804924965958355,

NA, -0.2969572255706, 0.29098820839828, 1.09579940195461, -0.288450063674258,

0.499933144375212), var2 = c(0.563259549904615, 0.560994466999546,

NA, 0.589045349741355, 0.350428087171167, 0.481779781170189,

NA, 0.498581154504791, 0.850400378694758, 0.73658673488535, 0.145456639816985,

0.797897139331326)), .Names = c("ID", "year", "var1", "var2"), row.names = c(NA,

-12L), class = "data.frame")

What I want to be able to do, is fit a linear model for each ID to each column, i.e.

`tmp %>% group_by(ID) %>% lm(var1 ~ year, data = .)`

tmp %>% group_by(ID) %>% lm(var2 ~ year, data = .)

I would then use the coefficients of these models to fill in the missing (

`NA`

`coefs_id1_var1 <- coef(lm(var1 ~ year, data = tmp[tmp$ID == 1, ]))`

coefs_id1_var1[1] + coefs_id1_var1[2] * tmp[1, 2]

[1] -0.1341153

so the missing value for ID 1 Variable 1 would be replaced by

`-0.134`

`lapply`

`split`

`mutate_each`

My current solution is

`fillWithLinMod <- function(var, df) {`

mod <- as.formula(paste0(var, " ~ year"))

coefs <- coef(lm(mod, data = df))

for (i in 1:nrow(df)) {

if (is.na(df[i, var])) {

df[i, var] <- coefs[1] + df[i, "year"] * coefs[2]

}

}

df[, var][[1]]

}

(Note my df is a

`tibble`

Then I can use the following

`tmp$var1 <- do.call("c", lapply(split(tmp, tmp$ID), function(x) fillWithLinMod("var1", x)))`

Answer Source

Here is a worker function that you can apply throughout your subsets. You pass in a data.frame and a character vector of variable names to be filled. It assumes there is a variable called 'year' as you have done in your function.

```
fill_missing_with_lm <- function(dat, vars) {
for(i in seq_along(vars)) {
mod <- as.formula(paste0(vars[i], " ~ year"))
mod <- lm(mod, dat)
misses <- which(is.na(dat[[ vars[i] ]]))
for(j in misses) {
newdat <- data.frame(year = dat$year[j])
dat[[ vars[i] ]][j] <- predict(mod, newdat)
}
}
return(dat)
}
```

Then you can apply it with this workflow (nested data.frame's) which i really like. I have found it generally helpful for handling data where you want do to something a little nuanced to subsets of rows of your data. The basic principle is the `group_by() %>% nest`

. Then you use `purrr::map()`

to apply your nuanced actions to each of the nested data.frames.

```
library(dplyr)
library(tidyr)
library(purrr)
filled <- tmp %>%
group_by(ID) %>%
nest %>%
mutate(filled = map(data, fill_missing_with_lm, vars = c('var1', 'var2'))) %>%
select(ID, filled) %>%
unnest
```