Kia Eisinga Kia Eisinga - 2 months ago 19
R Question

Replacing NA with duplicates from adjacent row R

I have a dataset that contains features

Date
,
Age
and
Customer_ID
. Some of the rows in
Age
have missing values (NAs) in them and I would like to impute them.

Here is some example data:

column1 <- c("201101", "201102", "201101", "201102", "201103")
column2 <- c("12-17", "12-17", "30-35", NA, NA)
column3 <- c("1234", "1234", "5678", "5678", "5678")
df <- data.frame(column1, column2, column3)

Date Age Customer_ID
201101 12-17 1234
201102 12-17 1234
201101 30-35 5678
201102 NA 5678
201103 NA 5678


I would like to replace the NAs in
Age
with 30-35.

So for all NAs, it has to check whether there is another row with the same
Customer_ID
and replace the NA with the value for
Age
stated in the other row.

Any ideas on how to do this? Thanks.

Answer

You can use the fill function from tidyr. It is a tidyr function for last observation carried forward, i.e, fill NA values with previous non-NA value. In order for this to work, you can use arrange to sort the column2 which sorts the NA values behind non-NAs and then you can group by customer ID and fill the Age column:

library(dplyr)
library(tidyr)
df %>% arrange(Age) %>% group_by(Customer_ID) %>% fill(Age)

# Source: local data frame [5 x 3]
# Groups: Customer_ID [2]

#      Date    Age Customer_ID
#    <fctr>  <fctr>  <fctr>
# 1  201101   12-17    1234
# 2  201102   12-17    1234
# 3  201101   30-35    5678
# 4  201102   30-35    5678
# 5  201103   30-35    5678
Comments