Constantin - 1 month ago 5x
R Question

# Conditionally fill column by other columns values

I have ~ 4 million rows of personal data that looks like the following:

``````names <- c("Peter", "Peter", "Peter", "Peter", "Peter", "Peter", "Peter", "Lisa", "Bert", "Carine", "Carine", "Carine", "Carine", "Carine", "Carine")
luckyToday <- c(0,0,0,NA,0,0,1,NA,1,NA,0,0,0,1,1)
luckyYesterday <- NA
df1 <- data.frame(names,luckyToday,luckyYesterday)

df1
names luckyToday luckyYesterday
1   Peter          0             NA
2   Peter          0             NA
3   Peter          0             NA
4   Peter         NA             NA
5   Peter          0             NA
6   Peter          0             NA
7   Peter          1             NA
8    Lisa         NA             NA
9    Bert          1             NA
10 Carine         NA             NA
11 Carine          0             NA
12 Carine          0             NA
13 Carine          0             NA
14 Carine          1             NA
15 Carine          1             NA
``````

The data contains observations of people (some with 1 observation, some with more) and their luckiness (1=lucky, 0=unlucky, NA=no information). As kind of a lagged variable, I want to introduce a new variable ("luckyYesterday") that tells me if the person was lucky during the last observation or not. So I want the data look like this:

``````df2
names luckyToday luckyYesterday
1   Peter          0             NA
2   Peter          0              0
3   Peter          0              0
4   Peter         NA              0
5   Peter          0              0
6   Peter          0              0
7   Peter          1              0
8    Lisa         NA             NA
9    Bert          1             NA
10 Carine         NA             NA
11 Carine          0              0
12 Carine          0              0
13 Carine          0              0
14 Carine          1              0
15 Carine          1              1
``````

I know that R is not the perfect programm to apply such data wrangling, but it is necessary.

I want to consider the following things:

1. Consider that these are unique persons I have observations from, dont mix up observations of different people.

2. Lower observations are newer. The last observation of a unique person is the latest one (row #7 is the latest observation of Peter).

3. If there is a NA in the "luckyToday" column (for example row #4), I dont want the corresponding "luckyYesterdayNA" (row #5) filled with a NA, too. I rather want the corresponding cell get filled with the value of the newest "non-NA" observation in the "luckyToday" column, which is a "0" in the example data. If the previous observations are only NAs, then I assume that they equal the latest observation. (if I have 9 NAs and the 10th observation is a "1", then I want to assume that observations 1-9 are also "1".

4. If there is only one observation per person, I want a NA in the "luckyYesterday" column anyway since I have no knowledge about the previous state of luck.

I tried it by myself with 2 for-loops, but I takes ages on my data with over 4 million observations. Can anyone help my with a faster solution such as with data.table or an apply function, please? I would appreciate that so much!

Cheers

You can use the `shift` function from `data.table` to observe yesterday and `na.locf` function from `zoo` package to fill NA with yesterday or tomorrow depending on if the `fromLast` parameter is F or T, and also group by the name if you don't want to mix observations of different people:

``````library(data.table); library(zoo)
setDT(df1)[,luckyYesterday := shift(na.locf(luckyToday, fromLast = TRUE)), names]

df1
#      names luckyToday luckyYesterday
#  1:  Peter          0             NA
#  2:  Peter          0              0
#  3:  Peter          0              0
#  4:  Peter         NA              0
#  5:  Peter          0              0
#  6:  Peter          0              0
#  7:  Peter          1              0
#  8:   Lisa         NA             NA
#  9:   Bert          1             NA
# 10: Carine         NA             NA
# 11: Carine          0              0
# 12: Carine          0              0
# 13: Carine          0              0
# 14: Carine          1              0
# 15: Carine          1              1
``````