LukasKawerau LukasKawerau - 1 month ago 4
R Question

R - Keep newest data that is not NA, unless only NA available

I have a dataset with three columns: year, city, value which looks like this:

year = c(2010, 2013, 2010, 2013, 2013)
city = c("Berlin","Berlin", "Munich", "Munich", "Frankfurt")
value = c(1234, NA, NA, 6372, NA)
data <- data.frame(year, value1, value2)

year city value
1 2010 Berlin 1234
2 2013 Berlin NA
3 2010 Munich NA
4 2013 Munich 6372
5 2013 Frankfurt NA


I would like to know how to subset this so that I keep only the newest data that is available, so that at the end I am left with data like this:

year city value
1 2010 Berlin 1234
2 2013 Munich 6372
3 2013 Frankfurt NA


If I subset on the highest year, I get NAs where for that year there isn't data. If I subset on
!is.na()
, I lose all rows where there's only NA available.

What I want to do specifically is get the highest year for a given city with data, unless there are only NAs for that city, then the highest year with NA. How would I go about that?

Answer

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(data)), grouped by 'city', we specify the 'i' as the 'year' in descending order index, if there are any non-NA 'value', we Subset the 'Data.table' based on the index of the first non-NA 'value' or else return the Subset of Data.table.

library(data.table)
setDT(data)[order(-year), if(any(!is.na(value)))  
            .SD[which(!is.na(value))[1L]] else .SD, by = city]

Or a compact option by @David Arenburg where we get the index from which.max

setDT(data)[order(-year), .SD[which.max(!is.na(value))], by = city] 

Or use a modification using .I to make it faster

setDT(data)[data[order(-year), .I[which.max(!is.na(value))], by = city]$V1]
Comments