LukasKawerau - 1 year ago 58
R Question

# R - Keep newest data that is not NA, unless only NA available

I have a dataset with three columns: year, city, value which looks like this:

``````year = c(2010, 2013, 2010, 2013, 2013)
city = c("Berlin","Berlin", "Munich", "Munich", "Frankfurt")
value = c(1234, NA, NA, 6372, NA)
data <- data.frame(year, value1, value2)

year    city    value
1 2010    Berlin   1234
2 2013    Berlin     NA
3 2010    Munich     NA
4 2013    Munich   6372
5 2013 Frankfurt     NA
``````

I would like to know how to subset this so that I keep only the newest data that is available, so that at the end I am left with data like this:

`````` year    city    value
1 2010    Berlin   1234
2 2013    Munich   6372
3 2013 Frankfurt     NA
``````

If I subset on the highest year, I get NAs where for that year there isn't data. If I subset on
`!is.na()`
, I lose all rows where there's only NA available.

What I want to do specifically is get the highest year for a given city with data, unless there are only NAs for that city, then the highest year with NA. How would I go about that?

Answer Source

We can use `data.table`. Convert the 'data.frame' to 'data.table' (`setDT(data)`), grouped by 'city', we specify the 'i' as the 'year' in descending `order` index, `if` there are `any` non-NA 'value', we Subset the 'Data.table' based on the index of the first non-NA 'value' or `else` return the Subset of Data.table.

``````library(data.table)
setDT(data)[order(-year), if(any(!is.na(value)))
.SD[which(!is.na(value))[1L]] else .SD, by = city]
``````

Or a compact option by @David Arenburg where we get the index from `which.max`

``````setDT(data)[order(-year), .SD[which.max(!is.na(value))], by = city]
``````

Or use a modification using `.I` to make it faster

``````setDT(data)[data[order(-year), .I[which.max(!is.na(value))], by = city]\$V1]
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download