N Stenfors N Stenfors - 29 days ago 9
R Question

Extract first and last values among a number of columns in data frame

I have observed subjects

a
-
d
during 2-4 years and each year they reported a numerical value. I want to extract the first and last value of each subject, ignoring NAs. How do I create the new variables
first_value
and
last_value
? In this example, I have included the desired result:

df <- data.frame(subject = c("a","b","c","d"),
year1 = c(1, 2, NA, NA),
year2 = c(3, 4, NA, 5),
year3 = c(6, 7, 8, NA),
year4 = c(9, 10, NA, 11),
first_value <- c(1, 2, 8, 5),
last_value <- c(9, 10, 8, 11))


And what would be the solution if variables
year1
-
year4
were categorical?

Answer

Using the data.table package:

library(data.table)
setDT(df)[, `:=` (first_value = na.omit(unlist(.SD))[1], 
                  last_value = tail(na.omit(unlist(.SD)),1)), 
          by = subject][]

which gives:

   subject year1 year2 year3 year4 first_value last_value
1:       a     1     3     6     9           1          9
2:       b     2     4     7    10           2         10
3:       c    NA    NA     8    NA           8          8
4:       d    NA     5    NA    11           5         11

Following the suggestion of @alexis_laz, you can use max.col as follows to get the repective values:

f <- max.col(!is.na(df[c("year1", "year2", "year3", "year4")]), 'first')
l <- max.col(!is.na(df[c("year1", "year2", "year3", "year4")]), 'last')

df$first_value <- sapply(seq_along(f), function(i) df[,-1][i,f[i]])
df$last_value <- sapply(seq_along(l), function(i) df[,-1][i,l[i]])

which will get you the same result.


And using the dplyr and tidyr packages:

library(dplyr)
library(tidyr)

df %>% 
  gather(year, val, 2:5) %>% 
  filter(!is.na(val)) %>% 
  group_by(subject) %>% 
  summarise(first_value = first(val),
            last_value = last(val)) %>% 
  left_join(df, ., by = 'subject')

WARNING: A variation of this without using filter and using na.omit(val) (or val[!is.na(val)]) in summarise:

df %>% 
  gather(year, val, 2:5) %>% 
  group_by(subject) %>% 
  summarise(first_value = first(na.omit(val)),
            last_value = last(na.omit(val))) %>% 
  left_join(df, ., by = 'subject')

won't work as a result of the bugs reported here and here.