Hamilton Blake Hamilton Blake - 1 year ago 71
R Question

Efficient Date Differencing in data.table

I have a list of people, with their dates of birth and death.

I would like to create a data set which has one record for each year in which each person was alive.

My code is as follows:

library(lubridate)
library(data.table)
deadPerson<-c("Albert Einstein","Erwin Schrodinger","Paul Dirac")
dateOfBirth<-c("1879-03-14","1887-08-12","1902-08-08")
dateOfDeath<-c("1955-04-18","1961-01-04","1984-10-20")

df<-data.frame(cbind(deadPerson,dateOfBirth,dateOfDeath))

df$dateOfBirth<-as.POSIXct(df$dateOfBirth)
df$dateOfDeath<-as.POSIXct(df$dateOfDeath)

for(i in 1:dim(df)[1])
{
birth_day<-df$dateOfBirth[i]
death_day<-df$dateOfDeath[i]
numDays<-as.numeric(death_day-birth_day)
numYears<-floor(numDays/365) # ignore leap years!
dates <- data.table(index=as.POSIXct(birth_day) + (0:numYears)*years(1))
dates$Person<-df$deadPerson[i]
if(i==1){output<-dates}
else{output<-rbind(output,dates)}
}
output$index<-year(output$index)
colnames(output)<-c("Year.Alive","Person")


The logic I've done is to loop through each person, and create a record for each year they were alive. Of course this is inefficient, however I can't work out a way of doing it efficiently (ie using an apply()).

Is there a better way to do this in data tables?

PS - I know I should have used years in the initial data set and simply add 1 year, but this is just for the purposes of a reproducible example. In my actual problem I'm messing about with creating multiple time series from data in a similar format to the above original data set, which is why I would like to use seconds, minutes, etc. and keep the datetime format.

Answer Source

You can use the summary syntax by group in data.table, and calculate the years vector at the j position, along which the group variable will be automatically broadcasted:

library(lubridate)
library(data.table)
df[, .(Year.Alive = seq(year(dateOfBirth), year(dateOfDeath))), by = (Person = deadPerson)]

#             Person Year.Alive
# 1: Albert Einstein       1879
# 2: Albert Einstein       1880
# 3: Albert Einstein       1881
# 4: Albert Einstein       1882
# 5: Albert Einstein       1883
# ---                           
# 231:      Paul Dirac       1980
# 232:      Paul Dirac       1981
# 233:      Paul Dirac       1982
# 234:      Paul Dirac       1983
# 235:      Paul Dirac       1984