Hamilton Blake Hamilton Blake - 3 months ago 23
R Question

Efficient Date Differencing in data.table

I have a list of people, with their dates of birth and death.

I would like to create a data set which has one record for each year in which each person was alive.

My code is as follows:

library(lubridate)
library(data.table)
deadPerson<-c("Albert Einstein","Erwin Schrodinger","Paul Dirac")
dateOfBirth<-c("1879-03-14","1887-08-12","1902-08-08")
dateOfDeath<-c("1955-04-18","1961-01-04","1984-10-20")

df<-data.frame(cbind(deadPerson,dateOfBirth,dateOfDeath))

df$dateOfBirth<-as.POSIXct(df$dateOfBirth)
df$dateOfDeath<-as.POSIXct(df$dateOfDeath)

for(i in 1:dim(df)[1])
{
birth_day<-df$dateOfBirth[i]
death_day<-df$dateOfDeath[i]
numDays<-as.numeric(death_day-birth_day)
numYears<-floor(numDays/365) # ignore leap years!
dates <- data.table(index=as.POSIXct(birth_day) + (0:numYears)*years(1))
dates$Person<-df$deadPerson[i]
if(i==1){output<-dates}
else{output<-rbind(output,dates)}
}
output$index<-year(output$index)
colnames(output)<-c("Year.Alive","Person")


The logic I've done is to loop through each person, and create a record for each year they were alive. Of course this is inefficient, however I can't work out a way of doing it efficiently (ie using an apply()).

Is there a better way to do this in data tables?

PS - I know I should have used years in the initial data set and simply add 1 year, but this is just for the purposes of a reproducible example. In my actual problem I'm messing about with creating multiple time series from data in a similar format to the above original data set, which is why I would like to use seconds, minutes, etc. and keep the datetime format.

Answer

You can use the summary syntax by group in data.table, and calculate the years vector at the j position, along which the group variable will be automatically broadcasted:

library(lubridate)
library(data.table)
df[, .(Year.Alive = seq(year(dateOfBirth), year(dateOfDeath))), by = (Person = deadPerson)]

#             Person Year.Alive
# 1: Albert Einstein       1879
# 2: Albert Einstein       1880
# 3: Albert Einstein       1881
# 4: Albert Einstein       1882
# 5: Albert Einstein       1883
# ---                           
# 231:      Paul Dirac       1980
# 232:      Paul Dirac       1981
# 233:      Paul Dirac       1982
# 234:      Paul Dirac       1983
# 235:      Paul Dirac       1984
Comments