user6780841 user6780841 - 2 months ago 8
R Question

Calculate duration/difference between first and n rows that match on column value

I'm trying to calculate difference/duration between the first and n rows of a dataframe that match in one column. I want to place that value in a new column "duration". Sample data: below.

y <- data.frame(c("USA", "USA", "USA", "France", "France", "Mexico", "Mexico", "Mexico"), c(1992, 1993, 1994, 1989, 1990, 1999, 2000, 2001))
colnames(y) <- c("Country", "Year")
y$Year <- as.integer(y$Year) # this is to match the class of my actual data


My desired result is:

1992 USA 0
1993 USA 1
1994 USA 2
1989 France 0
1990 France 1
1999 Mexico 0
2000 Mexico 1
2001 Mexico 2


I've tried using dplyr's group_by and mutate

y <- y %>% group_by(Country) %>% mutate(duration = Year - lag(Year))


but I can only get the actual lag year (e.g. 1999) or only calculate the difference between sequential rows getting me either NA for the first row of a country or 1 for all other rows with the same country. Many q & a's focus on difference between sequential rows and not between the first and n rows.

Thoughts?

Answer

This can be done by subtracting the first 'Year' with the 'Year' column after grouping by 'Country'.

y %>%
   group_by(Country) %>%
   mutate(duration = Year - first(Year))