Marianne Marianne - 3 months ago 17
R Question

Cumulative sum, with lag, by group, on date-stamped observations

I have this data set of batting data from the GameDay servers:

eliasID teamID gameID gameDate h hr bb so rbi ab runs t d lob sb cs sf hbp

1 430203 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 1 2 0 0 0 0 0 0 0 0

2 459714 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 1 0 0 1 0 3 1 0 0 1 0 0 0 0

3 325392 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 1 0 0 1 0 0 0 0 0 0 0 0

4 429801 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 0 1 0 3 0 0 0 2 0 0 0 0

5 456714 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 1 0 0 4 0 0 0 2 0 0 0 0

6 150449 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 0 1 1 4 0 0 0 2 0 0 0 0

ba ID gameDateFormat year Year
1 0.345 1 2010-04-01 2010 NA
2 0.250 2 2010-04-01 2010 NA
3 0.319 3 2010-04-01 2010 NA
4 0.327 4 2010-04-01 2010 NA
5 0.333 5 2010-04-01 2010 NA
6 0.217 6 2010-04-01 2010 NA


My issue is that I'd like to build a running total of at bats (ab) for each game, but total only those at bats from games with gameDate lower than the row's gameDate, and with games in the same gameYear.

I've look at the for loop and dplyr has been suggested, but these all want to sum all the ab column for one player, when I need an ongoing sum added to each game to show the player's ab total for the year so far at that game.

I'm attempting to build the equivalent of the kind of statistics you see on baseball-ref.com.

In English, I'm looking for:

For each EliasID, gameID in Batting:
sum(ab) for the EliasID where the gameDate < this row's gameDate and the gameYear = this row's gameYear

What do you think?

Answer

Welcome to R programming. Because you did not provide complete sample data (i.e. dput() rather than just a print(), this answer makes a couple assumptions:

  • your data.frame is called df. You can replace this name with the actual name.
  • your gameDate is an actual date vector, not just a string vector. If it is a string, change it to a date with df$gameDate <- as.Date(df$gameDate, format = "%m/%d/%Y")

It appears that what you want is a "cumulative sum with lag, by group." (I recommend that you make this your title to make it clear that this is what you want.) Let's look at both of those parts.

Cumulative sum, with lag

As suggested in this answer, an easy way to introduce a lag of 1 into cumsum() is to replace the vector x1, x2, ... xn with 0, x1, x2, ... xn-1. Thus:

cumsumLag1 <- function(x){
  cumsum(c(0, head(x, n = -1))) # see ?cumsum and ?head, particularly the note on negative n
}
# test it out on first 5 counting numbers
cumsumLag1(1:5) # returns: 0  1  3  6 10

Your dataset should be in the right chronological order for the cumulative function. So you could do something with ?order like:

df <- df[order(df$gameDate)]

but we will use arrange() in dplyr (see below) to keep things simple.

By group

There are many ways to do sum (and similar functions) by group. Perhaps the simplest syntax is %>% group_by(thing) in dplyr. You want to group by year, and perhaps other variables (maybe teamId or playerId). One really unclear part of your question is what you're trying to group by, so please just focus on the concept here. The first challenge is that you don't have a year variable, and there are lots of ways to do this. Let's just do something like this:

df$gameYear <- as.POSIXlt(df$gameDate)$year + 1900 # see ?POSIXlt for more details

Putting it together

Using the chain operator %>%, we just sequence what we've already reviewed.

library(dplyr)

cumsumLag1 <- function(x) cumsum(c(0, head(x, n = -1))) 

df %>%
    mutate(gameYear = as.POSIXlt(gameDate)$year + 1900) %>%
    arrange(gameDate) %>%
    group_by(gameYear) %>%
    mutate(priorAtBats = cumsumLag1(ab))
Comments