user3138409 user3138409 - 1 month ago 12
R Question

r grep with or statment

I've been working on a r function to filter a large data frame of baseball team batting stats by game id, (i.e."2016/10/11/chnmlb-sfnmlb-1"), to create a list of past team matchups by season.

When I use some combinations of teams, output is correct, but others are not. (output contains a variety of ids)

I'm not real familiar with grep, and assume that is the problem. I patched my grep line and list output together by searching stack overflow and thought I had it till testing proved otherwise.

matchup.func <- function (home, away, df) {

matchups <- grep(paste('[0-9]{4}/[0-9]{2}/[0-9]{2}/[', home, '|', away, 'mlb]{6}-[', away, '|', home, 'mlb]{6}-[0-9]{1}', sep = ''), df$game.id, value = TRUE)

df <- df[df$game.id %in% matchups, c(1, 3:ncol(df))]

out <- list()
for (n in 1:length(unique(df$season))) {
for (s in unique(df$season)[n]) {
out[[s]] <- subset(df, season == s)
}
}
return(out)
}


sample of data frame:

bat.stats[sample(nrow(bat.stats), 3), ]
date game.id team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
1192 2016-04-11 2016/04/11/texmlb-seamlb-1 sea 2 5 away 38 7 14 3 0 0 7 2 27 8 11 15 0.226 0.303 0.336 0.639 0.286 R
764 2016-03-26 2016/03/26/wasmlb-slnmlb-1 sln 8 12 away 38 7 9 2 1 1 5 2 27 8 11 19 0.289 0.354 0.474 0.828 0.400 S
5705 2016-09-26 2016/09/26/oakmlb-anamlb-1 oak 67 89 home 29 2 6 1 0 1 2 2 27 13 4 12 0.260 0.322 0.404 0.726 0.429 R


sample of errant output:

matchup.func('tex', 'sea', bat.stats)
$S
date team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
21 2016-03-02 atl 1 0 home 32 4 7 0 0 2 3 2 27 19 2 11 0.203 0.222 0.406 0.628 1.000 S
22 2016-03-02 bal 0 1 away 40 11 14 3 2 2 11 10 27 13 4 28 0.316 0.415 0.532 0.947 0.000 S
47 2016-03-03 bal 0 2 home 41 10 17 7 0 2 10 0 27 9 3 13 0.329 0.354 0.519 0.873 0.000 S
48 2016-03-03 tba 1 1 away 33 3 5 0 1 0 3 2 24 10 8 13 0.186 0.213 0.343 0.556 0.500 S
141 2016-03-05 tba 2 2 home 35 6 6 2 0 0 5 3 27 11 5 15 0.199 0.266 0.318 0.584 0.500 S
142 2016-03-05 bal 0 5 away 41 10 17 5 1 0 10 4 27 9 10 13 0.331 0.371 0.497 0.868 0.000 S


sample of good:

matchup.func('bos', 'bal', bat.stats)
$S
date team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
143 2016-03-06 bal 0 6 home 34 8 14 4 0 0 8 5 27 5 8 22 0.284 0.330 0.420 0.750 0.000 S
144 2016-03-06 bos 3 2 away 38 7 10 3 0 0 7 7 24 7 13 25 0.209 0.285 0.322 0.607 0.600 S
209 2016-03-08 bos 4 3 home 37 1 12 1 1 0 1 4 27 15 8 26 0.222 0.292 0.320 0.612 0.571 S
210 2016-03-08 bal 0 8 away 36 5 12 5 0 1 4 4 27 9 4 27 0.283 0.345 0.429 0.774 0.000 S


On the good it gives a list of matchups as it should, (i.e. S, R, F, D), on the bad it outputs by season, but seems to only give matchups by date and not team. Not sure what to think.

Answer

I think that the issue is that regex inside [] behaves differently than you might expect. Specifically, it is looking for any matches to those characters, and in any order. Instead, you might try

matchups <- grep(paste0("(", home, "|", away, ")mlb-(", home, "|", away, ")mlb")
                 , df$game.id, value = TRUE)

That should give you either the home or the away team, followed by either the home or away team. Without more sample data though, I am not sure if this will catch edge cases.

You should also note that you don't have to match the entire string, so the date-finding regex at the beginning is likely superfluous.