Joe Joe - 27 days ago 17
R Question

Using conditions in group_by()/summarize() loop

I have a dataframe that looks something like this (I have a lot more years and variables):

Name State2014 State2015 State2016 Tuition2014 Tuition2015 Tuition2016 StateGrants2014
Jared CA CA MA 22430 23060 40650 5000
Beth CA CA CA 36400 37050 37180 4200
Steven MA MA MA 18010 18250 18720 NA
Lary MA CA MA 24080 30800 24600 6600
Tom MA OR OR 40450 15800 16040 NA
Alfred OR OR OR 23570 23680 23750 3500
Cathy OR OR OR 32070 32070 33040 4700


My objective (in this example) is to get the mean tuition for each state, and the sum of state grants for each state. My thought was to subset the data by year:

State2014 Tuition2014 StateGrants2014
CA 22430 5000
CA 36400 4200
MA 18010 NA
MA 24080 6600
MA 40450 NA
OR 23570 3500
OR 32070 4700

State2015 Tuition2015
CA 23060
CA 37050
MA 18250
CA 30800
OR 15800
OR 23680
OR 32070

State2016 Tuition2016
MA 40650
CA 37180
MA 18720
MA 24600
OR 16040
OR 23750
OR 33040


Then I would group_by state and summarize (and save each as a separate df) to get the following:

State2014 Tuition2014 StateGrants2014
CA 29415 9200
MA 27513 6600
OR 27820 8200

State2015 Tuition2015
CA 30303
MA 18250
OR 23850

State2016 Tuition2016
CA 37180
MA 27990
OR 24277


Then I would merge them by state. Here is my code:

years = c(2014,2015,2016)
for (i in seq_along(years)) {
  # Grab the variables from a certain year and save them as a new df.
  df_year <- df[, grep(paste(years[[i]], "$", sep = ""), colnames(df))]

  # Take the year off each variable name (to make it easier to summarize)
  names(df_year) <- gsub(years[[i]], "", names(df_year), fixed = TRUE)

  df_year <- df_year %>%
    group_by(State) %>%
    summarize(Tuition = mean(Tuition, na.rm = TRUE),
              # This part of the code does not work. In this example, I only want to have this part if the year is 2016.
              if (years[[i]]=='2016')
                {StateGrants = mean(StateGrants, na.rm = TRUE)})

  # rename df_year to df####
  assign(paste("df", years[[i]], sep = ""), df_year)
}


I have about 50 years of data and a good number of variables, so I wanted to use a loop. So my question is: how do I add a conditional statement (summarize certain variables conditioned on the year) in the group_by()/summarize() call? Thanks!

*Edit: I realize that I could take the if{} out of the function, and do something like:

if (years[[i]] == 2016) {
  df_year <- df_year %>%
    group_by(State) %>%
    summarize(Tuition = mean(Tuition, na.rm = TRUE),
              StateGrants = mean(StateGrants, na.rm = TRUE))

  # rename df_year to df####
  assign(paste("df", years[[i]], sep = ""), df_year)
} else {
  df_year <- df_year %>%
    group_by(State) %>%
    summarize(Tuition = mean(Tuition, na.rm = TRUE))

  # rename df_year to df####
  assign(paste("df", years[[i]], sep = ""), df_year)
}


but there are just so many combinations of variables, that using a for loop would not be very efficient or useful.

Answer Source

This is so much easier with tidy data, so let me show you how to tidy up your data. See http://r4ds.had.co.nz/tidy-data.html.
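To run the code that follows, the example data from the question can be recreated roughly like this. This is only a sketch: it uses base R's read.table() on the table as posted, and the name check at the end is an extra sanity step that is not part of the question.

```r
# Recreate the example data frame from the question (values copied as posted)
df <- read.table(text = "
Name State2014 State2015 State2016 Tuition2014 Tuition2015 Tuition2016 StateGrants2014
Jared CA CA MA 22430 23060 40650 5000
Beth CA CA CA 36400 37050 37180 4200
Steven MA MA MA 18010 18250 18720 NA
Lary MA CA MA 24080 30800 24600 6600
Tom MA OR OR 40450 15800 16040 NA
Alfred OR OR OR 23570 23680 23750 3500
Cathy OR OR OR 32070 32070 33040 4700
", header = TRUE, stringsAsFactors = FALSE)

# The tidying step relies on every non-Name column ending in a four-digit year
value_cols <- setdiff(names(df), "Name")
all(grepl("[0-9]{4}$", value_cols))

# The variable names that remain once the year is stripped off
unique(sub("[0-9]{4}$", "", value_cols))
```

If the check returns FALSE for your real data, the column names need cleaning before the year can be split off by position.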

library(tidyr)
library(dplyr)

df <- gather(df, key, value, -Name) %>% 
  # separate years from the variables
  separate(key, c("var", "year"), sep = -5) %>% 
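  # the above line splits up e.g. State2014 into State and 2014.
  # It does so by splitting at the fifth element from the end of the
  # entry. Please check that this works for your other variables
  # in case your naming conventions are inconsistent. Also check the
  # result on your tidyr version: newer releases count negative
  # positions differently, so the offset that isolates a four-digit
  # year may be -4 there.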
  spread(var, value) %>% 
  # turn numbers back to numeric
  # (columns passed positionally; newer dplyr releases call this argument .vars)
  mutate_at(c("Tuition", "StateGrants"), as.numeric) %>% 
  gather(var, val, -Name, -year, -State) %>% 
  # group by the variables of interest. Note that `var` here 
  # refers to Tuition and StateGrants. If you have more variables,
  # they will be included here as well. If you want to exclude more
  # variables from being included here in `var`, add more "-colName" 
  # entries in the `gather` statement above
  group_by(year, State, var) %>% 
  # summarize:
  summarise(mean_values = mean(val))

This gives you:

Source: local data frame [18 x 4]
Groups: year, State [?]
    year State         var mean_values
   <chr> <chr>       <chr>       <dbl>
1   2014    CA StateGrants     4600.00
2   2014    CA     Tuition    29415.00
3   2014    MA StateGrants          NA
4   2014    MA     Tuition    27513.33
5   2014    OR StateGrants     4100.00
6   2014    OR     Tuition    27820.00
7   2015    CA StateGrants          NA
8   2015    CA     Tuition    30303.33
9   2015    MA StateGrants          NA
10  2015    MA     Tuition    18250.00
11  2015    OR StateGrants          NA
12  2015    OR     Tuition    23850.00
13  2016    CA StateGrants          NA
14  2016    CA     Tuition    37180.00
15  2016    MA StateGrants          NA
16  2016    MA     Tuition    27990.00
17  2016    OR StateGrants          NA
18  2016    OR     Tuition    24276.67

If you don't like the shape of this, you can, for example, append %>% spread(var, mean_values) after the summarise statement to get the means for Tuition and StateGrants in separate columns.
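As a small illustration of that reshaping step, here is a sketch that applies spread() to a hand-typed copy of the 2014 rows from the output above. The object names means_long and means_wide are made up for this example.

```r
library(tidyr)

# Hand-typed copy of the 2014 rows of the summary shown above
means_long <- data.frame(
  year = "2014",
  State = rep(c("CA", "MA", "OR"), each = 2),
  var = rep(c("StateGrants", "Tuition"), times = 3),
  mean_values = c(4600, 29415, NA, 27513.33, 4100, 27820),
  stringsAsFactors = FALSE
)

# One row per year/State, one column per summarised variable
means_wide <- spread(means_long, var, mean_values)
print(means_wide)
```

The result has one row per state, with StateGrants and Tuition as separate columns.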

If you want to compute different functions for Tuition and Grants (e.g. the mean of Tuition and the sum of grants), you could do the following:

df <- gather(df, key, value, -Name) %>% 
   separate(key, c("var", "year"), sep = -5) %>% 
   spread(var, value) %>% 
   # (columns passed positionally; newer dplyr releases call this argument .vars)
   mutate_at(c("Tuition", "StateGrants"), as.numeric) %>% 
   group_by(year, State) %>% 
   summarise(Grant_Sum = sum(StateGrants, na.rm=T), Tuition_Mean = mean(Tuition) )

This gives you:

Source: local data frame [9 x 4]
Groups: year [?]

   year State Grant_Sum Tuition_Mean
  <chr> <chr>     <dbl>        <dbl>
1  2014    CA      9200     29415.00
2  2014    MA      6600     27513.33
3  2014    OR      8200     27820.00
4  2015    CA         0     30303.33
5  2015    MA         0     18250.00
6  2015    OR         0     23850.00
7  2016    CA         0     37180.00
8  2016    MA         0     27990.00
9  2016    OR         0     24276.67

Note that I used sum here, with na.rm = T, which returns 0 if all elements are NAs. Make sure this makes sense in your use case.
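The behaviour is easy to check in isolation. The sketch below also shows one possible alternative for cases where a 0 would be misleading; sum_or_na is a hypothetical helper written for this example, not a dplyr or base R function.

```r
# sum() over nothing but NAs collapses to 0 once the NAs are removed
all_missing <- c(NA_real_, NA_real_)
sum(all_missing, na.rm = TRUE)

# Hypothetical helper: keep NA when no value was observed at all,
# otherwise sum the values that are present
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

sum_or_na(all_missing)
sum_or_na(c(3500, 4700, NA))
```

Such a helper could be used inside summarise() in place of sum(StateGrants, na.rm = T) if years without any grant data should show NA rather than 0.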

Also, just to mention it, to get your individual data.frames that you asked for, you can use filter(year == 2014) etc, as in df_2014 <- filter(df, year == 2014).
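With about 50 years of data, writing one filter() call per year gets repetitive. One possible alternative, sketched here with base R's split(), keeps all per-year data frames in a single named list. The summary_df object is a stand-in typed from the output above, not the full result.

```r
# Stand-in for the summarised result (values copied from the output above)
summary_df <- data.frame(
  year = rep(c("2014", "2015", "2016"), each = 3),
  State = rep(c("CA", "MA", "OR"), times = 3),
  Grant_Sum = c(9200, 6600, 8200, 0, 0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# One data frame per year, collected in a list named by year
by_year <- split(summary_df, summary_df$year)

names(by_year)
by_year[["2014"]]
```

A list like this avoids creating dozens of separate objects with assign(), and each year can still be pulled out by name.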