Ashley Thomas - 4 months ago
R Question

Summarizing a dataframe by date and group

I am trying to summarize a data set by a few different factors. Below is an example of my data:

household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3")
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value<-c(1:9)
type<-c("income","water","energy","income","water","energy","income","water","energy")
df<-data.frame(household,date,value,type)

household date value type
1 household1 1999-05-10 100 income
2 household1 1999-05-25 200 water
3 household1 1999-10-12 300 energy
4 household2 1999-02-02 400 income
5 household2 1999-08-20 500 water
6 household2 1999-02-19 600 energy
7 household3 1999-07-01 700 income
8 household3 1999-10-13 800 water
9 household3 1999-01-01 900 energy


I want to summarize the data by month. Ideally, the resulting data set would have 12 rows per household (one for each month) and a column for each transaction type (water, energy, income) holding the sum of that type's values for the month.

I tried starting by adding a column with a short date, and then I was going to filter for each type and create a separate data frame for the summed data per transaction type. I was then going to merge those data frames together to have the summarized df. I attempted to summarize it using ddply, but it aggregated too much, and I can't keep the household level info.

ddply(df,.(shortdate),summarize,mean_value=mean(value))
shortdate mean_value
1 14/07 15.88235
2 14/09 5.00000
3 14/10 5.00000
4 14/11 21.81818
5 14/12 20.00000
6 15/01 10.00000
7 15/02 12.50000
8 15/04 5.00000


Any help would be much appreciated!

Answer

It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If a household has more than one value for a given expenditure type in the same year/month, cast sums those values; if there is only one value, the sum is just that value. The sum argument only matters when such duplicates exist: without an aggregation function, cast falls back to counting rows (length) for them. If every household/year/month/type combination appears at most once in your data, you can leave it out.

hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"), 9)
value <- 1:9
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)

# Add month and year columns (month() and year() come from lubridate)
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)

# Load reshape library
library(reshape)

# Run cast from reshape to build the pivot table, summing values within each cell
dfNew <- cast(df, hh + year + month ~ type, value = "value", fun.aggregate = sum)

> dfNew
   hh year month energy income water
1 hh1 1999     4      3      0     0
2 hh1 1999    10      0      1     0
3 hh1 1999    11      0      0     2
4 hh2 1999     2      0      4     0
5 hh2 1999     3      6      0     0
6 hh2 1999     6      0      0     5
7 hh3 1999     1      9      0     0
8 hh3 1999     4      0      7     0
9 hh3 1999     8      0      0     8
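
Note that the table above only contains months that actually occur in the data. If you really want 12 rows per household, one way is to make month a factor with all 12 levels before tabulating. Below is a minimal sketch in base R rather than reshape (so it is a different technique from the cast call above). The data is made up, with fixed dates so the result is reproducible, and hh1 has two water entries in May to show that duplicates get summed. The names long and wide are just placeholders.

```r
# Hypothetical data with fixed dates (not the asker's real data)
df <- data.frame(
  hh    = c("hh1", "hh1", "hh1", "hh1", "hh2", "hh2", "hh2"),
  date  = as.Date(c("1999-05-10", "1999-05-25", "1999-05-28", "1999-10-12",
                    "1999-02-02", "1999-08-20", "1999-02-19")),
  value = c(100, 200, 50, 300, 400, 500, 600),
  type  = c("income", "water", "water", "energy", "income", "water", "energy")
)

# Month as a factor with all 12 levels, so months without data are kept
df$month <- factor(as.integer(format(df$date, "%m")), levels = 1:12)
df$year  <- as.integer(format(df$date, "%Y"))

# xtabs sums value within every hh/year/month/type cell; empty cells become 0
long <- as.data.frame(xtabs(value ~ hh + year + month + type, data = df))

# Spread the types into columns: one row per hh/year/month
wide <- reshape(long, idvar = c("hh", "year", "month"),
                timevar = "type", direction = "wide")
names(wide) <- sub("^Freq\\.", "", names(wide))  # Freq.water -> water, etc.

# Sort by household, then year, then month, and reset the row names
wide <- wide[order(wide$hh, wide$year, wide$month), ]
rownames(wide) <- NULL
```

Here wide has 12 rows per household for each year present in the data, with zeros in the months where nothing was recorded. If your data spans several years and a household is missing a whole year, that year will still show up for it with 12 zero rows, which may or may not be what you want.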