dward4 dward4 - 1 year ago 44
R Question

Creating new data frame, each element is a selected sum of elements in 2nd data frame where 2 column elements match, sum is from third column

I have a dataframe, test, that looks like

c1 c2 c3
1 98 0 2013-08
2 231 0 2011-01
3 231 2.68 2011-03
4 231 1 2011-01
... ... ... ...


That continues on for many more rows. Column c1 has values from 1-297, while c3 has year-month values that consecutively move from 2011-01 to 2015-01. There are multiple rows that have the same c1 and c3 values.

I want to sum up each instance of c1 at each time step (so for all rows where c1 = x and c3 = y, sum those elements and get a result) and output that to a new data frame where each row represents 1 of the types from c1 (1-297), and each column is the corresponding year-month.

I am attempting to use acast (based off a suggestion) to transform it to a data frame with the rows based off of c1 values, with columns from c3, so it looks like

2011-01 2011-02 2011-03 ...
1 0 1.5 2.3 ...
2 0 3.4 0 ...
3 5 2.2 1.1 ...
4 4 2.2 4.4 ...
... ... ... ...


I have been attempting to transform this via acast:

acast(test, test$c3 ~ test$c1, value.var = "c2")


But end up with a matrix/data frame of type int. The rows and columns are correct (1-297, 2011-01 - 2015-01), however the values inside the cells are wrong.

Again just to clarify, in the new data frame each element would represent the sum of elements in the first data frame for all elements that share the same c1 and c3 values.

I believe the issue is that acast sees matching combinations and does something that I don't want it to do. How would I solve this problem? I do not need acast if another solution presents itself.

Answer Source

You should use tidyverse packages dplyr and tidyr:

library(dplyr)
library(tidyr)

df <- test %>%
    group_by(c1, c3) %>%
    summarise(total = sum(c2)) %>%
    spread(c3, total)

Example

I used your simple data frame as an example:

#>   c1   c2      c3
#> 1  98 0.00 2013-08
#> 2 231 0.00 2011-01
#> 3 231 2.68 2011-03
#> 4 231 1.00 2011-01

And after running the code, df looks like this:

#>      c1 `2011-01` `2011-03` `2013-08`
#> 1    98        NA        NA         0
#> 2   231         1      2.68        NA

Explanation

  • group_by(c1, c3) groups the variables c1 and c3 in your data frame
  • summarise(total = sum(c2)) sums up c2 (taking into account the c1, c3 groupings)
  • spread(c3, total) transforms the data frame into a "wide" format with the c3 variables going across the columns
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download