Ali Zia Ali Zia - 1 month ago 18
R Question

summarising and aggregating column values as rows in R

My data frame contains mostly catagorical columns and one numerical column, The df looks like this(simplified):

**Home_type** **Garden_type** **NaighbourhoOd** **Rent**
Vila big brooklyn 5000
Vila small bronx 7000
Condo shared Sillicon valley 2000
Appartment none brooklyn 500
Condo none bronx 1700
Appartment none Sillicon Valley 800


For every catagorical column, I want to display all its distinct values, frequency and sum of rent associated with it.

The result should look like this:

**Variable** **Distinct_values** **No_of-Occurences** **SUM_RENT**
Home_type Vila 2 12000
Home_type Condo 2 3700
Home_type Appartment 2 1300
Garden_type big 1 5000
Garden_type small 1 7000
Garden_type shared 1 2000
Garden_type none 3 3000
Naighbourhood brooklyn 2 5500
Naighbourhood Bronx 2 8700
Naighbourhood Sillicon Valley 2 2800


I'm new to R and have tried to do this using melt in reshape2 but havent had much success, any help will be much appreciated.

Answer

I tend to prefer tidyr to reshape2 of late, though that is mostly because the syntax is more similar to dplyr -- which will make this task even easier as well due to loading the magrittr pipe (%>%) and it's data summary tools.

First, we gather (from tidyr) all of the non-Rent columns into long form (run just those two lines to see the result). Then group_by the columns you want to cluster together. Finally, summarise within each group to get the metrics you want.

df %>%
  gather(Variable, Distinct_Values, -Rent) %>%
  group_by(Variable, Distinct_Values) %>%
  summarise(
    `No_of-Occurences` = n()
    , SUM_RENT = sum(Rent)
  )

gives:

        Variable Distinct_Values `No_of-Occurences` SUM_RENT
           <chr>           <chr>              <int>    <int>
1    Garden_type             big                  1     5000
2    Garden_type            none                  3     3000
3    Garden_type          shared                  1     2000
4    Garden_type           small                  1     7000
5      Home_type      Appartment                  2     1300
6      Home_type           Condo                  2     3700
7      Home_type            Vila                  2    12000
8  NaighbourhoOd           bronx                  2     8700
9  NaighbourhoOd        brooklyn                  2     5500
10 NaighbourhoOd Sillicon valley                  1     2000
11 NaighbourhoOd Sillicon Valley                  1      800

(Note, your data has "V" and "v" for "Silicon Valley" causing the two separate lines.)