Rose - 2 months ago
R Question

Sum variable by group then run function

I have a data frame that I want to run some statistical tests on. However, I want to group the data based on one of the columns first.

Here's an example data frame:

CATEGORY  ITEM       SHOP1 STOCK  SHOP2 STOCK
Fruit     Orange               5            9
Fruit     Apple               12           32
Fruit     Pear                17            6
Veg       Carrots             59           72
Veg       Potatoes             6           57
Veg       Courgette           43           22
Veg       Parsnips             5            9
...       ...                ...          ...


So for this example, I want to run a chi-squared test across categories rather than across individual items, which means I first need to reduce the data to a table like this:

       SHOP1  SHOP2
FRUIT     34     47
VEG      113    160


Where the table shows the sum of the stock for each category for each shop (this is a very simplified version - the data that I have runs to 37 categories over a few hundred rows), and no longer specifies the item, just the category.

So I thought I could use group_by(CATEGORY) and then run the chi-squared test on the grouped data, but that doesn't seem to work. I think I need to add up the two numeric columns within each category, but I don't know how to do that in conjunction with the chi-squared test. I've been faffing with this for some time now with no luck, so I'd really appreciate your help!

Answer

We can use dplyr to summarise the data and the tidy function from the broom package to return the results of chisq.test in a data frame:

library(broom)
library(dplyr)

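# Sum every SHOP column within each category, then pass only the
# summed SHOP columns to chisq.test() and tidy the result.
df %>% group_by(CATEGORY) %>%
  summarise_at(vars(matches("SHOP")), sum) %>%
  do(tidy(chisq.test(.[, grep("SHOP", names(.))])))

#      statistic p.value parameter                                                       method
# 1 2.566931e-30       1         1 Pearson's Chi-squared test with Yates' continuity correction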
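
For anyone without dplyr or broom installed, the same sum-then-test idea can be sketched in base R. This is only a sketch under assumptions: the column names SHOP1 and SHOP2 are simplified stand-ins for the stock columns in the question, and only the seven rows shown there are used.

```r
# Example data from the question (only the seven rows shown there).
# SHOP1 and SHOP2 are assumed, simplified column names.
df <- data.frame(
  CATEGORY = c("Fruit", "Fruit", "Fruit", "Veg", "Veg", "Veg", "Veg"),
  ITEM     = c("Orange", "Apple", "Pear", "Carrots", "Potatoes",
               "Courgette", "Parsnips"),
  SHOP1    = c(5, 12, 17, 59, 6, 43, 5),
  SHOP2    = c(9, 32, 6, 72, 57, 22, 9)
)

# Pick out the stock columns by name and convert them to a numeric matrix.
shop_cols <- grep("SHOP", names(df))
stock <- as.matrix(df[, shop_cols])

# rowsum() adds up the rows of the matrix within each category, giving
# one row per category and one column per shop.
totals <- rowsum(stock, group = df$CATEGORY)
totals

# Chi-squared test of independence on the summed contingency table.
result <- chisq.test(totals)
result
```

On these seven rows the summed table matches the one in the question (Fruit 34 and 47, Veg 113 and 160). Because that table is 2 x 2, chisq.test applies Yates' continuity correction by default; with more than two categories, as in the full data, no correction is applied.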