Rose - 6 months ago 37

R Question

I have a data frame that I want to run some statistical tests on. However, I want to group the data based on one of the columns first.

Here's an example data frame:

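| CATEGORY | ITEM | SHOP1 STOCK | SHOP2 STOCK |
|---|---|---|---|
| Fruit | Orange | 5 | 9 |
| Fruit | Apple | 12 | 32 |
| Fruit | Pear | 17 | 6 |
| Veg | Carrots | 59 | 72 |
| Veg | Potatoes | 6 | 57 |
| Veg | Courgette | 43 | 22 |
| Veg | Parsnips | 5 | 9 |
| ... | ... | ... | ... |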

For this example, I want to run a chi-squared test across categories rather than individual items, so I want to reduce the data to a table like this:

| | SHOP1 | SHOP2 |
|---|---|---|
| FRUIT | 34 | 47 |
| VEG | 113 | 160 |

The table shows the sum of the stock for each category in each shop and no longer specifies the item, just the category. This is a very simplified version: the data that I have runs to 37 categories over a few hundred rows.

So I thought I could `group_by(CATEGORY)`

Answer

We can use `dplyr` to summarise the data and the `tidy` function from the `broom` package to return the results of `chisq.test` in a data frame:

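```
library(broom)
library(dplyr)

# Sum every SHOP column within each CATEGORY, then run chisq.test on the
# resulting category-by-shop table and return the result as a data frame
df %>%
  group_by(CATEGORY) %>%
  summarise_at(vars(matches("SHOP")), sum) %>%
  do(tidy(chisq.test(.[, grep("SHOP", names(.))])))
```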

| | statistic | p.value | parameter | method |
|---|---|---|---|---|
| 1 | 2.566931e-30 | 1 | 1 | Pearson's Chi-squared test with Yates' continuity correction |
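To check the intermediate table before testing, here is a minimal base R sketch of the same idea. It swaps `dplyr` for `rowsum()`, so it is not the pipeline above. It assumes the seven rows shown in the question are the whole data set and shortens the column names to `SHOP1` and `SHOP2`; both are assumptions made for illustration only.

```r
# Example data: only the seven rows shown in the question (assumed complete),
# with the stock columns renamed to SHOP1 and SHOP2 for convenience
df <- data.frame(
  CATEGORY = c("Fruit", "Fruit", "Fruit", "Veg", "Veg", "Veg", "Veg"),
  ITEM     = c("Orange", "Apple", "Pear", "Carrots", "Potatoes",
               "Courgette", "Parsnips"),
  SHOP1    = c(5, 12, 17, 59, 6, 43, 5),
  SHOP2    = c(9, 32, 6, 72, 57, 22, 9)
)

# Sum the stock columns within each category; the result has one row
# per category and one column per shop
shop_cols <- grep("SHOP", names(df))
totals <- rowsum(df[, shop_cols], group = df$CATEGORY)
print(totals)

# Chi-squared test on the category-by-shop table
result <- chisq.test(totals)
print(result)
```

Printing `totals` first makes it easy to confirm that the sums match the reduced table in the question (34 and 47 for Fruit, 113 and 160 for Veg) before interpreting the test.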