Haroon Rashid - 4 months ago 21

R Question

Consider I have four objects (

`a,b,c,d`

`df <- data.frame(a = c(1,2,1,2,1), b=c(1,2,2,1,1), c= c(2,1,2,2,2), d=c(1,2,1,2,1))`

In tabular format,

`---------`

a b c d

---------

1 1 2 1

2 2 1 2

1 2 2 1

2 1 2 2

1 1 2 1

----------

Now I want to calculate the percentage of times a group of objects were given the same label (either 1 or 2). For example, objects a, b and d were given the same label by 3 persons out of 5 persons. So its percentage is 3/5 (=60%). While as objects a and d were given same labels by all the people, so its percentage is 5/5 (=100%)

I can calculate this statistic manually, but in my original dataset, I have 50 such objects and the people are 30 and the labels are 4 (1,2,3, and 4). How can I compute such statistics for this bigger dataset automatically? Are there any existing packages/tools in

`R`

Note: A group can be of any size. In the first example, a group consists of a,b and d while as second example group consists of a and d.

Answer

There are two tasks here: firstly, making a list of all the relevant combinations, and secondly, evaluating and aggregating rowwise similarity. `combn`

can start the first task, but it takes a little massaging to arrange the results into a neat list. The second task could be handled with `prop.table`

, but here it's simpler to calculate directly.

Here I've used `tidyverse`

grammar (primarily `purrr`

, which is helpful for handling lists), but convert into base if you like.

```
library(tidyverse)
map(2:length(df), ~combn(names(df), .x, simplify = FALSE)) %>% # get combinations
flatten() %>% # eliminate nesting
set_names(map_chr(., paste0, collapse = '')) %>% # add useful names
# subset df with combination, see if each row has only one unique value
map(~apply(df[.x], 1, function(x){n_distinct(x) == 1})) %>%
map_dbl(~sum(.x) / length(.x)) # calculate TRUE proportion
## ab ac ad bc bd cd abc abd acd bcd abcd
## 0.6 0.2 1.0 0.2 0.6 0.2 0.0 0.6 0.2 0.0 0.0
```