Haroon Rashid Haroon Rashid - 9 months ago 41
R Question

Compute similarity percentage OR Compute correlation between more than 2 objects

Consider I have four objects (

), and I ask five persons to label them (category 1 or 2) according to their physical appearance or something else. The labels provided by five persons for these objects are shown as

df <- data.frame(a = c(1,2,1,2,1), b=c(1,2,2,1,1), c= c(2,1,2,2,2), d=c(1,2,1,2,1))

In tabular format,

a b c d
1 1 2 1
2 2 1 2
1 2 2 1
2 1 2 2
1 1 2 1

Now I want to calculate the percentage of times a group of objects were given the same label (either 1 or 2). For example, objects a, b and d were given the same label by 3 persons out of 5 persons. So its percentage is 3/5 (=60%). While as objects a and d were given same labels by all the people, so its percentage is 5/5 (=100%)

I can calculate this statistic manually, but in my original dataset, I have 50 such objects and the people are 30 and the labels are 4 (1,2,3, and 4). How can I compute such statistics for this bigger dataset automatically? Are there any existing packages/tools in
which can calculate such statistics?

Note: A group can be of any size. In the first example, a group consists of a,b and d while as second example group consists of a and d.

Answer Source

There are two tasks here: firstly, making a list of all the relevant combinations, and secondly, evaluating and aggregating rowwise similarity. combn can start the first task, but it takes a little massaging to arrange the results into a neat list. The second task could be handled with prop.table, but here it's simpler to calculate directly.

Here I've used tidyverse grammar (primarily purrr, which is helpful for handling lists), but convert into base if you like.


map(2:length(df), ~combn(names(df), .x, simplify = FALSE)) %>%    # get combinations
    flatten() %>%    # eliminate nesting
    set_names(map_chr(., paste0, collapse = '')) %>%    # add useful names
    # subset df with combination, see if each row has only one unique value
    map(~apply(df[.x], 1, function(x){n_distinct(x) == 1})) %>% 
    map_dbl(~sum(.x) / length(.x))    # calculate TRUE proportion

##   ab   ac   ad   bc   bd   cd  abc  abd  acd  bcd abcd 
##  0.6  0.2  1.0  0.2  0.6  0.2  0.0  0.6  0.2  0.0  0.0