Haroon Rashid - 1 year ago 83
R Question

# Compute similarity percentage OR Compute correlation between more than 2 objects

Consider I have four objects (

`a,b,c,d`
), and I ask five persons to label them (category 1 or 2) according to their physical appearance or something else. The labels provided by five persons for these objects are shown as

``````df <- data.frame(a = c(1,2,1,2,1), b=c(1,2,2,1,1), c= c(2,1,2,2,2), d=c(1,2,1,2,1))
``````

In tabular format,

`````` ---------
a b c d
---------
1 1 2 1
2 2 1 2
1 2 2 1
2 1 2 2
1 1 2 1
----------
``````

Now I want to calculate the percentage of times a group of objects were given the same label (either 1 or 2). For example, objects a, b and d were given the same label by 3 persons out of 5 persons. So its percentage is 3/5 (=60%). While as objects a and d were given same labels by all the people, so its percentage is 5/5 (=100%)

I can calculate this statistic manually, but in my original dataset, I have 50 such objects and the people are 30 and the labels are 4 (1,2,3, and 4). How can I compute such statistics for this bigger dataset automatically? Are there any existing packages/tools in
`R`
which can calculate such statistics?

Note: A group can be of any size. In the first example, a group consists of a,b and d while as second example group consists of a and d.

There are two tasks here: firstly, making a list of all the relevant combinations, and secondly, evaluating and aggregating rowwise similarity. `combn` can start the first task, but it takes a little massaging to arrange the results into a neat list. The second task could be handled with `prop.table`, but here it's simpler to calculate directly.

Here I've used `tidyverse` grammar (primarily `purrr`, which is helpful for handling lists), but convert into base if you like.

``````library(tidyverse)

map(2:length(df), ~combn(names(df), .x, simplify = FALSE)) %>%    # get combinations
flatten() %>%    # eliminate nesting
set_names(map_chr(., paste0, collapse = '')) %>%    # add useful names
# subset df with combination, see if each row has only one unique value
map(~apply(df[.x], 1, function(x){n_distinct(x) == 1})) %>%
map_dbl(~sum(.x) / length(.x))    # calculate TRUE proportion

##   ab   ac   ad   bc   bd   cd  abc  abd  acd  bcd abcd
##  0.6  0.2  1.0  0.2  0.6  0.2  0.0  0.6  0.2  0.0  0.0
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download