Badgerliu - 11 days ago
R Question

How to count the shared values between every pair of columns in a data frame

I'm quite new to R, so please forgive me; I'm not even sure how to phrase this question. The goal is to figure out which two or three factors share the most values.
I have a data frame like this:

mydata<-read.table(header=TRUE, text="
A B C D
peak_1 peak_1 0 0
peak_2 0 0 peak_2
0 0 peak_3 peak_3
peak_4 0 0 peak_4
peak_6 0 0 0
peak_7 0 peak_7 0
peak_8 peak_8 peak_8 peak_8")


A, B, C and D are four factors. Hopefully this table displays correctly in your R session.
I want to count the values (excluding 0) shared between every pair of columns. I expect the results to look like this:

myresuts<-read.table(header=TRUE, text = "
factor_1 factor_2 number_of_shared
A B 2
A C 2
A D 3
B C 1
B D 1
C D 2")


For this small table I can do the intersections by hand, but my real table has more than 100 columns. How can I write a function to solve this problem?
I would also like to count the values shared by every set of three columns, hopefully in the same way.

Thanks!

Answer

Your desired results suggest that you don't want to count zero values in the comparison. I handle this by converting zeros to NA first. I also convert each column to character, because columns read in as factors with different sets of levels cannot be compared directly with ==.

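# Replace zeros with NA so they are never counted as shared,
# and convert each column to character so any two columns can be compared
mydata <- lapply(mydata,
                 function(x) {
                    x[x==0] <- NA
                    as.character(x)
                 })

# For each pair of columns, count the rows where both hold the same non-NA value
cc <- combn(names(mydata),2,
      FUN=function(x) {
         data.frame(matrix(x,nrow=1),
                    val=sum(mydata[[x[1]]]==mydata[[x[2]]],na.rm=TRUE))
      },
      simplify=FALSE)

# Stack the per-pair results into a single data frame
do.call(rbind,cc)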
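
Note that the comparison with == is row-wise: a value counts as shared only when it sits in the same row of both columns, which is the case in the sample data. If shared values could appear in different rows, a set-based variant using intersect() may be closer to what is wanted. This is only a minimal sketch, assuming the data from the question is read as character columns; the names `dat` and `shared_pairs` are made up for illustration.

```r
# Sample data from the question, with every column read as character
dat <- read.table(header = TRUE, colClasses = "character", text = "
A B C D
peak_1 peak_1 0 0
peak_2 0 0 peak_2
0 0 peak_3 peak_3
peak_4 0 0 peak_4
peak_6 0 0 0
peak_7 0 peak_7 0
peak_8 peak_8 peak_8 peak_8")

# For every pair of columns, count the distinct non-zero values found in both
shared_pairs <- do.call(rbind, combn(names(dat), 2, FUN = function(p) {
  a <- setdiff(dat[[p[1]]], "0")   # distinct values of the first column, minus "0"
  b <- setdiff(dat[[p[2]]], "0")   # distinct values of the second column, minus "0"
  data.frame(factor_1 = p[1], factor_2 = p[2],
             number_of_shared = length(intersect(a, b)))
}, simplify = FALSE))

shared_pairs
```

For the sample data the counts come out as 2, 2, 3, 1, 1, 2, matching the expected table in the question. Unlike the row-wise version, this counts each distinct value once, however often it is repeated within a column.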

The row-wise comparison with combn() should also work for three columns if you pass 3 instead of 2 and change the condition in the function accordingly.
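
One possible way to change the condition for three columns is to require that the first column matches both of the others in the same row. The sketch below assumes that reading of "shared" and starts again from the question's data; the names `dat3` and `cc3` are illustrative only.

```r
# Sample data from the question, with every column read as character
dat3 <- read.table(header = TRUE, colClasses = "character", text = "
A B C D
peak_1 peak_1 0 0
peak_2 0 0 peak_2
0 0 peak_3 peak_3
peak_4 0 0 peak_4
peak_6 0 0 0
peak_7 0 peak_7 0
peak_8 peak_8 peak_8 peak_8")

# Treat "0" as missing so it is never counted as a shared value
dat3[dat3 == "0"] <- NA

# For every set of three columns, count the rows where all three values agree
cc3 <- combn(names(dat3), 3, FUN = function(x) {
  same <- dat3[[x[1]]] == dat3[[x[2]]] & dat3[[x[1]]] == dat3[[x[3]]]
  data.frame(factor_1 = x[1], factor_2 = x[2], factor_3 = x[3],
             number_of_shared = sum(same, na.rm = TRUE))
}, simplify = FALSE)

do.call(rbind, cc3)
```

With the sample data only peak_8 appears in every column, so each of the four triples gets a count of 1. With more than 100 columns the number of triples grows quickly (choose(100, 3) is 161700), so the run time will be noticeably longer than for pairs.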