Melanie Julia - 2 months ago
R Question

R percentage of overlaps

I have about 25 data tables. I want to find the entries in the first column that overlap between some of the tables and extract them. I also want to know how many overlaps there are and what percentage they represent. The output should be a table. Here's an example:

Table1:

Gen Estimate Std. Error p-Value
1007_s_at -0.159699 0.07834 0.04265
1053_at -0.174647 0.064535 0.0098976
121_at 0.1765678 0.05116854 0.0000657


Table2:

Gen Estimate Std. Error p-Value
1494_f_at 0.2222467 0.0553653 0.0075838
121_at 0.873683 0.00898737 0.0088378
1316_at 0.098764 0.098456 0.048899
1007_s_at 0.89723 0.5675389 0.00007865


Table3:

Gen Estimate Std. Error p-Value
1007_s_at 0.0864567 0.8931278 0.005542
121_at 0.2378590 0.0236586 0.00005667
1494_f_at 0.4597023 0.9875357 0.0091234


The result should be:

Gen
1007_s_at
121_at

Overlapping rate: 20%


I tried the foverlaps function, but it didn't work.

I hope someone could help. Thanks!

Update:

This will be my list after merging the first column of all the tables (it will be very long, about 200,000 rows with a mix of 46,000 different genes, so this is just a short example):

gene A
gene B
gene C
gene D
gene A
gene E
gene F
gene A
gene C
gene A
gene B
gene D
gene A
gene E
gene B
gene A
gene C


So we have gene A 6 times, gene B 3 times, gene C 3 times, gene D 2 times, gene E 2 times and gene F only once. In total we have 17 entries. That makes 35% for gene A, 18% each for gene B and gene C, 12% each for gene D and gene E, and 6% for gene F. That is what I am looking for. I don't think it should be that difficult.

Answer

You can use the duplicated() function for this.

But first you need to merge the strings from the first columns into one vector, which you can do simply with the c() function. If your tables are already in one list or one data frame, it is easier. You could also use a loop so that you don't have to write so much; how exactly depends on the names of your objects. A minimal working example would help here.

merge.first <- c(table1[,1], table2[,1], table3[,1],.... )
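If the tables are held in a list, the merge can be sketched without typing every table name. This is only an illustration: the list name `tables` and the three small data frames are assumptions built from the example in the question, and `as.character()` is there as a guard in case the first column was read in as a factor.

```r
# Hypothetical list of tables, using the gene IDs from the question's example
tables <- list(
  table1 = data.frame(Gen = c("1007_s_at", "1053_at", "121_at"),
                      stringsAsFactors = FALSE),
  table2 = data.frame(Gen = c("1494_f_at", "121_at", "1316_at", "1007_s_at"),
                      stringsAsFactors = FALSE),
  table3 = data.frame(Gen = c("1007_s_at", "121_at", "1494_f_at"),
                      stringsAsFactors = FALSE)
)

# Take the first column of every table and combine them into one character vector
merge.first <- unlist(
  lapply(tables, function(tab) as.character(tab[[1]])),
  use.names = FALSE
)

length(merge.first)  # 10 rows in total
```

With about 25 tables, the same `lapply()` call works unchanged once they are collected in one list.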

Then you search for duplicates:

position.dup <- duplicated(merge.first)

To list each duplicated name only once, in case a gene occurs more than twice:

names(table(merge.first[position.dup]))

To count the duplicates, use the sum() function:

sum(position.dup)

As for the percentage, I don't understand how you want it calculated. In your example there are two overlaps in ten rows, which makes 20% and not 28%. So unfortunately I don't know what you need.
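One possible reading of the example, sketched here as an assumption rather than as what the question definitely asks for: the expected result lists only the genes that occur in all three tables. That is a different technique from duplicated(), which flags every gene seen in at least two tables. The vector names below are made up for illustration; the gene IDs come from the question.

```r
# First columns of the three example tables
t1 <- c("1007_s_at", "1053_at", "121_at")
t2 <- c("1494_f_at", "121_at", "1316_at", "1007_s_at")
t3 <- c("1007_s_at", "121_at", "1494_f_at")
gene.lists <- list(t1, t2, t3)

# Genes present in every table: successive pairwise intersection
in.all <- Reduce(intersect, gene.lists)
in.all  # "1007_s_at" "121_at"

# Genes present in at least two tables: the duplicated() approach
merged <- unlist(gene.lists)
in.two.or.more <- unique(merged[duplicated(merged)])

# Overlap rate as genes shared by all tables over the total number of rows
overlap.rate <- 100 * length(in.all) / length(merged)
overlap.rate  # 20
```

Note that `in.two.or.more` also contains 1494_f_at, because that gene appears in Table2 and Table3 but not in Table1.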

Edit: now I get the same result as you:

> merge.vector
 [1] "A" "B" "C" "D" "A" "E" "F" "A" "C" "A" "B"
[12] "D" "A" "E" "B" "A" "C"
> round((table(merge.vector) / length(merge.vector) ) * 100)
merge.vector
 A  B  C  D  E  F 
35 18 18 12 12  6 

This line does what you want:

round((table(merge.vector) / length(merge.vector) ) * 100)
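Since the question asks for a table as output, the counts and percentages could also be collected in a data frame. This is a minimal sketch; the column names `gene`, `count` and `percent` are chosen here for illustration.

```r
# The short example vector from the question's update
merge.vector <- c("A", "B", "C", "D", "A", "E", "F", "A", "C",
                  "A", "B", "D", "A", "E", "B", "A", "C")

# Count how often each gene occurs
counts <- table(merge.vector)

# One row per gene, with its count and its share of all rows in percent
result <- data.frame(
  gene    = names(counts),
  count   = as.vector(counts),
  percent = round(100 * as.vector(counts) / length(merge.vector), 1),
  stringsAsFactors = FALSE
)

# Most frequent genes first
result <- result[order(-result$count), ]
result
```

Rounding to one decimal place gives 35.3 for gene A and 5.9 for gene F; use `round(..., 0)` to match the whole-number output above.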