Melanie Julia - 1 year ago 62
R Question

# R percentage of overlaps

I have about 25 data tables. I want to find the entries in the first column that overlap between some of the tables and extract them. I also want to know how many overlaps there are and what percentage they make up. The output should be a table. Here's an example:

Table1:

``````Gen          Estimate    Std. Error    p-Value
1007_s_at    -0.159699   0.07834       0.04265
1053_at      -0.174647   0.064535      0.0098976
121_at       0.1765678   0.05116854    0.0000657
``````

Table2:

``````Gen        Estimate     Std. Error   p-Value
1494_f_at  0.2222467    0.0553653    0.0075838
121_at     0.873683     0.00898737   0.0088378
1316_at    0.098764     0.098456     0.048899
1007_s_at  0.89723      0.5675389    0.00007865
``````

Table3:

``````Gen        Estimate     Std. Error   p-Value
1007_s_at  0.0864567    0.8931278    0.005542
121_at     0.2378590    0.0236586    0.00005667
1494_f_at  0.4597023    0.9875357    0.0091234
``````

The result should be:

``````Gen
1007_s_at
121_at

Overlapping rate: 20%
``````

I tried the foverlaps function, but it didn't work.

I hope someone could help. Thanks!

Update:

This will be my list after merging the first column of all the tables (it will be very long, about 200,000 rows with a mix of 46,000 different genes, so this is just a short example):

``````gene A
gene B
gene C
gene D
gene A
gene E
gene F
gene A
gene C
gene A
gene B
gene D
gene A
gene E
gene B
gene A
gene C
``````

So we have gene A 6 times, gene B 3 times, gene C 3 times, gene D 2 times, gene E 2 times and gene F only once. In total we have 17 entries. That makes 35% for gene A, 18% each for gene B and gene C, 12% each for gene D and gene E, and 6% for gene F. That is what I am looking for. Maybe it isn't that difficult.

You can use the `duplicated()` function for this.

First, though, you need to merge the strings from the first column of every table into one vector, which you can do with the `c()` function. If your tables are already in one list or one data frame, this is easier. You could also use a loop so that you don't have to type as much; how to write it depends on the names of your objects. A minimal working example would help here.

``````merge.first <- c(table1[,1], table2[,1], table3[,1],.... )
``````
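If the tables are already collected in a list, the first columns can be combined without typing every table name. Here is a minimal sketch, assuming the tables are plain data frames stored in a list that I call `tables` (a hypothetical name); the values are taken from the example in the question, with the other columns left out:

```r
# Hypothetical list holding the example tables from the question
# (only the first column matters here, so the other columns are omitted)
tables <- list(
  table1 = data.frame(Gen = c("1007_s_at", "1053_at", "121_at"),
                      stringsAsFactors = FALSE),
  table2 = data.frame(Gen = c("1494_f_at", "121_at", "1316_at", "1007_s_at"),
                      stringsAsFactors = FALSE),
  table3 = data.frame(Gen = c("1007_s_at", "121_at", "1494_f_at"),
                      stringsAsFactors = FALSE)
)

# Take the first column of each table as character, then flatten to one vector
first.cols <- lapply(tables, function(tab) as.character(tab[[1]]))
merge.first <- unlist(first.cols, use.names = FALSE)

length(merge.first)  # total number of rows across all tables
```

The `as.character()` call is a precaution in case the first column was read in as a factor.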

Then search for duplicates:

``````position.dup <- duplicated(merge.first)
``````

If a value occurs more than twice, it appears several times among the duplicates. To list each duplicated value only once:

``````names(table(merge.first[position.dup]))
``````

To count the duplicates, use the `sum()` function:

``````sum(position.dup)
``````

As for the percentage, I don't understand what you mean. In your example you have two overlaps in ten rows, which makes 20% and not 28%. So unfortunately I don't know what you need.
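For the reading "two overlaps in ten rows", one possible sketch is to intersect the first columns with `Reduce()` and `intersect()` instead of using `duplicated()`. This is a different technique from the one above: `duplicated()` also flags `1494_f_at`, which appears in only two of the three tables, while the intersection keeps only genes present in every table. The vector names (`gen1`, `gen.list`, and so on) are made up for this example, and the rate definition (shared genes divided by total rows) is my assumption about what is meant:

```r
# First columns of the three example tables from the question
gen1 <- c("1007_s_at", "1053_at", "121_at")
gen2 <- c("1494_f_at", "121_at", "1316_at", "1007_s_at")
gen3 <- c("1007_s_at", "121_at", "1494_f_at")
gen.list <- list(gen1, gen2, gen3)

# Genes present in every table: intersect the vectors pairwise, left to right
overlap <- Reduce(intersect, gen.list)

# Assumed rate definition: number of shared genes / total number of rows
total.rows <- sum(lengths(gen.list))
rate <- 100 * length(overlap) / total.rows

# Output as a one-column table plus the rate
result <- data.frame(Gen = overlap)
print(result)
cat("Overlapping rate:", rate, "%\n")
```

With the example data this gives `1007_s_at` and `121_at` and a rate of 20.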

Edit: now I get the same result as you:

``````> merge.vector
[1] "A" "B" "C" "D" "A" "E" "F" "A" "C" "A" "B"
[12] "D" "A" "E" "B" "A" "C"
> round((table(merge.vector) / length(merge.vector) ) * 100)
merge.vector
A  B  C  D  E  F
35 18 18 12 12  6
``````

This line does what you want:

``````round((table(merge.vector) / length(merge.vector) ) * 100)
``````
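Since the output should be a table, the counts and percentages can also be collected in a data frame. This is only a sketch under my own assumptions: the column names `gene`, `count` and `percent` are invented for illustration, and `merge.vector` is the short example from the question's update:

```r
# Merged gene vector from the question's update (short example)
merge.vector <- c("A", "B", "C", "D", "A", "E", "F", "A", "C",
                  "A", "B", "D", "A", "E", "B", "A", "C")

# Count each gene and sort from most to least frequent
counts <- sort(table(merge.vector), decreasing = TRUE)

# One row per gene: name, count, and share of all rows in percent
freq <- data.frame(
  gene    = names(counts),
  count   = as.vector(counts),
  percent = round(100 * as.vector(counts) / length(merge.vector)),
  stringsAsFactors = FALSE
)
print(freq)
```

For the real data with about 46,000 different genes, `head(freq)` would show only the most frequent ones.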