geoffjentry geoffjentry - 2 days ago 4
R Question

Finding overlap in ranges with R

I have two data.frames each with three columns: chrom, start & stop, let's call them rangesA and rangesB. For each row of rangesA, I'm looking to find which (if any) row in rangesB fully contains the rangesA row - by which I mean

rangesAChrom == rangesBChrom, rangesAStart >= rangesBStart and rangesAStop <= rangesBStop
.

Right now I'm doing the following, which I just don't like very much. Note that I'm looping over the rows of rangesA for other reasons, but none of those reasons are likely to be a big deal, it just ends up making things more readable given this particular solution.

rangesA:

chrom start stop
5 100 105
1 200 250
9 275 300


rangesB:

chrom start stop
1 200 265
5 99 106
9 275 290


for each row in rangesA:

matches <- which((rangesB[,'chrom'] == rangesA[row,'chrom']) &&
(rangesB[,'start'] <= rangesA[row, 'start']) &&
(rangesB[,'stop'] >= rangesA[row, 'stop']))


I figure there's got to be a better (and by better, I mean faster over large instances of rangesA and rangesB) way to do this than looping over this construct. Any ideas?

Answer

This would be a lot easier / faster if you can merge the two objects first.

ranges <- merge(rangesA,rangesB,by="chrom",suffixes=c("A","B"))
ranges[with(ranges, startB <= startA & stopB >= stopA),]
#  chrom startA stopA startB stopB
#1     1    200   250    200   265
#2     5    100   105     99   106
Comments