BioInfoNovice BioInfoNovice - 4 months ago 13
Linux Question

Using awk to print pairs of records having overlapping range of values between two columns

I have different records corresponding to ranges with start($6) and stop($7).
What I want to do is to print out all pairs of records having overlapping ranges.

For example, my data is as follow:

id1 0 376 . scaffold1 5165761 5166916
id2 0 366 . scaffold1 2297244 2298403
id3 155 456 . scaffold1 692777 693770
id4 185 403 . scaffold1 102245 729675


What I want is a result like

id3 id4


because the range of id4 is overlapping with id3.
I have been searching the solutions all over the internet but it seems there is nothing approaching to my problem.

I would really appreciate if some might give some advice.




After following the advice of some from the below replies, I did try this code which did work !

awk '{start[$1]=$6;stop[$1]=$7;} END {for(i in start) {for(j in stop) {if(start[i] >= start[j] && start[i] <= stop[j]) print i,j}}}' file | awk '{if($1!=$2) print}' -


The processing time was quite short...it was done after not even 1 minute for a file with 1400 records.

Answer
$ cat tst.awk
{
    beg[$1] = $6
    end[$1] = $7
    ids[++numIds] = $1
}
END {
    for (i=1; i<=numIds; i++) {
        idI = ids[i]
        for (j=1; j<=numIds; j++) {
            idJ = ids[j]
            if (idI != idJ) {
                if ( ( (beg[idI] >= beg[idJ]) && (beg[idI] <= end[idJ]) ) ||
                     ( (end[idI] >= beg[idJ]) && (end[idI] <= end[idJ]) ) ) {
                    if ( !seen[(idI<idJ ? idI FS idJ : idJ FS idI)]++ ) {
                        print idI, idJ
                    }
                }
            }
        }
    }
}

$ awk -f tst.awk file
id3 id4

The input file you provided in your question doesn't cover many cases so given this input file with a lot more overlap variants in it:

$ cat file
id1 185 403 . scaffold1 10  20
id2 185 403 . scaffold1 11  19
id3 185 403 . scaffold1  9  10
id4 185 403 . scaffold1 20  21
id5 185 403 . scaffold1  9  11
id6 185 403 . scaffold1 19  21
id7 185 403 . scaffold1 10  20
id8 185 403 . scaffold1  1   8

Try the above:

$ awk -f tst.awk file
id1 id3
id1 id4
id1 id5
id1 id6
id1 id7
id2 id1
id2 id5
id2 id6
id2 id7
id3 id5
id3 id7
id4 id6
id4 id7
id5 id7
id6 id7

vs the scripts + pipe you provided at the end of your answer:

$ awk '{start[$1]=$6;stop[$1]=$7;} END {for(i in start) {for(j in stop) {if(start[i] >= start[j] && start[i] <= stop[j]) print i,j}}}' file | awk '{if($1!=$2) print}' -
id3 id5
id4 id6
id4 id7
id4 id1
id5 id3
id6 id7
id6 id1
id6 id2
id7 id3
id7 id5
id7 id1
id1 id3
id1 id5
id1 id7
id2 id5
id2 id7
id2 id1

and notice that your scripts report the overlap between some (but not all) of the ids twice:

id1 id7
id7 id1
id3 id5
id5 id3

while my script only reports them once courtesy of !seen[(idI<idJ ? idI FS idJ : idJ FS idI)]++.

Comments