J.Carter - 9 days ago
Perl Question

Range overlapping number and percentage calculation

I want to calculate the overlap count (#) and percentage (%) for a series of range values spread across four different files, where each line starts with a specific identifier (id) such as NP_111111.4. The initial list of ids is taken from file1.txt (the starting file), and whenever an id also appears in another file, the overlaps are calculated. Suppose my files look like this:

file1.txt

NP_111111.4: 1-9 12-20 30-41
YP_222222.2: 3-30 40-80


file2.txt

NP_111111.4: 1-6, 13-22, 31-35, 36-52
NP_414690.4: 360-367, 749-755
YP_222222.2: 19-24, 22-40


file3.txt

NP_418214.2: 1-133, 135-187, 195-272
YP_222222.2: 1-10


file4.txt

NP_418119.2
YP_222222.2 GO:0016878, GO:0051108
NP_111111.4 GO:0005887


From these input files, I want to create a .csv or Excel output with separate columns and the following header:

id overlap_file1_file2(#) overlap_file1_file2(%) overlap_file1_file3(#) overlap_file1_file3(%) overlap_file1_file2_file3(#) overlap_file1_file2_file3(%) Go_Terms(File4)


I am learning Perl and found the Perl module Number::Range for this type of range comparison. I am calculating the overlap count and percentage for two ranges like this:

#!/usr/bin/perl

use strictures;
use Number::Range;

my $seq1 = Number::Range->new("8..356");   # Start and stop for file1.txt
my $seq2 = Number::Range->new("156..267"); # Start and stop for file2.txt

my $overlap = 0;
my $sseq1 = $seq1->size;

# Count the positions of $seq2 that also fall inside $seq1
foreach my $int ($seq2->range) {
    $overlap++ if $seq1->inrange($int);
}

# Percentage of $seq1 covered by the overlap (computed after counting)
my $percent = ($overlap * 100) / $sseq1;

print "Total size= $sseq1 Number overlapped= $overlap Percentage overlap= $percent \n";


But I could not find a way to match the ids of file1.txt against the other files, extract the specific information, and print it to an output CSV file.

Please help. Thanks for your consideration.

Answer

This is a fragile solution in that it can only check 3 files for overlaps (file1 against file2 and file3). If more files are involved, the code would need to be restructured. It uses Set::IntSpan to calculate the overlaps (and the percentage of overlap, relative to the size of the file1 ranges).
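
To make the overlap arithmetic concrete, here is a minimal sketch that works through one id from the question (NP_111111.4, file1 against file2) in core Perl only. It is not the Set::IntSpan approach itself: the helpers expand_ranges and overlap_stats are hypothetical names introduced for illustration, and they expand each run list into a plain hash of integers, which is only practical for small ranges like these.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Expand a run list such as "1-9 12-20" or "1-6, 13-22" into a hash
# whose keys are the covered integers (a simple stand-in for a set).
sub expand_ranges {
    my ($list) = @_;
    my %members;
    for my $run (split /[,\s]+/, $list) {
        my ($start, $end) = $run =~ /^(\d+)-(\d+)$/ or next;
        $members{$_} = 1 for $start .. $end;
    }
    return \%members;
}

# Count the integers shared by two sets and express that count as a
# percentage of the first set's size (rounded to a whole number).
sub overlap_stats {
    my ($base, $other) = @_;
    my $count   = grep { exists $other->{$_} } keys %$base;
    my $size    = keys %$base;
    my $percent = $size ? sprintf("%.0f", 100 * $count / $size) : 0;
    return ($count, $percent);
}

my $file1 = expand_ranges('1-9 12-20 30-41');           # NP_111111.4 in file1.txt
my $file2 = expand_ranges('1-6, 13-22, 31-35, 36-52');  # NP_111111.4 in file2.txt

my ($count, $percent) = overlap_stats($file1, $file2);
print "overlap=$count percent=$percent\n";   # prints "overlap=25 percent=83"
```

The file1 ranges cover 30 positions, of which 25 (1-6, 13-20 and 31-41) also occur in file2, hence 83%. The script below performs the same calculation per id, with Set::IntSpan's `*` (intersection) and size doing the work of the two helpers.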

#!/usr/bin/perl
use strict;
use warnings;
use Set::IntSpan;
use autodie;

my @files = qw/file2 file3/;

my %data;
my %ids;
open my $fh1, '<', 'file1';

while (<$fh1>) {
    chomp;
    my ($id, $list) = split /:\s/;
    $ids{$id}++;
    $data{file1}{$id} = Set::IntSpan->new(split ' ', $list);
}
close $fh1;

for my $file (@files) {
    open my $fh, '<', $file;
    while (<$fh>) {
        chomp;
        my ($id, $list) = split /:\s/;
        next unless exists $ids{$id};

        $data{$file}{$id} = Set::IntSpan->new(split /,\s/, $list);
    }
    close $fh;
}

my %go_terms;
open my $go, '<', 'file4';

while (<$go>) {
    chomp;
    my ($id, $terms) = split ' ', $_, 2;
    $go_terms{$id} = ($terms // '') =~ tr/,//dr;   # ids without GO terms get an empty string
}
close $go;

my %output;

for my $file (@files) {
    for my $id (keys %ids) {
        my $count = ($data{file1}{$id} * $data{$file}{$id})->size;
        my $percent = sprintf "%.0f", 100 * $count / $data{file1}{$id}->size;

        $output{$id}{$file} = [$count, $percent];   
    }   
}

for my $id (keys %ids) {
    my $count = ($data{file1}{$id} * $data{$files[0]}{$id} * $data{$files[1]}{$id})->size;
    my $percent = sprintf "%.0f", 100 * $count / $data{file1}{$id}->size;

    $output{$id}{"file2 file3"} = [$count, $percent];
}

# output saved as f2.csv
print join(",", qw/ID   f1f2_overlap   f1f2_%overlap
                        f1f3_overlap   f1f3_%overlap
                      f1f2f3_overlap f1f2f3_%overlap Go_terms/), "\n";

for my $id (keys %output) {
    print "$id,";

    for my $file (@files) {
        my $aref = $output{$id}{$file};
        print join(",", @$aref), ",";   
    }
    my $aref = $output{$id}{"file2 file3"};
    print join(",", @$aref), ",";
    print +($go_terms{$id} // ''), "\n";
}
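
The three-file limit mentioned at the top comes from the hard-coded `$files[0]` and `$files[1]` in the three-way intersection. One possible restructuring, sketched here under assumptions, is to fold the intersection over however many comparison sets there are. The sketch uses core Perl with hash-based sets rather than Set::IntSpan so that it runs without extra modules; expand_ranges, intersect_all and the in-memory %ranges table are hypothetical stand-ins for the parsed file contents.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Expand a run list ("3-30 40-80" or "19-24, 22-40") into a hash of integers.
# An undefined list (id absent from a file) yields an empty set.
sub expand_ranges {
    my ($list) = @_;
    my %members;
    for my $run (split /[,\s]+/, $list // '') {
        my ($start, $end) = $run =~ /^(\d+)-(\d+)$/ or next;
        $members{$_} = 1 for $start .. $end;
    }
    return \%members;
}

# Intersect a base set with every set in @others, one after another.
# A missing (undef) set is treated as empty, so the result is empty too.
sub intersect_all {
    my ($base, @others) = @_;
    my %common = %$base;
    for my $other (@others) {
        $other //= {};
        delete @common{ grep { !exists $other->{$_} } keys %common };
    }
    return \%common;
}

# Hypothetical in-memory data: file name => id => run list.
my %ranges = (
    file1 => { 'YP_222222.2' => '3-30 40-80' },
    file2 => { 'YP_222222.2' => '19-24, 22-40' },
    file3 => { 'YP_222222.2' => '1-10' },
);

my $id    = 'YP_222222.2';
my $base  = expand_ranges($ranges{file1}{$id});
my @files = qw/file2 file3/;    # extend this list to compare more files

# One overlap count per comparison file, then one across all of them.
for my $file (@files) {
    my $n = keys %{ intersect_all($base, expand_ranges($ranges{$file}{$id})) };
    print "$id file1 vs $file: $n\n";
}
my $all = keys %{ intersect_all($base, map { expand_ranges($ranges{$_}{$id}) } @files) };
print "$id file1 vs all: $all\n";
```

For the sample data this reports 13 shared positions with file2, 8 with file3 and 0 across all three, because the file2 and file3 ranges for this id do not intersect. With Set::IntSpan objects the same fold could be written with the module's `intersect` method in place of the hash bookkeeping.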

The Excel sheet looks like this.

[Image: screenshot of the output opened in Excel]
