J.Carter J.Carter - 1 month ago 12
Perl Question

Multi-column file comparison and range extraction

Pardon me for asking a question without any coding effort. But it seems too much difficult to me.

I have a data file with tab separated three data columns (and some repetitive header lines) as:

Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score

0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score

0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr


From the second data column, for those value(s) which are more than 0.5, I want to extract the corresponding first column number (or range).

For the above Input, the output would be:

NP_416485.4: 1, 3-5
YP_986467.7: 2-4


Here, "NP_416485.4" and "YP_986467.7" are from header descriptor (after \Programs). (Note that, the actual value for "NP_416485.4" for example, should be, "NP_416485.4: 0, 2-4", but I increases all of them with +1 as I don't want to start with 0).

Thanks for your consideration. I would appreciate any help. Thank you

Answer

Here is one approach. In case you would have a DOS data file on a Unix machine, I used \r?\n to match a new line, so it will work for all cases:

use feature qw(say);
use strict;
use warnings;

my $file_name = 'input.txt';
open ( my $fh, '<', $file_name ) or die "Could not open file '$file_name': $!";
my $str = do { local $/; <$fh> };
close $fh;

my @chunks = $str =~ /(Sequence(?:.(?!Sequence))*)/sg;
my %ids;
for my $cstr ( @chunks ) {
    my ( $id, $data ) = $cstr 
        =~/Split_Seq\/(\S+)\.fasta.*?\r?\n\r?\n(.*)$/s;
    my @lines = split /\n/, $data;
    my @vals;
    for my $line ( @lines ) {
        my @fields = split " ", $line;
        push ( @vals, $fields[0] + 1 ) if $fields[1] > 0.5;
    }
    $ids{$id} = \@vals;
}

for my $id ( keys %ids ) {
    my @tmp = sort { $a <=> $b } @{ $ids{$id} };
    my ( $first, $last );
    my @rr;
    for my $i (0..$#tmp) {
        if ( $i == 0 ) {
            $first = $tmp[0];
            $last = undef;
        }
        if ( $i < $#tmp && ($tmp[$i] == ($tmp[$i+1] - 1 )) ) {
            $last = $tmp[$i+1];
            next;
        }
        if ( defined $last ) {
            push @rr, "$first-$last";
            $last = undef;
        }
        else {
            push @rr, $tmp[$i];
        }
        $first = ( $i < $#tmp ) ? $tmp[$i+1] : undef;
    }
    say "$id: ", join ",", @rr;
}

Output:

NP_416485.4: 1,3-5
YP_986467.7: 2-4