Satheesh Siva Nallathambi Satheesh Siva Nallathambi - 6 months ago 12
Perl Question

Cluster the pattern by positions?

I have input file as follow.

ggaaaa (973026 to 973032) ctggag (1849680 to 1849686) = 6
ggaaaa (973056 to 973062) ctggag (1849706 to 1849712) = 6
ggaaaa (97322 to 97328) ctggag (184962 to 184968) = 6
cctgtggataacctgtgga (1849554 to 1849572) tccacaggttatccacagg (1849615 to 1849633) = 19
ggcccccccggagtt (470079 to 470093) aactccgggggggcc (1849574 to 1849588) = 15
ctggag (18497062 to 18497068) ggaaaa (9730562 to 9730568) = 6


First string is pattern with in bracket pattern position. Second string is repeat with in bracket repeat position. First 3 lines pattern and 6th line repeat are same but positions are different. So i want to make it as a cluster my output like this

ggaaaa==>(973026 to 973032)(973056 to 973062)(97322 to 97328)(9730562 to 9730568) ctggag==>(1849680 to 1849686)(1849706 to 1849712)(184962 to 184968)(18497062 to 18497068) 6 8
cctgtggataacctgtgga==>(1849554 to 1849572) tccacaggttatccacagg==>(1849615 to 1849633) 19 2
ggcccccccggagtt==>(470079 to 470093) aactccgggggggcc==>(1849574 to 1849588) 15 2


So first string are pattern and followed by their positions Second string are repeat and followed by their positions and TAB seperator is length of string again TAB separator is total of the pattern and repeat positions

I tried, small file i am getting output but large size file not getting output. I pasted my code below.

my $file = $_[0];
my %hashA;
my %hashB;
my @sorted;
my $i = 1;
my $j=$k=0;
my $tmplen = $len = 0;

my @sorted = `sort -nk10 $file`;

push(@sorted,"***");

open (FLWR,">$file") or die "File can't open $!";

$lengt = $sorted[0];
print FLWR $lengt;
my $linelen = @sorted;

while($i < $linelen)
{
($seqs,$len) = split(/\=/,$sorted[$i]);
$len =~s/\s+//g;
my($first,$second,$third,$fourth) = split(/\s+(?!to|\d+)/,$seqs);
if($len != $tmplen || $sorted[$i] eq "***")
{
if($tmplen != 0)
{
foreach $Alev2 (sort keys %{$hashA{$tmplen}})
{
foreach $Alev3 (sort keys %{$hashA{$tmplen}{$Alev2}})
{
foreach $Blev2 (sort keys %{$hashB{$tmplen}})
{
foreach $Blev3 (sort keys %{$hashB{$tmplen}{$Blev2}})
{
if($Alev3 eq $Blev3 && $Alev2 != $Blev2)
{
($Akey) = keys (%{$hashA{$tmplen}{$Blev2}});
($Akey1) = keys (%{$hashA{$tmplen}{$Blev2}{$Akey}});

foreach $Blev4 (sort keys %{$hashB{$tmplen}{$Blev2}{$Blev3}})
{
$hashA{$tmplen}{$Alev2}{$Alev3}{$Blev4}++;
$hashB{$tmplen}{$Alev2}{$Akey}{$Akey1}++ ;
}
delete($hashB{$tmplen}{$Blev2});
delete($hashA{$tmplen}{$Blev2});
}
}
}
}
}
}
$tmplen = $len;
}
if($first ne $dump_first)
{
$dump_first = $first;
$j++;
}

$hashA{$tmplen}{$j}{$dump_first}{$second}++;
$hashB{$tmplen}{$j}{$third}{$fourth}++;

$i++;
}

foreach $s1(sort keys %hashA)
{
foreach $s2 (sort keys %{$hashA{$s1}})
{
my $seq_concat = "";
my $a_concat = "";
my $b_concat = "";
my $a_inc = 0;
my $b_inc = 0;
foreach $s3 (sort keys %{$hashA{$s1}{$s2}})
{
next if($s3 eq "");
$Aseq_concat = "$s3==>";

foreach $s4 (sort keys %{$hashA{$s1}{$s2}{$s3}})
{
$a_inc++;
$a_concat .= $s4;
}
}

$s3 = "";
foreach $s3 (sort keys %{$hashB{$s1}{$s2}})
{
next if($s3 eq "");

$Bseq_concat = "$s3==>";
foreach $s4 (sort keys %{$hashB{$s1}{$s2}{$s3}})
{
$b_inc++;
$b_concat .= $s4;
}
}

next if($b_concat eq "");
$Bseq_concat = uc($Bseq_concat);
$b_concat = uc($b_concat);
$Aseq_concat = uc($Aseq_concat);
$a_concat = uc($a_concat);
if($a_inc > $b_inc)
{
print FLWR $Bseq_concat.$b_concat,"\t",$Aseq_concat,$a_concat;
}
else
{
print FLWR $Aseq_concat.$a_concat,"\t",$Bseq_concat,$b_concat;
}
print FLWR "\t$s1\t";
print FLWR $a_inc+$b_inc;
print FLWR "\n";

}
}

Answer

I'd prefer to suggest some updates to your code rather than hand out a completely re-worked solution - but I simply got lost in the level of nesting you have above.

Just a few key points;

  1. The best attack with this is identifying that the ranges are the same format - therefore use a regex with 'global', /g option in a while loop.
  2. Having decided to center the solution on a regex, its best to use 'extended mode' and lots of whitespace and comments.
  3. The core data structure is a hash of hashes. The first level is keyed on the pattern string, the inner one keyed on the range specification string.
  4. Your specification implies that all patterns on a given line of the data will be of the same length - the code I give below relies on this fact.
  5. I added some checks to the given pattern length and the start and end positions - if the data is rock solid, maybe you don't need them?
  6. Your requested output format added complexity to the code - if there was one report line for each pattern the code would be much simpler. As it is, a pattern appears at a certain line number in the report if it first appears on the corresponding line in the data. Recording the line that a pattern first appears added significantly to the code.
  7. The code is written unix filter style - data supplied via STDIN and printed on STDOUT. Errors and warnings on STDERR

So, given those points;

use v5.12;
use warnings;

my $pattern_regex = qr/

    (\w+)             # capture actual pattern to $1
    \s*               # optional whitespace after pattern
    (                 # capture the following to $2
        \(            # literal open bracket
        (\d+)         # capture start pos to $3
        \s* to \s*
        (\d+)         # capture end pos to $4
        \)            # literal close bracket
    )                 # close capture whole range spec to $1
    \s*               # gobble up any whitespae between patterns

/x ;                  # close regex - extended mode

my %ranges ;
my @first_seen_on_line ;

while (<>) {
    chomp ;
    my ($patterns_str, $given_length) = split /\s*=\s*/ ;
    die "No length on line $." unless defined $given_length ;

    # Repeatedly look for patterns and range-specifications
    while ( $patterns_str =~ /$pattern_regex/g )  {
        my ($pattern, $range_spec, $start_pos, $end_pos) = ($1,$2,$3,$4);
        warn "Incorrect length for '$pattern' on line $.\n"
            unless length($pattern) == $end_pos - $start_pos
               &&  length($pattern) == $given_length ;

        # Is this the first time we've seen this pattern?
        if ( defined $ranges{ $pattern } )  {
            # No its not - add this range to the hash of ranges for this pattern
            $ranges{ $pattern }{ $range_spec }++ ;
        }
        else {
            # Yes it is -  record the fact that it was on this line and
            # initialize the hash of ranges for this pattern
            push @{ $first_seen_on_line[ $. ] }, $pattern ;
            $ranges{ $pattern } = { $range_spec => 1 } ;
        }
    }
}

for my $line (1 .. $#first_seen_on_line)  {
    # Might not be anything to do for this line
    next unless defined $first_seen_on_line[$line] ;

    # Get the patterns that first appeared on this line
    my @patterns =  @{ $first_seen_on_line[$line] } ;

    my ($pat_length , $range_count) ;
    for my $pat (@patterns)  {
        # Get all the ranges for this pattern and print them
        my @ranges = keys %{ $ranges{ $pat } };
        print $pat,  '==>', @ranges, "\t" ;
        $range_count += @ranges ;
        $pat_length = length($pat) ;
    }
    print $pat_length, "\t";
    print $range_count, "\n" ;
    # print length $pat, "\t", scalar @ranges, "\n" ;
}

Ran on the data above, it produces;

gaaaa==>(973026 to 973032)(973056 to 973062)(97322 to 97328)(9730562 to 9730568)    ctggag==>(1849680 to 1849686)(1849706 to 1849712)(184962 to 184968)(18497062 to 18497068)   6   8
cctgtggataacctgtgga==>(1849554 to 1849572)  tccacaggttatccacagg==>(1849615 to 1849633)  19  2
ggcccccccggagtt==>(470079 to 470093)    aactccgggggggcc==>(1849574 to 1849588)  15  2