jnmf jnmf - 6 months ago 19
Perl Question

Get shortest and longest sequence in file

I'm trying to get the shortest and longest sequence in a file containing multiple genbank-like entries. example of the file:

LOCUS NM_182854 2912 bp mRNA linear PRI 20-APR-2016
DEFINITION Homo sapiens mRNA.
ACCESSION NM_182854
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.

ORIGIN
1 gggcgatcag aagcaggtca cacagcctgt ttcctgtttt caaacgggga acttagaaag
61 tggcagcccc tcggcttgtc gccggagctg agaaccaaga gctcgaaggg gccatatgac
//

LOCUS NM_001323410 6992 bp mRNA linear PRI 20-APR-2016
DEFINITION Homo sapiens mRNA.
ACCESSION NM_001323410
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.

ORIGIN
1 actacttccg gcttccccgc cccgccccgt ccccgggcgt ctccattttg gtctcaggtg
61 tggactcggc aagaaccagc gcaagaggga agcagagtta tagctacccc ggc
//


I'd like to print the accession number, the type of the organism from the shortest sequence and the longest sequence

my code so far:

#!/usr/bin/perl

use strict;
use warnings;

print "enter file path\n";

while (my $line = <>){
chomp $line;
my @record = ($line);

foreach my $file(@record){
open(IN, "$file") or die "\n error opening file \n;/\n";

$/="//";

while (my $line = <IN>){
my @gb_seq = split ("ORIGIN", $line);
my $definition = $gb_seq[0];
my $sequence = $gb_seq[1];

$definition =~ m/ORGANISM[\s\t]+(.+)[\n\s\t]+/;
my $organism = $1;

if ($definition =~ m/ACCESSION[\s\t]+(\D\D_\d\d\d\d\d\d(\d*))[\n\s\t]+/){
my $accession = $1;

$sequence =~ s/\d//g;
$sequence =~ s/[\n\s\t]//g;
my $size = length($sequence);
my @sorted_keys = sort { $a <=> $b } keys my %size;
my $shortest = $sorted_keys[0];
my $longest = $sorted_keys[-1];

print "this is the shortest: $accession $organism size: $shortest\n";
print "this is the longest: $accession $organism size: $longest\n";
}
}}}
exit;


I thought about putting the length in a hash to get the shortest and the longest but something is wrong there. I get these errors:

Use of uninitialized value $organism in concatenation (.) or string at test.pl line 39, <IN> chunk 1
Use of uninitialized value $shortest in concatenation (.) or string at test.pl line 39, <IN> chunk 1.
Use of uninitialized value $longest in concatenation (.) or string at test.pl line 40, <IN> chunk 1.


What part should I change? Thanks

Answer

You state that you want two pieces of data - the accession and the organism - for the longest and shortest sequence. This means your hash values need to store two elements. As well as that, when you use '//' as a record separator, the '//' still appears on the end of each record. So, when you filter out whitespace and digits from you sequence, you're still left with '//' on the end. When I ran your code through the debugger, I was finding the lengths were all out by 2 because of this.

A couple of other things:

  1. When using regexs, use 'extended mode', /x, so you can include whitespace for readabillity
  2. you presume a successful match when you dig out $definition - better to test your regexs and assign on match, die on missmatch
  3. Rather than store the length in the hash (and lose the sequence itself), you might as well store the sequence and calculate the lengths later;
  4. I renamed the variable $line to $chunk as it contains several lines
  5. All the stuff to do with calculating the shortest and longest and printing the resuts needs to move out of the loop. In its place, you simply need to make an entry into the hash. As described above, the hash values need to be an array with two values - the accession and the organism.
  6. You remove digits from the sequence in one command and then whitespace from the sequence in another - might as well do them both togeather. While we're at it, might as well remove the '/'s on the end of the record.

Given the mods above, I get;

use v5.14;
use warnings;

print "Enter file path: ";
chomp(my $filename = <>);
open(IN, $filename) or die "\n error opening file \n;/\n";

$/ = "//" ;

my %organisms ;
while (my $chunk = <IN>)  {
    next if $chunk =~ /^\s*\n\s*$/ ;
    my ($definition , $sequence) = split "ORIGIN", $chunk ;

    my $organism ;
    $definition =~ m/ ORGANISM [\s\t]+ (.+) [\n\s\t]+ /x
        ? $organism = $1
        : die "Couldnt find ORGANISM line" ;

    my $accession ;
    $definition =~ m/ ACCESSION [\s\t]+ (\D\D _ \d{6} (\d*))  [\n\s\t]+ /x
        ? $accession = $1
        : die "Cant find ACCESSION line" ;

    $sequence =~ s/[\d\n\s\t\/]//g;
    $organisms{ $sequence } = [ $accession , $organism ] ;
}


my @sorted_keys = sort { length $a  <=>  length $b } keys %organisms ;
my $shortest = $sorted_keys[0];
my $longest  = $sorted_keys[-1];

say "this is the shortest: ",  $organisms{$shortest}->[0],
                        ", ",  $organisms{$shortest}->[1],
                   " size: ",  length $shortest, "\n",
               " sequence: ",  $shortest ;

say  "this is the longest: ",  $organisms{$longest}->[0],
                        ", ",  $organisms{$longest}->[1],
                   " size: ",  length $longest, "\n",
               " sequence: ",  $longest ;

exit;

when ran on your data, it produces;

$ ./sequence.pl
Enter file path: data.txt
this is the shortest: NM_001323410, Homo sapiens size: 113
 sequence: actacttccggcttccccgccccgccccgtccccgggcgtctccattttggtctcaggtgtggactcggcaagaaccagcgcaagagggaagcagagttatagctaccccggc
this is the longest: NM_182854, Homo sapiens size: 120
 sequence: gggcgatcagaagcaggtcacacagcctgtttcctgttttcaaacggggaacttagaaagtggcagcccctcggcttgtcgccggagctgagaaccaagagctcgaaggggccatatgac

UPDATE The problem with the code above is that if the same sequence appears in two chunks, then data is going to be overwritten in the hash and lost. Below is an updated version that stores data in an array of arrays which will advoid the problem. It produces exactly the same output:

use v5.14;
use warnings;

print "Enter file path: ";
chomp(my $filename = <>);
open(IN, $filename) or die "\n error opening file \n;/\n";

$/ = "//" ;

my @organisms ;
while (my $chunk = <IN>)  {
    next if $chunk =~ /^\s*\n\s*$/ ;
    my ($definition , $sequence) = split "ORIGIN", $chunk ;

    my $organism ;
    $definition =~ m/ ORGANISM [\s\t]+ (.+) [\n\s\t]+ /x
        ? $organism = $1
        : die "Couldnt find ORGANISM line" ;

    my $accession ;
    $definition =~ m/ ACCESSION [\s\t]+ (\D\D _ \d{6} (\d*))  [\n\s\t]+ /x
        ? $accession = $1
        : die "Cant find ACCESSION line" ;

    $sequence =~ s/[\d\n\s\t\/]//g;
    push @organisms, [$organism , $accession , $sequence] ;
}


my @sorted_organisms = sort { length $a->[2]  <=>  length $b->[2] }  @organisms ;

my ($organism , $accession , $sequence) = @{ $sorted_organisms[0] };
say "this is the shortest: $accession, $organism, size: ",
    length $sequence, "\n", " sequence: ",  $sequence ;

($organism , $accession , $sequence) = @{ $sorted_organisms[-1] };
say "this is the longest: $accession, $organism, size: ",
    length $sequence, "\n", " sequence: ",  $sequence ;

exit;