Tomb Tomb - 7 months ago 36
Perl Question

Parsing a GenBank file

I wrote this code to parse a GenBank file so I could get the accession number, the definition, the size of the file and the DNA sequence

My input file looks like this

LOCUS NM_001323410 6992 bp mRNA linear PRI 20-APR-2016
DEFINITION Homo sapiens proteasome inhibitor subunit 1 (PSMF1), transcript
variant 6, mRNA.
ACCESSION NM_001323410
VERSION NM_001323410.1 GI:1020738720
KEYWORDS RefSeq.
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
....
ORIGIN
1 actacttccg gcttccccgc cccgccccgt ccccgggcgt ctccattttg gtctcaggtg
61 tggactcggc aagaaccagc gcaagaggga agcagagtta tagctacccc ggccgcggag
121 ccggctcact gcactacccc cgcccccttc tttcctccag acgccgaagt cgcgggcgct
181 catggcgggc ctggaggtac tgttcgcatc ggcagcgccg gccatcacct gcaggcagga
241 cgcgctcgtc tgcttcttgc attgggaagt ggtgacacac ggttacttcg gcttgggtgt


My output looks like this:

NM_001323410 | Homo sapiens proteasome inhibitor subunit 1 (PSMF1), transcript | 6992 bp
actacttccggcttccccgccccgccccgtccccgggcgtctccattttggtctcaggtgtggactcg ...


so I have this code working fine:

#!/usr/bin/perl -w

use strict;

print "type in the name of a file\n";
my $filename = <>;
chomp($filename);

unless ( open( GBFILE, $filename ) ) {
print "Cannot open GenBank file \"$filename\"\n\n";
exit;
}

my $gbfile = "";

while ( my $line = <GBFILE> ) {
$gbfile .= $line;
}

close(GBFILE);

my @gbfiles = split( /\/\/\s*/s, $gbfile );

for ( my $i = 0; $i < @gbfiles; $i++ ) {

my $record = $gbfiles[$i];

#separate the annotation from the sequence data
my ( $annotation, $dna ) = ( $record =~ /^(LOCUS.*ORIGIN\s*\n) (.*)/s );

#get length of file in bp: 0 to 9 intergers plus space plus bp
my ($length) = ( $annotation =~ /([0-9]+\s+bp)/ );

my ($definition) = ( $annotation =~ /(DEFINITION.+\n)/ );
chomp($definition);
$definition =~ s/^DEFINITION\s*//;

#accession number
my ($accession) = ( $annotation =~ /(ACCESSION.+[0-9])/ );
$accession =~ s/^ACCESSION\s*//;

#organism species
my ($organism) = ( $annotation =~ /(ORGANISM\s+.+\n)/ );
chomp($organism);
$organism =~ s/^\s*ORGANISM\s*//;

#molecule type
my ($type) = ( $annotation =~ /(mol_type.+\n)/ );
chomp($type);
$type =~ s/mol_type=//;
$type =~ s/"//;
$type =~ s/"//;

#take off spaces/numbers from $dna
$dna =~ s/\s//g;
$dna =~ s/[0-9]+//g;

#conditions
if ( ( $type =~ m/mRNA/i )
&& ( $organism =~ m/homo sapiens/i )
&& ( $accession =~ m/[a-z]{2}_[0-9]{6,}/i ) ) {

print "$accession | $definition | $length\n";
print "$dna\n";
print "\n";

#if conditions don't apply in file
}
else {
print "conditions don't apply\n";
}
}

exit;


I'm trying to change the code so it's nicer like putting things in a loop like this example from a book I'm learning from:

for my $line (@gbfiles) {
if($line =~ /^DEFINITION/) {
$line =~ s/^DEFINITION\s*//;
$definition = $line;
$flag = 1;
}elsif($line =~ /^ACCESSION/) {
$line =~ s/^ACCESSION\s*//;
$accession = $line;
$flag = 0;
}elsif($flag) {
chomp($definition);
$definition .= $line;
}elsif($line =~ /^ ORGANISM/) {
$line =~ s/^\s*ORGANISM\s*//;
$organism = $line;
}
}


but I can't manage to substitute this into my code, even if I declare things like in the book:

my @gbfiles = ( );
my $accession = '';
my $organism = '';
my $definition = '';
my $flag = 0;


I get this error

Global symbol "$..." requires explicit package name


Is there a way to modify my code and make it shorter and just declare all the variables at once like they do in the book and parse the file in one or two blocks of code?

Answer

If you have access to Bio Perl, you might find a solution such as the following.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $in  = Bio::SeqIO->new( -file   => "input.txt",
                           -format => 'GenBank');

while ( my $seq = $in->next_seq ) {
    my $acc = $seq->accession;
    my $length = $seq->length;
    my $definition = $seq->desc;
    my $type = $seq->molecule;
    my $organism = $seq->species->binomial;

    if ($type eq 'mRNA'              &&
        $organism =~ /homo sapiens/i &&
        $acc =~ /[A-Za-z]{2}_[0-9]{6,}/ )
    {
        print "$acc | $definition | $length\n";
        print $seq->seq, "\n";
        print "\n";
    }
}

I was able to capture the 5 variables from a sample GenBank file I have (input.txt). It should simplify your code.