jnmf jnmf - 6 months ago 12
Perl Question

Reverse complement of fasta file

I'm trying to get the reverse complement of RNA in a multi fasta file

input:

>cel-mir-39 MI0010 C elegans miR-39
UAUACCGAGAGCCCAGCUGAUUUCGUCUUGGUAAUAAGCUCGUCAUUGAGAUUAUCACCGGGUGUAAAUCAGCUUGGCUCAAAAAAAA

>cel-let-7 MI0001 C elegans let-7
UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGGGGGGG


output:

>cel-mir-39 MI0010 C elegans miR-39
UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGACGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA

>cel-let-7 MI0001 C elegans let-7
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAAUAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA


But I'm getting this instead:

UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGACGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA
93-Rim snucele G 0100IM 93-rim-leg
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAAUAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA
7-tel snucele G 1000IM 7-tel-leg


my code:

#!/usr/bin/perl
use strict;
use warnings;

print "type in the path of the file\n";
my $file_name = <>;
chomp($file_name);

open (FASTA, $file_name) or die "error #!";

$/ = ">";
<FASTA>;
while (my $entry = <FASTA>){
$entry = reverse $entry;
$entry =~ tr/ACGUacgu/UGCAugca/;
print "$entry \n";
}

close(FASTA);


How can I reverse only the sequence and not the header?
Thanks

Answer

Updated to deal with the need to combine lines.


Reading records separated by > is a nice idea as it gives you the whole chunk at a time. However, here you want to process and merge lines but not the header, thus distinguishing between lines. It is clearer to read line by line.

The sequence-line is specific: all caps and nothing else. The blank line separates records to process. The remaining possibility is the header. The sequence is assembled by joining lines that match its pattern, and once we hit the blank line it is processed and printed.

open (FASTA, $file_name) or die "error $!";

# sequence, built by joining lines =~ /^[A-Z]+$/
my $sequence = '';

while (my $entry = <FASTA>)
{
    if ($entry =~ m/^[A-Z]+$/) {
        # Assemble the sequence from separate lines
        chomp($entry);
        $sequence .= $entry;
    }
    elsif ($entry =~ m/^\s*$/) { 
        # process and print the sequence and blank line, reset for next
        $sequence = reverse $sequence;
        $sequence =~ tr/ACGUacgu/UGCAugca/;
        print "$sequence\n";
        print "\n";
        $sequence = '';
    }
    else { # header
        print $entry;
    }
}

# Print the last sequence if the file didn't end with blank line    
if (length $sequence) {
    $sequence = reverse $sequence;
    $sequence =~ tr/ACGUacgu/UGCAugca/;
    print "$sequence\n";
}

The ^ and $ are anchors, for the beginning and end of string. So the regex matching the sequence requires that the whole line be strictly caps. The other regex allows only optional space \s*, specifying a blank line.

The sequence processing is copied from the question.

Comments