Mike F Mike F - 4 months ago 7
Perl Question

Modifying perl script to not print duplicates and extract sequences of a certain length

I want to first apologize for the biological nature of this post. I thought I should post some background first. I have a set of gene files that contain anywhere from one to five DNA sequences from different species. I used a bash shell script to perform blastn with each gene file as a query and a file of all transcriptome sequences (

all_transcriptome_seq.fasta
) from the five species as the subject. I now want to process these output files (and there are many) so that I can get all subject sequences that hit into one file per gene, with duplicate sequences removed (except to keep one), and ensure I'm getting the length of the sequences that actually hit the query.

Here is what the blastn output looks like for one gene file (columns:
qseqid qlen sseqid slen qframe qstart qend sframe sstart send evalue bitscore pident nident length
)

Acur_01000750.1_OFAS014956-RA-EXON04 248 Apil_comp17195_c0_seq1 1184 1 1 248 1 824 1072 2e-73 259 85.60 214 250
Acurvipes_01000750.1_OFAS014956-RA-EXON04 248 Atri_comp5613_c0_seq1 1067 1 2 248 1 344 96 8e-97 337 91.16 227 249
Acur_01000750.1_OFAS014956-RA-EXON04 248 Acur_01000750.1 992 1 1 248 1 655 902 1e-133 459 100.00 248 248
Acur_01000750.1_OFAS014956-RA-EXON04 248 Btri_comp17734_c0_seq1 1001 1 1 248 1 656 905 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Atri_comp5613_c0_seq1 1067 1 2 250 1 344 96 1e-60 217 82.33 205 249
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Acur_01000750.1 992 1 1 250 1 655 902 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Btri_comp17734_c0_seq1 1001 1 1 250 1 656 905 1e-134 462 100.00 250 250


I've been working on a perl script that would, in short, take the
sseqid
column to pull out the corresponding sequences from the
all_transcriptome_seq.fasta
file, place these into a new file, and trim the transcripts to the
sstart
and
send
positions. Here is the script, so far:

#!/usr/bin/env perl

use warnings;
use strict;
use Data::Dumper;

############################################################################
# blastn_post-processing.pl v. 1.0 by Michael F., XXXXXX
############################################################################

my($progname) = $0;

############################################################################
# Initialize variables
############################################################################

my($jter);

my($com);
my($t1);

if ( @ARGV != 2 ) {
print "Usage:\n \$ $progname <infile> <transcriptomes>\n";
print " infile = tab-delimited blastn text file\n";
print " transcriptomes = fasta file of all transcriptomes\n";
print "exiting...\n";
exit;
}

my($infile)=$ARGV[0];
my($transcriptomes)=$ARGV[1];

############################################################################
# Read the input file
############################################################################
print "Reading the input file... ";
open (my $INF, $infile) or die "Unable to open file";
my @data = <$INF>;
print @data;
close($INF) or die "Could not close file $infile.\n";
my($nlines) = $#data + 1;
my($inlines) = $nlines - 1;
print "$nlines blastn hits read\n\n";

############################################################################
# Extract hits and place sequences into new file
############################################################################
my @temparray;
my @templine;
my($seqfname);

open ($INF, $infile) or die "Could not open file $infile for input.\n";
@temparray = <$INF>;
close($INF) or die "Could not close file $infile.\n";
$t1 = $#temparray + 1;
print "$infile\t$t1\n";
$seqfname = "$infile" . ".fasta";
if ( -e $seqfname ) {
print " --> $seqfname exists. overwriting\n";
unlink($seqfname);
}

# iterate through the individual hits
for ($jter=0; $jter<$t1; $jter++) {
(@templine) = split(/\s+/, $temparray[$jter]);
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
# print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
} # end for ($jter=0; $jter<$t1...

# Arguments for "extract_from_genome2"
# // argv[1] = name of genome file
# // argv[2] = gi number for contig
# // argv[3] = start of subsequence
# // argv[4] = end of subsequence
# // argv[5] = name of output sequence


Using this script, here is the output I'm getting:

>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA


As you can see, it's pretty close to what I'm wanting. Here are the two issues I have and cannot seem to figure out how to resolve with my script. The first is that a sequence may occur more than once in the
sseqid
column, and with the script in its current form, it will print out duplicates of these sequences. I only need one. How can I modify my script to not duplicate sequences (i.e., how do I only retain one but remove the other duplicates)? Expected output:

>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA


The second is the script is not quite extracting the right base pairs. It's super close, off by one or two, but its not exact.

For example, take the first subject hit
Apil_comp17195_c0_seq1
. The
sstart
and
send
values are 824 and 1072, respectively. When I go to the
all_transcriptome_seq.fasta
, I get

AAGATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAAC


at that base pair range, not

GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA


as outputted by my script, which is what I'm expecting. You will also notice that the sequence outputted by my script is slightly shorter than it should be. Does anyone know how I can fix these issues in my script?

Thanks, and sorry for the lengthy post!

Answer

You can use hash to remove duplicates

The bellow code remove duplicates depending on their subject length (keep larger subject length rows).

Just update your # iterate through the individual hits part with

# iterate through the individual hits
my %filterhash;
my $subject_length;
for ($jter=0; $jter<$t1; $jter++) {
    (@templine) = split(/\s+/, $temparray[$jter]);
        $subject_length = $templine[9] -$templine[8];
        if(exists $filterhash{$templine[2]} ){
                if($filterhash{$templine[2]} < $subject_length){

                        $filterhash{$templine[2]}= $subject_length;
                }
        }
        else{
                $filterhash{$templine[2]}= $subject_length;
        }
    }
my %printhash;
for ($jter=0; $jter<$t1; $jter++) {
    (@templine) = split(/\s+/, $temparray[$jter]);
        $subject_length = $templine[9] -$templine[8];
        if(not exists $printhash{$templine[2]})
        {
                 $printhash{$templine[2]}=1;
                if(exists $filterhash{$templine[2]} and $filterhash{$templine[2]} == $subject_length ){

                        $com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
                        # print "$com\n";
                        system("$com");
                        system("cat temp.3 >> $seqfname");
                        }
        }
    } # end for ($jter=0; $jter<$t1...

Hope this will help you.