Tomb Tomb - 4 months ago 6x
Perl Question

How to optimize my perl code to make it faster

I have this code that is working fine but it's taking pretty much 100% of my cpu to run and it takes around 25min. I'd really like to optimize it but don't know what parts I could improve.

The main problem I guess is that I'm using coordinates of a file to be compared with a 3,5GB file.
So I tried putting the information of this 3,5GB file in a hash to make it faster but it's not what I thought it would be.
I want to compare exon coordinates to the whole human genome (hg38), see which two letters are before the exon position and which two are in the end, so that would count as an intron. Then, for each intron I want to see if those two letters are compatible with GUAG or not.

use warnings;
use strict;

print "Enter file path (refGene)\n";
my $ref_gene = <>;
chomp ($ref_gene);
my $local1 = "./$ref_gene";

print "Enter file path (hg38)\n";
my $filename = <>;
chomp ($filename);
my $local2 = "./$filename";

open (HG, $local2) or die "error opening $filename\n";

my %chromosome;
my $key;
while (my $line = <HG>) {
chomp ($line);
if ($line =~ /^\>/) {
$key = $line;
$key =~ s/\>//; # / stop editor red coloring
$chromosome{$key} = "";
else {
$chromosome{$key} .= uc $line;

close (HG);

open (REFGENE, $local1) or die "error opening $ref_gene\n";

my $total_introns = 0;
my $GUAG_introns = 0;
while (my $line = <REFGENE>) {
my @column = split ("\t", $line);
my $chr = $column[2];
my $sense = $column[3];
my $number_exons = $column[8];
my $beginning_of_exon = $column[9];
my $end_of_exon = $column[10];

my @beginning_exons = split (",", $beginning_of_exon);
my @end_exons = split (",", $end_of_exon);
foreach my $element (@end_exons) { #subtract 1 from end of exon because its 1-based index
$element -= 1;

my $introns = ($number_exons - 1); #amount of introns = exons - 1
$total_introns += $introns;

my $current_seq = $chromosome{$chr};

for (my $i=0; $i<@beginning_exons-1; $i++) {
my $end = substr ($current_seq, $end_exons[$i]+1, 2); #get two letters after end of exon
my $beginning = substr ($current_seq, $beginning_exons[$i+1]-2, 2); #get two letters before beggining of exon

if ($sense eq "+") {
if ( ($end eq "GT") && ($beginning eq "AG") ) {
$GUAG_introns += 1;

elsif ($sense eq "-") {

if ( ($end eq "CT") && ($beginning eq "AC") ) {
$GUAG_introns += 1;

close (REFGENE);

my $GUAG = (($GUAG_introns/$total_introns)*100); #percentage of GUAG introns
print "number of GUAG introns = $GUAG_introns \n total introns = $total_introns \n";
print " $GUAG% of GT-AG introns.\n\n";



717 NM_000525 chr11 - 17385248 17388659 17386918 17388091 1 17385248, 17388659,
987 NM_000242 chr10 - 52765379 52771700 52768136 52771635 4 52765379,52769246,52770669,52771448, 52768510,52769315,52770786,52771700,

I know it's a lot of code and that nobody will download 3,5GB of human genome to run this but I'd like to know which kind of improvements I can do to optimize the code and not use 100% of cpu. Thanks

edit: The two letters before the exon position would be AG or the last two letters of an intron

column 9 has the coordinates to the start of an exon and column 10 to the end, I'm trying to get the things in the middle and see if they start with GT and end with AG. Column 3 is the sense of the sequence, so if it's negative (-) instead of looking for GUAG (U cause it's RNA in the end but the reference(hg38) is T) I'll be looking for the reverse complement (AC instead of GT, and CT instead of AG). Column 2 is the chromosome, so for chromosome 6 I'd look the reference in hg38 for chromosome 6 and count the introns. The other file is 15MB


I have a program that seems to work and produce the same results as your own code. However I have a problem in that the intron doesn't seem to be defined properly

Suppose my sequence is the alphabet and the exons we have are defined by

@beginning_exons = (5, 17);
@end_exons       = (12, 21);
                  1                   2 
1 2 3 4|5 6 7 8 9 0 1 2|3 4 5 6|7 8 9 0 1|2 3 4 5 6 
A B C D|E F G H I J K L|M N O P|Q R S T U|V W X Y Z 
        [=============] <-----> [=======] 

so they are EFGHIJKL and QRSTU, forming an intron MNOP. The calculation in your program, and in my code that produces the same result, gives $end = 'MN', which is fine, but $beginning = 'PQ', which I don't understand. It is the last character of the intron and the first character of the exon

Can this be right? I have put exactly the values above into your code and mine and get the same result

Here's my solution

The main reason yours is so slow is that it's reading the HG38 file line by line, which means it has to upper case each line and append it to the sequence up to 4 million times per sequence

That's very slow because the repeated appending will mean that the string will frequently become too big for the space allocated to it, and it has to be copied to a larger space before it can be expanded. Copying such a huge string thousands of times takes a lot of processor work

I have written it so that a whole chromosome is read at a time. Then all that has to be done is to remove the newlines and set it to upper case, just once

I've also printed the each chromosome's name when it is encountered, to give some confirmation that the process is progressing. (You can stop this by removing the print " $key\n" statement.) The time taken to test the introns at the end is minimal

As I said, this code produces the results 492784 /499504 that you expect, but I'm still concerned about the end pair of bases. Please let me know if you have an explanation

use strict;
use warnings FATAL => 'all';


my ( $ref_gene_file, $hg38_file ) = qw/ refGene.txt hg38.fa /;

my %chromosome;

    print "Processing HG38:\n";

    open my $hg38_fh, '<', $hg38_file or die qq{Error opening "$hg38_file" for input: $!};

    local $/ = '>';

    my $key;

    while (<$hg38_fh>) {

        chomp;                    # Remove the '>'
        next unless /\S/;         # Ignore empty records

        s/(.+)// and $key = $1;   # Remove the name and store it /
        print "    $key\n";

        # Remove all non-alpha characters and upper-case everything
        $chromosome{$key} = uc tr/A-Za-z//cdr;     # /

    print "\n";

my $total_introns = 0;
my $GUAG_introns  = 0;

    open my $ref_gene_fh, '<', $ref_gene_file
            or die qq{Error opening "$ref_gene_file" for input: $!};

    while ( <$ref_gene_fh> ) {

        my @column = split;

        my ( $chr, $sense, $n, $start, $end ) = @column[ 2, 3, 8, 9, 10 ];
        my @starts = $start =~ /\d+/g;    # /
        my @ends   = $end   =~ /\d+/g;    # /

        die "Data discrepancy" unless @starts == $n and @ends == $n;

        my $introns = $n - 1;    # Amount of introns = exons - 1
        $total_introns += $introns;

        my $sequence = $chromosome{$chr} or die "No such chromosome";

        for my $i ( 0 .. $#starts-1 ) {

            # Convert 1-based character numbers to string offsets
            my ($curr_end_index, $next_begin_index) = ( $ends[$i], $starts[$i+1]-1 );

            my $intron_prefix = substr( $sequence, $curr_end_index, 2 );
            my $intron_suffix = substr( $sequence, $next_begin_index-1, 2 );  # Not -2 !!

            if ( $sense eq '+' ) {
                ++$GUAG_introns if $intron_prefix eq 'GT' and $intron_suffix eq 'AG';
            elsif ( $sense eq '-' ) {
                ++$GUAG_introns if $intron_prefix eq 'CT' and $intron_suffix eq 'AC';

printf "Number of GUAG introns = %d\n", $GUAG_introns;
printf "Total introns = %d\n",          $total_introns;
printf "%.1f%% of GT-AG introns.\n\n",  $GUAG_introns / $total_introns * 100;


Processing HG38:

Number of GUAG introns = 492784
Total introns = 499504
98.7% of GT-AG introns.

[Finished in 39.1s]