Sunny Sunny - 1 month ago 17
Perl Question

Find and print all overlapping k-mers

I am trying to write a perl program that reads a fasta file and prints out a text file containing all available (overlapping) length 15 k-mers from the sequence (fasta) file. This program works perfectly fine when I'm searching for non-overlapping k-mers, but when I coded it to find overlapping k-mers, it takes forever for it to execute and Cygwin ended up killed program after 12 hours. (I left the match_count there to count the total, please feel free to ignore that line)

#!/usr/bin/perl
use strict;
use warnings;

my $k = 15;
my $input = '/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta';
my $output = 'kmers.txt';
my $match_count = 0;

#Open File
unless (open(FASTA, "<", $input)){
die "Unable to open fasta file", $!;
}

#Unwraps the FASTA format file
$/=">";
#Separate header and sequence
#Remove spaces
unless (open(OUTPUT, ">", $output)){
die "Unable to open file", $!;
}

while (my $line = <FASTA>){
my($header, @seq) = split(/\n/, $line);
my $sequence = join '', @seq;

while (length($sequence) >= $k){
$sequence =~ m/(.{$k})/;
print OUTPUT "$1\n";
$sequence = substr($sequence, 1, length($sequence)-1);
}
}


The result I am looking for is:

A total of 20938309 k-mers printed in the text file when I use the wc -l command.


Thanks in advance!

Answer

Not sure why you're not getting your desired results.

I thought I'd post the 2 programs I've used following your problem description.

The first one just counts the kmers in a file I used for testing, (fasta_dat.txt). It doesn't print them out but is just a check to see how many kmers there are.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $in  = Bio::SeqIO->new( -file   => "fasta_dat.txt" ,
                           -format => 'fasta');

my $count_kmers;
my $k = 15;
while ( my $seq = $in->next_seq) {
    $count_kmers += $seq->length - $k + 1;
}

print $count_kmers;

__END__
C:\Old_Data\perlp>perl t9.pl
18657

You can see the count (after the __END__ token), 18657. This count agreed with the count of the kmers when I printed them out using your code.

#!/usr/bin/perl
use strict;
use warnings;
use 5.014;
use Devel::Size 'total_size';

my $k = 15;
my $input = 'fasta_dat.txt';
my $output = 'kmers.txt';
my $match_count = 0;

#Open File
unless (open(FASTA, "<", $input)){
    die "Unable to open fasta file", $!;
    }

    #Unwraps the FASTA format file
    $/=">";
    #Separate header and sequence
    #Remove spaces
unless (open(OUTPUT, ">", $output)){
    die "Unable to open file", $!;
}

<FASTA>; # discard 'first' 'empty' record

my %seen;
while (my $line = <FASTA>){
    chomp $line;
    my($header, @seq) = split(/\n/, $line);
    my $sequence = join '', @seq;

    for my $i (0 .. length($sequence) - $k) {
        my $kmer = substr($sequence, $i, $k);
        print OUTPUT $kmer, "\n" unless $seen{$kmer}++;
    }
}
print total_size(\%seen);

Update Tests I ran showed about a 100 times increase in memory for the hash size. The number of kmers in my test were about 18500. That resulted in a hash size of 1.8MB.

For your data, with kmers of 22M, would result in a hash size ~ 2.2GB. Don't know if this would exceed your memory capacity.