Sunny Sunny - 29 days ago 20
Perl Question

Perl program to look for k-mer with specific sequence

I am trying to enhance a perl program I have previously written so that it recognizes top 1000 length 23 k-mers that ends with GG and print out the k-mers that only appears once in the sequence. However, no matter where I add the reg exp, I am unable to get the expected result.

The code I have:

#!/usr/bin/perl
use strict;
use warnings;

my $k = 23;
my $input = '/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta';
my $output = 'uniqueKmersEndingGG.fasta';
my $match_count = 0;

#Open File
unless ( open( FASTA, "<", $input ) ) {
die "Unable to open fasta file", $!;
}

#Unwraps the FASTA format file
$/ = ">";

#Separate header and sequence
#Remove spaces
unless ( open( OUTPUT, ">", $output ) ) {
die "Unable to open file", $!;
}

<FASTA>; # discard 'first' 'empty' record

my %seen;
while ( my $line = <FASTA> ) {
chomp $line;
my ( $header, @seq ) = split( /\n/, $line );
my $sequence = join '', @seq;

for ( length($sequence) >= $k ) {
$sequence =~ m/([ACTG]{21}[G]{2})/g;

for my $i ( 0 .. length($sequence) - $k ) {
my $kmer = substr( $sequence, $i, $k );

##while ($kmer =~ m/([ACTG]{21}[G]{2})/g){
$match_count = $match_count + 1;
print OUTPUT ">crispr_$match_count", "\n", "$kmer", "\n" unless $seen{$kmer}++;
}
}
}


The input fasta file looks like this:

> >2L type=chromosome_arm; loc=2L:1..23011544; ID=2L; dbxref=REFSEQ:NT_033779,GB:AE014134; MD5=bfdfb99d39fa5174dae1e2ecd8a231cd; length=23011544; release=r5.54; species=Dmel;
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT
TTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGT
GCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGA
TGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCA
TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTG
CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG
CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATT
TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATAT
TGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAA
ACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTG
CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCG
AGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATG
GTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA
TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA
GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTA
TATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCG
GAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGC
CAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTT
TACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATAT


and so on...

The expected outcome (print out the 23k-mers with GG ending that only appear once in the sequence) I am hoping to get:

>crispr_1
GGGTGGAGCTCCCGAAATGCAGG
>crispr_2
TTAATAAATATTGACACAGCGGG
>crispr_3
ATCGTGGGGCGTTTTGTGAAAGG
>crispr_4
AGTTTTTCACATAATCAGACAGG
>crispr_5
GTGTTGGATGAGTGTCCTCTGGG
>crispr_6
ATAGGTTGGTTGTTTTAAAAGGG
>crispr_7
AAATTTTTGTTGCCACTGAATGG
>crispr_8
AAGTTTCGAACTACGATGGTTGG
>crispr_9
CATGCTTTGTGGAAATAAGTCGG
>crispr_10
CACAGTGGGTGTTTGCACCTCGG
.... and so on


The current code I did create a fasta file with following:

>crispr_1
CGACAATGCACGACAGAGGAAGC
>crispr_2
GACAATGCACGACAGAGGAAGCA
>crispr_3
ACAATGCACGACAGAGGAAGCAG
>crispr_4
CAATGCACGACAGAGGAAGCAGA
>crispr_5
AATGCACGACAGAGGAAGCAGAA
>crispr_6
ATGCACGACAGAGGAAGCAGAAC
>crispr_7
TGCACGACAGAGGAAGCAGAACA
>crispr_8
GCACGACAGAGGAAGCAGAACAG
>crispr_9
CACGACAGAGGAAGCAGAACAGA
>crispr_10
ACGACAGAGGAAGCAGAACAGAT
.... and so on


while if I remove the

for (length($sequence) >=$k){
$sequence =~m/([ACTG]{21}[G]{2})/g;


and add the ##while ($kmer =~ m/([ACTG]{21}[G]{2})/g){

while ($kmer =~ m/([ACTG]{21}[G]{2})/g){


I am getting fasta file (with results which is not numbered correctly and unable to distinguish between duplicated and unique sequences):

>crispr_1
CATTTTCTCTCCCATATTATAGG
>crispr_2
ATTTTCTCTCCCATATTATAGGG
>crispr_3
TATTGTGCTCTTTGATTTTTTGG
>crispr_4
GATTTTTTGGCAACCCAAAATGG
>crispr_5
TTTTTGGCAACCCAAAATGGTGG
>crispr_6
TTGGCAACCCAAAATGGTGGCGG
>crispr_7
ACGACAGAGAGAGAGAGCAGCGG
>crispr_8
AAATCGACAATGCACGACAGAGG
>crispr_91
TATTGTGATCTTCGATTTTTTGG
>crispr_93
TTTTTGGCAACCCAAAATGGAGG
.... and so on


I have attempted to move the regex around my code, but none of them generated the expected result. I do not know what I did wrong over here. I have not add the exit the program when count reaches 1000 into the code yet.

Thanks in advance!

Answer

I'm not sure I understand your question completely, but could the following be what you need.

<FASTA>; # discard 'first' 'empty' record

my %data;
while (my $line = <FASTA>){
    chomp $line;
    my($header, @seq) = split(/\n/, $line);
    my $sequence = join '', @seq;

    for my $i (0 .. length($sequence) - $k) {
        my $kmer = substr($sequence, $i, $k);

        $data{$kmer}++ if $kmer =~ /GG$/;
    }
}
my $i = 0;
for my $kmer (sort {$data{$b} <=> $data{$a}} keys %data) {
    printf "crispr_%d\n%s appears %d times\n", ++$i, $kmer, $data{$kmer};
    last if $i == 1000; 
}

Some output on a file I have is:

crispr_1
ggttttccggcacccgggcctgg appears 4 times
crispr_2
ccgagctgggcgagaagtagggg appears 4 times
crispr_3
gccgagctgggcgagaagtaggg appears 4 times
crispr_4
gcacccgggcctgggtggcaggg appears 4 times
crispr_5
agcagcgggatcgggttttccgg appears 4 times
crispr_6
gctgggcgagaagtaggggaggg appears 4 times
crispr_7
cccttctgcttcagtgtgaaagg appears 4 times
crispr_8
gtggcagggaagaatgtgccggg appears 4 times
crispr_9
gatcgggttttccggcacccggg appears 4 times
crispr_10
tgagggaaagtgctgctgctggg appears 4 times
crispr_11
agctgggcgagaagtaggggagg appears 4 times

. . . .

ggcacccgggcctgggtggcagg appears 4 times
crispr_50
gaatctctttactgcctggctgg appears 4 times
crispr_51
accacaacattgacagttggtgg appears 2 times
crispr_52
caacattgacagttggtggaggg appears 2 times
crispr_53
catgctcatcgtatctgtgttgg appears 2 times
crispr_54
gattaatgaagtggttattttgg appears 2 times
crispr_55
gaaaccacaacattgacagttgg appears 2 times
crispr_56
aacattgacagttggtggagggg appears 2 times
crispr_57
gacttgatcgattaatgaagtgg appears 2 times
crispr_58
acaacattgacagttggtggagg appears 2 times
crispr_59
gaaccatatattgttatcactgg appears 2 times
crispr_60
ccacagcgcccacttcaaggtgg appears 1 times
crispr_61
ctgctcctgggtgtgagcagagg appears 1 times
crispr_62
ccatatattatctgtggtttcgg appears 1 times

. . . .

Update To get the results you mentioned in your comment (below), replace the output code with:

my $i = 1;

while (my ($kmer, $count) = each %data) {
    next unless $count == 1;
    print "crispr_$i\n$kmer\n";
    last if $i++ == 1000;
}
Comments