user2300940 user2300940 - 5 months ago 13
Perl Question

find frequency of substring in a set of strings

I have as input a gene list where each genes has a header like >SomeText.
For each gene I would like to find the frequency of the string

GTG
. (number of occurences divided by length of gene). The string should only be counted if it starts at position 1,4,7,10 etc (every thids position).

>Gene1
GGGGTGGCCTTTGAGAAGCTGCTGTTAACGTATTTGCCAGTTACGAAGTTCCACTGAAAATTTTCCTATTAATTCTTAAGTACTCTGCATAAGGGGGAAAAGCTTCCAGAAAGCAGCCATGAACCAGGCTGTCCAGGAATGGCAGCTGTATCCAACCACAAACAACAAAGGCTACCCTTTGACCAAATGTCTTTCTCTGCAACATGG
>Gene2
GCCAGTTCCACTGCCCCCTTGATCTTTGAAGGAGTCCTCAGGCCCCTCGCAGCATAAGGATGTTTTGCAACTTT
>Gene3
GGGTGTTGGTGGAATAAGCAGTGGAATTTGAACAAGGAAGAGGAGAAAAGGGAATTTTGTCTTTATGGGGTGGGGTGATTTTCTCCTAGGGTTATGTCCAGTTGGGGTTTTTAAGGCAGCACAGACTGCCAAGTACTGTTTTTTTTAACCGACTGAAATCACTTTGGGATATTTTTTCCTGCAACACTGGAAAGTTTTAGTTTTTTAAGAAGTACTCATGCAGATATATATATATATATTTTTCCCAGTCCTTTTTTTAAGAGACGGTCTTTATTGGGTCTGCACCTCCATCCTTGATCTTGTTAGCAATGCTGTTTTTGCTGTTAGTCGG


Output:

Gene Frequency
Gene1: 3
Gene2 6.3
....


I was thinging of something like this, but I dont now how to define the positions requirements:

freq <- sapply(gregexpr("GTG",x),function(x)if(x[[1]]!=-1) length(x) else 0)

Answer

Here is an alternate solution that does not use a pattern match. Not that it will matter much.

use strict;
use warnings;

my $gene;
while ( my $line = <> ) {
    if ( $line =~ /^>(.+)/ ) {
        $gene = $1;
        next;
    }

    chomp $line;

    printf "%s: %s\n", 
        $gene, 
        ( grep { $_ eq 'GTG' } split /(...)/, $line ) / length $line;
}

Output:

Gene1: 0.00483091787439614
Gene2: 0
Gene3: 0.00302114803625378

It is essentially similar to Sobrique's answer, but assumes that the gene lines contain the right characters. It splits up the gene string into a list of three-character pieces and takes the ones that are literally GTG.

The splitting works by abusing the fact that split uses a pattern as the delimiter, and that it will also capture the delimiter if a capture group is used. Here's an example.

my @foo = split /(...)/, '1234567890';
p @foo; # from Data::Printer

__END__
[
    [0] "",
    [1] 123,
    [2] "",
    [3] 456,
    [4] "",
    [5] 789,
    [6] 0
]

The empty elements get filter out by grep. It might not be the most efficient way, but it gets the job done.


You can run it by calling perl foo.pl horribly-large-gene-sequence.file.

Comments