user2300940 - 2 years ago 67
Perl Question

# find frequency of substring in a set of strings

I have as input a gene list where each genes has a header like >SomeText.
For each gene I would like to find the frequency of the string

`GTG`
. (number of occurences divided by length of gene). The string should only be counted if it starts at position 1,4,7,10 etc (every thids position).

``````>Gene1
GGGGTGGCCTTTGAGAAGCTGCTGTTAACGTATTTGCCAGTTACGAAGTTCCACTGAAAATTTTCCTATTAATTCTTAAGTACTCTGCATAAGGGGGAAAAGCTTCCAGAAAGCAGCCATGAACCAGGCTGTCCAGGAATGGCAGCTGTATCCAACCACAAACAACAAAGGCTACCCTTTGACCAAATGTCTTTCTCTGCAACATGG
>Gene2
GCCAGTTCCACTGCCCCCTTGATCTTTGAAGGAGTCCTCAGGCCCCTCGCAGCATAAGGATGTTTTGCAACTTT
>Gene3
GGGTGTTGGTGGAATAAGCAGTGGAATTTGAACAAGGAAGAGGAGAAAAGGGAATTTTGTCTTTATGGGGTGGGGTGATTTTCTCCTAGGGTTATGTCCAGTTGGGGTTTTTAAGGCAGCACAGACTGCCAAGTACTGTTTTTTTTAACCGACTGAAATCACTTTGGGATATTTTTTCCTGCAACACTGGAAAGTTTTAGTTTTTTAAGAAGTACTCATGCAGATATATATATATATATTTTTCCCAGTCCTTTTTTTAAGAGACGGTCTTTATTGGGTCTGCACCTCCATCCTTGATCTTGTTAGCAATGCTGTTTTTGCTGTTAGTCGG
``````

Output:

``````Gene   Frequency
Gene1: 3
Gene2 6.3
....
``````

I was thinging of something like this, but I dont now how to define the positions requirements:

``````freq <- sapply(gregexpr("GTG",x),function(x)if(x[[1]]!=-1) length(x) else 0)
``````

Here is an alternate solution that does not use a pattern match. Not that it will matter much.

``````use strict;
use warnings;

my \$gene;
while ( my \$line = <> ) {
if ( \$line =~ /^>(.+)/ ) {
\$gene = \$1;
next;
}

chomp \$line;

printf "%s: %s\n",
\$gene,
( grep { \$_ eq 'GTG' } split /(...)/, \$line ) / length \$line;
}
``````

Output:

``````Gene1: 0.00483091787439614
Gene2: 0
Gene3: 0.00302114803625378
``````

It is essentially similar to Sobrique's answer, but assumes that the gene lines contain the right characters. It splits up the gene string into a list of three-character pieces and takes the ones that are literally `GTG`.

The splitting works by abusing the fact that `split` uses a pattern as the delimiter, and that it will also capture the delimiter if a capture group is used. Here's an example.

``````my @foo = split /(...)/, '1234567890';
p @foo; # from Data::Printer

__END__
[
[0] "",
[1] 123,
[2] "",
[3] 456,
[4] "",
[5] 789,
[6] 0
]
``````

The empty elements get filter out by `grep`. It might not be the most efficient way, but it gets the job done.

You can run it by calling `perl foo.pl horribly-large-gene-sequence.file`.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download