user2300940 user2300940 - 7 months ago 17
Perl Question

find frequency of substring in a set of strings

I have as input a gene list where each genes has a header like >SomeText.
For each gene I would like to find the frequency of the string

. (number of occurences divided by length of gene). The string should only be counted if it starts at position 1,4,7,10 etc (every thids position).



Gene Frequency
Gene1: 3
Gene2 6.3

I was thinging of something like this, but I dont now how to define the positions requirements:

freq <- sapply(gregexpr("GTG",x),function(x)if(x[[1]]!=-1) length(x) else 0)


Here is an alternate solution that does not use a pattern match. Not that it will matter much.

use strict;
use warnings;

my $gene;
while ( my $line = <> ) {
    if ( $line =~ /^>(.+)/ ) {
        $gene = $1;

    chomp $line;

    printf "%s: %s\n", 
        ( grep { $_ eq 'GTG' } split /(...)/, $line ) / length $line;


Gene1: 0.00483091787439614
Gene2: 0
Gene3: 0.00302114803625378

It is essentially similar to Sobrique's answer, but assumes that the gene lines contain the right characters. It splits up the gene string into a list of three-character pieces and takes the ones that are literally GTG.

The splitting works by abusing the fact that split uses a pattern as the delimiter, and that it will also capture the delimiter if a capture group is used. Here's an example.

my @foo = split /(...)/, '1234567890';
p @foo; # from Data::Printer

    [0] "",
    [1] 123,
    [2] "",
    [3] 456,
    [4] "",
    [5] 789,
    [6] 0

The empty elements get filter out by grep. It might not be the most efficient way, but it gets the job done.

You can run it by calling perl horribly-large-gene-sequence.file.