Kyle Weise - 1 month ago
Perl Question

Perl regular expression (starts with ATG and ends with TAG, TAA, or TGA)

I need a regular expression in Perl that matches a sequence starting with ATG and ending with TAG, TAA, or TGA. This is the code I have so far.

my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';

while($sequence =~ ____) {
print $1;
}

Answer

Since you're dealing with codons here, you probably forgot to mention that the number of nucleotides in between must be a multiple of 3.

Code:

my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';

while($sequence =~ /ATG(?:[ACTG]{3})*?T(?:A[AG]|GA)/g)
{
    print $&."\n";
}

Output:

ATGGTTTCTCCCATCTCTCCATCGGCATAA
ATGATCTAA

Description:

  • ATG - Matches "ATG" literally
  • (?:[ACTG]{3})*? - is a non-capturing group, repeated 0 or more times, as few as possible (lazy quantifier, the extra ?), matching:
    • [ACTG]{3} - 3 characters/nucleotides (either "A", "C", "T" or "G")
  • T(?:A[AG]|GA) - matches "TAA", "TAG", or "TGA". Also, as Borodin commented, this can be written as (?:TAG|TAA|TGA) if you prefer to improve readability.
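
If the one-liner is hard to read, the same pattern can be spread out with the /x flag, which lets you put whitespace and comments inside the regex. This is only a sketch: the helper name find_orfs is made up for illustration, and it uses the (?:TAG|TAA|TGA) spelling of the stop codons mentioned above.

```perl
use strict;
use warnings;

# Same pattern as the one-liner, written with /x so each part can be commented.
my $orf = qr/
    ATG                    # start codon
    (?: [ACTG]{3} )*?      # whole codons, as few as possible
    (?: TAG | TAA | TGA )  # first in-frame stop codon
/x;

# find_orfs is a hypothetical helper: it returns every non-overlapping match.
sub find_orfs {
    my ($dna) = @_;
    # In list context, /g returns the text captured by group 1 for each match.
    my @found = $dna =~ /($orf)/g;
    return @found;
}

print "$_\n" for find_orfs('AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA');
# prints:
# ATGGTTTCTCCCATCTCTCCATCGGCATAA
# ATGATCTAA
```

Capturing the whole match in a group also avoids $&, which older Perl versions penalise in performance.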


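But if you also need to match overlapping sequences, you should use a lookahead to prevent the match from consuming the characters. Only the ATG is consumed, and the rest of the sequence is captured in group 1 inside the lookahead, so the next search can start inside the previous match.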

Code:

my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';

while($sequence =~ /ATG(?=((?:[ACTG]{3})*?T(?:A[AG]|GA)))/g)
{
    print $&.$1."\n";
}

Output:

ATGGTTTCTCCCATCTCTCCATCGGCATAA
ATGATCTAA
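
The sample sequence happens to give the same output for both expressions, so here is a small sketch with a made-up sequence where the difference shows up. The sequence 'ATGCATGAATAGCTGA' and the two helper names are assumptions for illustration only. A second ATG (offset 4) sits inside the first match.

```perl
use strict;
use warnings;

# Made-up sequence: ATG CAT GAA TAG starts at offset 0,
# and ATG AAT AGC TGA starts at offset 4, inside the first match.
my $dna = 'ATGCATGAATAGCTGA';

# Hypothetical helper: the first expression, which consumes the whole match.
sub consuming_matches {
    my ($seq) = @_;
    my @found = $seq =~ /(ATG(?:[ACTG]{3})*?T(?:A[AG]|GA))/g;
    return @found;
}

# Hypothetical helper: the lookahead expression, which only consumes "ATG".
sub overlapping_matches {
    my ($seq) = @_;
    # Each list element is group 1 (everything after ATG), so put ATG back in front.
    return map { "ATG$_" } $seq =~ /ATG(?=((?:[ACTG]{3})*?T(?:A[AG]|GA)))/g;
}

print "consuming:   $_\n" for consuming_matches($dna);
print "overlapping: $_\n" for overlapping_matches($dna);
# prints:
# consuming:   ATGCATGAATAG
# overlapping: ATGCATGAATAG
# overlapping: ATGAATAGCTGA
```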


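And finally, this is a more efficient version of the last expression, using the "unrolling the loop" technique. Instead of a lazy quantifier, it greedily matches codons that cannot be stop codons (those that don't start with T, or that start with T but aren't followed by AA, AG, or GA), which should perform better when you're dealing with large sequences.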

Code:

my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';

while($sequence =~ /ATG(?=((?:[ACG][ACTG]{2})*(?:T(?!A[AG]|GA)[ACTG]{2}(?:[ACG][ACTG]{2})*)*T(?:A[AG]|GA)))/g)
{
    print $&.$1."\n";
}

Output:

ATGGTTTCTCCCATCTCTCCATCGGCATAA
ATGATCTAA
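
Before swapping in the unrolled expression, it may be worth checking that it returns the same matches as the lazy one on your own data. A minimal sketch, assuming the input only contains A, C, T and G. The helper names and the extra test sequences are made up for illustration.

```perl
use strict;
use warnings;

# The two lookahead expressions from above, precompiled.
my $lazy     = qr/ATG(?=((?:[ACTG]{3})*?T(?:A[AG]|GA)))/;
my $unrolled = qr/ATG(?=((?:[ACG][ACTG]{2})*(?:T(?!A[AG]|GA)[ACTG]{2}(?:[ACG][ACTG]{2})*)*T(?:A[AG]|GA)))/;

# Hypothetical helper: all (possibly overlapping) matches for one expression.
sub orfs_with {
    my ($re, $dna) = @_;
    return map { "ATG$_" } $dna =~ /$re/g;
}

# Hypothetical helper: true if both expressions give the same list of matches.
sub same_results {
    my ($dna) = @_;
    my @from_lazy     = orfs_with($lazy, $dna);
    my @from_unrolled = orfs_with($unrolled, $dna);
    return "@from_lazy" eq "@from_unrolled";
}

for my $dna (
    'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA',  # from the question
    'ATGCATGAATAGCTGA',   # made up: overlapping matches
    'ATGTTTTGGTAA',       # made up: codons starting with T that are not stop codons
    'ATGCC',              # made up: no stop codon, so no match
) {
    printf "%s: %s\n", $dna, same_results($dna) ? 'same' : 'different';
}
```

To see whether the unrolled version is actually faster on your sequences, you can time both with cmpthese from the core Benchmark module.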