zebra zebra - 1 year ago 116
Perl Question

Finding inverted repeats in DNA sequence

I have a long string of DNA sequence, and I need to find regions consist of two palindromic sequences flanking a spacer sequence.

The input is:


cgtacacgagtagtcgtagctgtcagtcgatcgtacgtacgtagctgctgtagcactatcgaccccacacgtgtgtacacgatgcacagtcgtctatcacatgctagcgctgcccgtacgGATGGCCAAGGCCATCcgatcgctagctagcgccgcgcgtagcccgatcgagacatgctagcagttgtgctgatgtcgagatagctgtgatgcgatgctagcgccgcctagccgcctcgtgtaggctggatgcgatcgatcgatgctagcggcgcgatcgatgcactagccgtagcgctagctgatcgatcgtaGATGGCCAAGGCCATCcgcgtagatacgacatgccgggggtatataa


This is my code:

use strict;
use warnings;

my $input= $ARGV[0];
chomp $input;
open (my $fh_in, "<", $input) or die "Cannot open file $input $!";
my $dna= <$fh_in>;
chomp $dna;

#######################################################################################

if ($dna=~ /[^ACGT]/gi) {
print "This is not a valid DNA sequence, it has unknown base(s)\n";
}

$dna=~ tr/[acgt]/[ACGT]/;


######################################################################################

print "Minimum length of palindromic sequence?\n";
my $min= <STDIN>;
chomp $min;

print "Maximum length of palindromic sequence?\n";
my $max= <STDIN>;
chomp $max;

print "Minimum length of spacer region?\n";
my $min_spacer= <STDIN>;
chomp $min_spacer;

print "Maximum length of spacer region?\n";
my $max_spacer= <STDIN>;
chomp $max_spacer;

######################################################################################

my $dna_length= length($dna);
my ($length , $offset , $string_1 , $string_2);

for ($offset= 0 ; $offset <= $dna_length-$max-$max-$max_spacer ; $offset++) {
for ($length= $min ; $length <= $max ; $length++) {
$string_1= substr ($dna, $offset, $length);
$string_2= reverse $string_1;
$string_2=~ tr/[ACGT]/[TGCA]/;
if ($dna=~ /(($string_1)([ACGT]{$min_spacer,$max_spacer})($string_2))/) {
print "IR starts at $offset => $2***$3***$4\n$1\n\n";
}
}
}


With parameters:
$min = 6 , $max = 12 , $min_spacer = 4 , $max_spacer = 12
The output I get is:

IR starts at 26 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA

IR starts at 27 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 118 => CGGATG***GCCAAGGC***CATCCG
CGGATGGCCAAGGCCATCCG

IR starts at 118 => CGGATGG***CCAAGG***CCATCCG
CGGATGGCCAAGGCCATCCG

IR starts at 118 => CGGATGGC***CAAG***GCCATCCG
CGGATGGCCAAGGCCATCCG

IR starts at 119 => GGATGG***CCAAGG***CCATCC
GGATGGCCAAGGCCATCC

IR starts at 119 => GGATGGC***CAAG***GCCATCC
GGATGGCCAAGGCCATCC

IR starts at 120 => GATGGC***CAAG***GCCATC
GATGGCCAAGGCCATC

IR starts at 136 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 164 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 252 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 254 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 254 => ATCGATCG***ATGCTAGCGGCG***CGATCGAT
ATCGATCGATGCTAGCGGCGCGATCGAT

IR starts at 255 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA

IR starts at 256 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 258 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 274 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 276 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 304 => ATCGAT***GCTAGCGGCGCG***ATCGAT
ATCGATGCTAGCGGCGCGATCGAT

IR starts at 304 => ATCGATCG***ATGCTAGCGGCG***CGATCGAT
ATCGATCGATGCTAGCGGCGCGATCGAT

IR starts at 305 => TCGATCG***ATGCTAGCGGCG***CGATCGA
TCGATCGATGCTAGCGGCGCGATCGA

IR starts at 306 => CGATCG***ATGCTAGCGGCG***CGATCG
CGATCGATGCTAGCGGCGCGATCG

IR starts at 314 => GATGGC***CAAG***GCCATC
GATGGCCAAGGCCATC


However when I check the region of my first hit (highlighted in bold in input), the offset of this hit doesn't seem to be at position 26. Could anyone enlighten me what's wrong with my code? Thanks.

Answer Source

Your problem is your regex is looking for a palindromes anywhere in the sequence, not just at the location of the offset. "ATCGATCG" occurs at offset 26, so it matches. You need to add some positional information to your regex. Try something like

/^[ACGT]{$offset}(($string_1)([ACGT]{$min_spacer,$max_spacer})($string_2))/
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download