cebach561 cebach561 - 4 days ago 4
Python Question

For loop issue in python while using regex for pattern matching in DNA analysis

I am fairly new to Python and I have an issue with my

for
loop that I can't quite seem to figure out.

I am trying to read into a FASTA file which has the following example text:

>seq1
AAACTACCGCGTTT
>seq2
AAACTGCAACTAGCGTTT
>seq3
AAACCGGAGTTACCTAGCGTTT


What I would like to do is read into my file and print the FASTA header (e.g. header is >seq1), then I want to match two unique patterns (in this e.g. "AAA" and "TTT") present in the DNA sequence and print the DNA sequence that is between these two patterns.

So my will like my output to look like this:

>seq1
CTACCGCG
>seq2
CTGCAACTAGCG
>seq3
CCGGAGTTACCTAGCG


I have the following code:

import re
def find_seq(filename):
with open(filename) as file:
seq=''
for line in file:
header = re.search(r'^>\w+', line)
if(header):
print (header.group())
seq = seq.replace('\n','')
find_Lpattern = re.sub(r'.*AAA', '',seq)
find_Rpattern = re.sub(r'TTT.*', '',find_Lpattern)
if(find_Rpattern):
print (find_Rpattern)
seq = ''
else:
seq += line
filename = 'test.txt'
print(find_seq(filename))


I keep getting this as my output:

>seq1
>seq2
CTACCGCG
>seq3
CTGCAACTAGCG


Essentially my for loop skips over seq1 and then assigns the DNA sequence from seq1 to seq2, and the iteration on my for loop is off. Could anyone please point me in the right direction so I can fix this issue?

Answer

Even assuming your indentation is set in the way that would produce the results you describe, your logic is off. You're printing the header before you handle the accumulated seq.

When you read line 1 of your file, your header regexp matches. At that point, seq is the empty string. It therefore prints the match, and runs your replace and re.sub calls on the empty string.

Then it reads line 2, "AAACTACCGCGTTT", and appends that to seq.

Then it reads line 3, ">seq2". That matches your header regexp, so it prints the header. Then in runs your replace and sub calls on seq - which is still "AAACTACCGCGTTT" from line 2.

You need to move your seq handling to before you print the headers, and consider what will happen when you run off the end of the file without finding a final header - you will still have 'seq' contents that you want to parse and print after your for loop has ended.

Or maybe look into the third-party biopattern library, which has the SeqIO module to parse FASTA files.