eniesee - 7 months ago
Python Question

How to check if a list element contains a regex in Python?

I have a file with information about sequences. Every sequence spans several lines, and the sequences are separated by five blank lines. I want to read the file and split it on five newlines, so that I get a list with one sequence per element. Then I want to remove the sequences that do not contain the regular expression, so that in the end I have a list with only the sequences that contain the regex.

This is what I have so far. Can anyone help me further?

import re
def main():
    ReadFile()
    file = open ("filename.txt", "r")
    CreateList(file, data)
    RegEx(file, data)

def ReadFile()
    try:
        file = open ("filename.txt", "r")
    except IOError:
        print ("Can't open the file")
    except:
        print ("Something went wrong.")

def CreateList(file, data)
    data = file.readlines()
    data = data.split('\n\n\n\n\n')

def RegEx(file, data)
    regex = ("[AG].{4}GK[ST]")
    for element in data:
        if regex not in element:
            data.remove(element)
    print (data)

main()


The file looks like this:

Hits for PS00017|ATP_GTP_A (pattern) ATP/GTP-binding site motif A (P-loop) : [occurs frequently]
Pattern: [AG]-x(4)-G-K-[ST]
Approximate number of expected random matches in ~ 100'000 sequences (50'000'000 residues): 3371


>sp|Q6GZX2|003R_FRG3G (438 aa)
Uncharacterized protein 3R. [Frog virus 3 (isolate Goorha) (FV-3)]
MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSD
FKTVLGSALLAVERDMVHVVPKYLQTPGILHDMLVLLTPIFGEALSVDMSGATDVMVQQIATAGFVDVDPLHSSVSWKDN
VSCPVALLAVSNAVRTMMGQPCQVTLIIDVGTQNILRDLVNLPVEMSGDLQVMAYTKDPLGKVPAVGVSVFDSGSVQKGD
AHSVGAPDGLVSFHTHPVSSAVELNYHAGWPSNVDMSSLLTMKNLMHVVVAEEGLWTMARTLSMQRLTKVLTDAEKDVMR
AAAFNLFLPLNELRVMGTKDSNNKSLKTYFEVFETFTIGALMKHSGVTPTAFVDRRWLDNTIYHMGFIPWGRDMRFVVEY
DLDGTNPFLNTVPTLMSVKRKAKIQEMFDNMVSRMVTS
2 - 9: ArpllGKT


>sp|Q6GZX1|004R_FRG3G (60 aa)
Uncharacterized protein 004R. [Frog virus 3 (isolate Goorha) (FV-3)]
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
33 - 40: GyyydGKT


>sp|Q6GZW0|015R_FRG3G (322 aa)
Uncharacterized protein 015R. [Frog virus 3 (isolate Goorha) (FV-3)]
MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGKSTLIKALLDSKRHIIPCAVVISGSEEANGFYKGVVPDLFI
YHQFSPSIIDRIHRRQVKAKAEMGSKKSWLLVVIDDCMDNAKMFNDKEVRALFKNGRHWNVLVVIANQYVMDLTPDLRSS
VDGVFLFRENNVTYRDKTYANFASVVPKKLYPTVMETVCQNYRCMFIDNTKATDNWHDSVFWYKAPYSKSAVAPFGARSY
WKYACSKTGEEMPAVFDNVKILGDLLLKELPEAGEALVTYGGKDGPSDNEDGPSDDEDGPSDDEEGLSKDGVSEYYQSDL
DD
34 - 41: GkpgsGKS


>sp|P32234|128UP_DROME (368 aa)
GTP-binding protein 128up. [Drosophila melanogaster (Fruit fly)]
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVAKTGDARVGFVGFPSVGKSTL
LSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPGIIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKL
LEHELEGFGIRLNKKPPNIYYKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKGQLPDYNSPVVLHNERTSIED
FCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDEDVVQIVKKV
71 - 78: GfpsvGKS


The data should look like this (but only proteins containing the regex):

['>sp|Q6GZX2|003R_FRG3G (438 aa)
Uncharacterized protein 3R. [Frog virus 3 (isolate Goorha) (FV-3)]
MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSD
FKTVLGSALLAVERDMVHVVPKYLQTPGILHDMLVLLTPIFGEALSVDMSGATDVMVQQIATAGFVDVDPLHSSVSWKDN
VSCPVALLAVSNAVRTMMGQPCQVTLIIDVGTQNILRDLVNLPVEMSGDLQVMAYTKDPLGKVPAVGVSVFDSGSVQKGD
AHSVGAPDGLVSFHTHPVSSAVELNYHAGWPSNVDMSSLLTMKNLMHVVVAEEGLWTMARTLSMQRLTKVLTDAEKDVMR
AAAFNLFLPLNELRVMGTKDSNNKSLKTYFEVFETFTIGALMKHSGVTPTAFVDRRWLDNTIYHMGFIPWGRDMRFVVEY
DLDGTNPFLNTVPTLMSVKRKAKIQEMFDNMVSRMVTS
2 - 9: ArpllGKT',


'>sp|Q6GZX1|004R_FRG3G (60 aa)
Uncharacterized protein 004R. [Frog virus 3 (isolate Goorha) (FV-3)]
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
33 - 40: GyyydGKT',


'>sp|Q6GZW0|015R_FRG3G (322 aa)
Uncharacterized protein 015R. [Frog virus 3 (isolate Goorha) (FV-3)]
MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGKSTLIKALLDSKRHIIPCAVVISGSEEANGFYKGVVPDLFI
YHQFSPSIIDRIHRRQVKAKAEMGSKKSWLLVVIDDCMDNAKMFNDKEVRALFKNGRHWNVLVVIANQYVMDLTPDLRSS
VDGVFLFRENNVTYRDKTYANFASVVPKKLYPTVMETVCQNYRCMFIDNTKATDNWHDSVFWYKAPYSKSAVAPFGARSY
WKYACSKTGEEMPAVFDNVKILGDLLLKELPEAGEALVTYGGKDGPSDNEDGPSDDEDGPSDDEEGLSKDGVSEYYQSDL
DD
34 - 41: GkpgsGKS',


'>sp|P32234|128UP_DROME (368 aa)
GTP-binding protein 128up. [Drosophila melanogaster (Fruit fly)]
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVAKTGDARVGFVGFPSVGKSTL
LSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPGIIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKL
LEHELEGFGIRLNKKPPNIYYKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKGQLPDYNSPVVLHNERTSIED
FCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDEDVVQIVKKV
71 - 78: GfpsvGKS']

Answer
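import re

# Read the whole file; "with" closes it even if an error occurs.
with open("ploop.txt") as file:
    text = file.read()

# Records are separated by blank lines; the first block is the file header, so skip it.
proteins = text.split("\n\n")[1:]

# Keep only the records that contain the P-loop motif.
proteinsMatching = []
for protein in proteins:
    if re.search(r"[AG].{4}GK[ST]", protein):
        proteinsMatching.append(protein)

toWrite = ""
for protein in proteinsMatching:
    # Accession code: the six characters after ">sp|".
    accessionCode = re.findall(r">sp\|(.{6})", protein)[0]
    # Organism: the first bracketed text on the description line.
    organism = re.findall(r"\n.+?\[(.+?)\]", protein)[0]
    print(accessionCode, organism)
    toWrite += accessionCode + " " + organism + "\n"

# Write one "accession organism" line per matching protein.
with open("results.txt", "w") as f:
    f.write(toWrite)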

# Q6GZX2 Frog virus 3 (isolate Goorha) (FV-3)
# Q6GZX1 Frog virus 3 (isolate Goorha) (FV-3)
# Q6GZW0 Frog virus 3 (isolate Goorha) (FV-3)
# P32234 Drosophila melanogaster (Fruit fly)
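
If you want the list described in the question (whole records rather than accession codes), a minimal sketch could look like the one below. The function name `matching_records` and the small inline sample are made up for illustration. It assumes records are separated by one or more blank lines and that every record starts with ">". Two things in the question's code get in the way: `regex not in element` is a plain substring test, so the pattern is never treated as a regex, and calling `data.remove()` while looping over `data` skips elements. Building a new list avoids both.

```python
import re

P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def matching_records(text, pattern=P_LOOP):
    """Return the records (blocks separated by blank lines) that contain pattern."""
    # Split on runs of blank lines, however many there are.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    # Keep only real records (they start with ">"), which drops the file header.
    records = [b for b in blocks if b.startswith(">")]
    # Build a new list instead of removing from the one being iterated.
    return [r for r in records if pattern.search(r)]


# Illustrative sample: a shortened header, one real record from the question,
# and one invented record ("FAKE") that does not contain the motif.
sample = (
    "Hits for PS00017\nPattern: [AG]-x(4)-G-K-[ST]\n\n\n"
    ">sp|Q6GZX1|004R_FRG3G (60 aa)\n"
    "MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY\n\n\n"
    ">sp|XXXXXX|FAKE (10 aa)\nMMMMMMMMMM\n"
)

result = matching_records(sample)
print(len(result))
print(result[0].splitlines()[0])
```

With your real file you would pass the file's contents as `text`. Because every record in the file ends with a hit line such as `2 - 9: ArpllGKT`, all four records from the question end up in the list.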

updated (again) for new requirements

Regex1 (splitting the text file into a list of proteins): https://regex101.com/r/gU0gX5/1

Regex2 (your regex, showing that they all match): https://regex101.com/r/nZ0pD6/1
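
The Regex2 check can also be run locally. This is only a sketch: the `hits` list is copied by hand from the hit lines in the file from the question. The PROSITE pattern `[AG]-x(4)-G-K-[ST]` becomes the Python regex `[AG].{4}GK[ST]` by dropping the hyphens and writing `x(4)` as `.{4}`.

```python
import re

# PROSITE [AG]-x(4)-G-K-[ST] written as a Python regex.
pattern = re.compile(r"[AG].{4}GK[ST]")

# Hit strings copied from the "start - end: hit" lines of the file.
hits = ["ArpllGKT", "GyyydGKT", "GkpgsGKS", "GfpsvGKS"]

for hit in hits:
    # fullmatch requires the whole eight-character hit to fit the pattern.
    print(hit, bool(pattern.fullmatch(hit)))
```

All four hits fit the pattern, which is why every record in the sample file passes the filter.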