Biotechgeek - 18 days ago
Python Question

How to read files with separators in Python and append characters at the end?

I have a file format that looks like this


Note: "$$$" divides the file, so that everything before the first $$$ is Set 1, everything between the first and second $$$ is Set 2, everything after the next $$$ is Set 3, and so on.

I have to do the following:

(1) Concatenate the sequences following ">" within each set. So I have to join "ATGC", "TTTT", "ATGC" and store the result as (1); join "ATCG", "TT-G", "TTCG", "TT-G", "TTCG", "TTCG" and store the result as (2); then concatenate again and store the result as (3).

The output should be a list that looks like:


(2) Then I find the set that has the maximum length; here that is Set 2.

(3) If the length of Set i is not equal to the length of Set 2, I add "Z" to the end of Set i until its length equals the length of Set 2.

(4) I replace all "-" with "Z"

The output should look like:


Here is the code I have attempted:

in_file = open('c:/test.txt','r')
org = []
seqlist = []
seqstring = ""

for line in in_file:
    if line.startswith("$$$"):
        if seqstring!= "":
            seqstring = ""
    elif line.startswith(">"):
        seqstring += line.rstrip("\n")

setdraft = seqlist
maxsetlength = max(len(seqlist))

setdraft2 =[]

for i in setdraft:
    if len(i) != maxsetlength:
        setdraft2 = i.append("Z")

setfinal =[]

for j in setdraft2:
    if j in setdraft2 =="-":
        setfinal = j.insert ("Z")

The above script does not work. It gives me multiple errors.
Eg. When I print
it gives me the output


which is the same as the input

Traceback (most recent call last):
  File "C:/Users/ACER/Desktop/", line 25, in <module>
    maxsetlength = max(len(seqlist))
TypeError: 'int' object is not iterable
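
The TypeError comes from max(len(seqlist)): len(seqlist) is a single integer (the number of sets), while max() expects something it can iterate over. A second problem would show up right after it, because Python strings are immutable and have no append() or insert() methods. A minimal sketch of steps (2) to (4), assuming the concatenated sets are already in a list of strings, could look like this. The values for Sets 1 and 2 come from the question's description, and Set 3 is inferred from the answer's output further down:

```python
# Concatenated sets as described in the question (Set 3 inferred from the answer's output).
seqlist = ["ATGCTTTTATGC", "ATCGTT-GTTCGTT-GTTCGTTCG", "TTTTATGC"]

# max() needs an iterable of lengths; len(seqlist) alone is just the int 3.
maxsetlength = max(len(seq) for seq in seqlist)

# Strings are immutable, so build new strings instead of calling append()/insert():
# pad on the right with "Z" up to the longest length, then replace every "-" with "Z".
setfinal = [seq.ljust(maxsetlength, "Z").replace("-", "Z") for seq in seqlist]

for seq in setfinal:
    print(seq)
```

str.ljust() leaves a string unchanged when it is already long enough, so the explicit length check from the original loop is not needed.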


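It's unclear how fragile your data set is, but if it follows the pattern above (namely, the last 4 characters of each entry are the ones you are looking for), then you can use a couple of split()s to pull out the sequences, followed by itertools.zip_longest and a second zip to pad the shorter sets with Z. Here s is the file's contents read into a single string: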

>>> import itertools as it
>>> import string
>>> def digit_index(s):
...     for i, c in enumerate(s):
...         if c in string.digits:
...             return i
...     return 0
>>> parsed = [''.join(y[digit_index(y)+1:].replace('-', 'Z') for y in x.split('>')) for x in s.split('$$$')]
>>> parsed
>>> [''.join(x) for x in zip(*it.zip_longest(*parsed, fillvalue='Z'))]
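
To see why the zip_longest/zip round trip pads the shorter strings, here is a minimal sketch on made-up strings (not the question's data). zip_longest transposes the strings into columns and fills the gaps with "Z"; zip(*...) then transposes the columns back into rows, which are now all the same length:

```python
import itertools as it

# Made-up input of unequal lengths, purely for illustration.
parsed = ["AB", "ABCD", "A"]

# Transpose into columns; short strings contribute the fill value "Z".
columns = list(it.zip_longest(*parsed, fillvalue="Z"))
print(columns)  # [('A', 'A', 'A'), ('B', 'B', 'Z'), ('Z', 'C', 'Z'), ('Z', 'D', 'Z')]

# Transpose back into rows and join each row into a string.
padded = ["".join(row) for row in zip(*columns)]
print(padded)   # ['ABZZ', 'ABCD', 'AZZZ']
```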

If you don't mind getting a list of tuples instead of strings, you can skip join()ing the characters back together:

>>> list(zip(*it.zip_longest(*parsed, fillvalue='Z')))
[('A', 'T', 'G', 'C', 'T', 'T', 'T', 'T', 'A', 'T', 'G', 'C', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z'), 
 ('A', 'T', 'C', 'G', 'T', 'T', 'Z', 'G', 'T', 'T', 'C', 'G', 'T', 'T', 'Z', 'G', 'T', 'T', 'C', 'G', 'T', 'T', 'C', 'G'),
 ('T', 'T', 'T', 'T', 'A', 'T', 'G', 'C', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z')]
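
If you would rather read the file line by line, as in the question's attempt, a plain-loop version is also possible. This is only a sketch under the same assumption as above: each record sits on one line starting with ">", and its last 4 characters are the sequence. The function names parse_sets and pad_sets, the width parameter, and the ">seqN" labels in the sample are hypothetical, since the real file layout is not shown:

```python
def parse_sets(lines, separator="$$$", width=4):
    """Concatenate the last `width` characters of every '>' line, one string per set."""
    sets = []
    current = []
    for line in lines:
        line = line.strip()
        if line.startswith(separator):
            if current:  # skip empty sets, e.g. a separator at the very start
                sets.append("".join(current))
            current = []
        elif line.startswith(">"):
            current.append(line[-width:])
    if current:  # the last set has no separator after it
        sets.append("".join(current))
    return sets


def pad_sets(sets, fill="Z"):
    """Replace '-' with the fill character and right-pad every set to the longest length."""
    longest = max(len(s) for s in sets)
    return [s.replace("-", fill).ljust(longest, fill) for s in sets]


# Hypothetical sample lines; the labels after '>' are invented for illustration.
sample = [
    ">seq1 ATGC", ">seq2 TTTT", ">seq3 ATGC",
    "$$$",
    ">seq1 ATCG", ">seq2 TT-G", ">seq3 TTCG", ">seq4 TT-G", ">seq5 TTCG", ">seq6 TTCG",
    "$$$",
    ">seq1 TTTT", ">seq2 ATGC",
]

result = pad_sets(parse_sets(sample))
for seq in result:
    print(seq)

# With a real file, pass the open file object instead of the sample list:
# with open('c:/test.txt') as in_file:
#     result = pad_sets(parse_sets(in_file))
```

Note that pad_sets() raises ValueError if no sets were found, because max() is then called on an empty sequence.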