biogeek biogeek - 1 month ago 6
Python Question

How to parse data with binary elements into a list of lists in Python?

Sample looks like this:

lst = ['ms 20 3 -s 10 \n', '17954 11302 58011\n', '\n', '$$\n', 'segsites: 10\n', 'positions: 0.0706 0.2241 0.2575 0.889 \n', '0001000010\n', '0101000010\n', '0101010010\n', '0001000010\n', '\n', '$$\n', 'segsites: 10\n', 'positions: 0.0038 0.1622 0.1972 \n', '0110000110\n', '1001001000\n', '0010000110\n', '$$\n', 'segsites: 10\n', 'positions: 0.0155 0.0779 0.2092 \n', '0000001011\n', '0000001011\n', '0000001011\n']


Every new set starts with $$. I need to parse the data so that I end up with the following list of lists.

sample = [['0001000010', '0101000010', '0101010010', '0001000010'], ['0110000110', '1001001000', '0010000110'], ['0000001011', '0000001011', '0000001011']] # Required Output


Code attempted

sample =[[]]
sample1 = ""
seqlist = []

for line in lst:
    if line.startswith("$$"):
        if line in '01': #Line contains only 0's or 1
            sample1.append(line) #Append each line that with 1 and 0's in a string one after another
    sample.append(sample1.strip()) #Do this or last line is lost
print sample

Output:[[], '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


I am a newbie at parsing data and am trying to figure out how to get this right. Suggestions on how to modify the code, along with an explanation, are appreciated.

Answer

I'd do it in the following way:

import re

lst = ['ms 20 3 -s 10 \n', '17954 11302 58011\n', '\n', '$$\n', 'segsites: 10\n', 'positions: 0.0706 0.2241 0.2575 0.889 \n', '0001000010\n', '0101000010\n', '0101010010\n', '0001000010\n', '\n', '$$\n', 'segsites: 10\n', 'positions: 0.0038 0.1622 0.1972 \n', '0110000110\n', '1001001000\n', '0010000110\n', '$$\n', 'segsites: 10\n', 'positions: 0.0155 0.0779 0.2092 \n', '0000001011\n', '0000001011\n', '0000001011\n']

result = []
curr_group = []
for item in lst:
    item = item.rstrip() # Remove \n
    if '$$' in item:
        if len(curr_group) > 0: # Check to see if binary numbers have been found.
            result.append(curr_group)
            curr_group = []
    elif re.match('[01]+$', item): # Checks to see if string is binary (0s or 1s).
        curr_group.append(item)

result.append(curr_group) # Appends final group due to lack of ending '$$'. 

print(result)

Basically, you iterate through the items until you find '$$', then add the binary strings collected so far to the final result and start a new group. Every binary string you find (using the regex) is added to the current group.
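
As a side note on why the test in the question never matched: `line in '01'` asks whether the whole line is a substring of '01', which is only true for '', '0', '1' and '01'. Here is a minimal sketch of the difference, using a hypothetical helper `is_binary` (my name, not part of the code above) that wraps the same regex:

```python
import re

def is_binary(text):
    """Return True if text consists only of the characters 0 and 1."""
    return re.match('[01]+$', text) is not None

line = '0001000010\n'.rstrip()  # strip the newline, as in the loop above

print(line in '01')               # substring test: False
print(is_binary(line))            # regex test: True
print(is_binary('segsites: 10'))  # False
```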

Finally, you need to append the last group of binary strings after the loop, since there is no trailing '$$' to trigger it.
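
If you want to reuse this, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: `parse_groups` and its `sep` parameter are names I made up, the regex is swapped for a set comparison, and the new group is opened as soon as '$$' is seen, so no extra append is needed after the loop. Unlike the loop above, it ignores any binary lines that appear before the first '$$'.

```python
def parse_groups(lines, sep='$$'):
    """Collect runs of 0/1 strings, starting a new group at each separator."""
    groups = []
    current = None  # stays None until the first separator is seen
    for raw in lines:
        item = raw.strip()
        if item.startswith(sep):
            # Open a new group right away and register it in the result
            current = []
            groups.append(current)
        elif current is not None and item and set(item) <= {'0', '1'}:
            # Non-empty line made up only of 0s and 1s
            current.append(item)
    # Drop groups that never received a binary line
    return [g for g in groups if g]

# Shortened, made-up input in the same shape as the question's data
lst = ['ms 20 3 -s 10 \n', '$$\n', 'segsites: 10\n', '0001\n', '0101\n',
       '\n', '$$\n', 'positions: 0.1 \n', '0110\n']
print(parse_groups(lst))  # [['0001', '0101'], ['0110']]
```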