Jamie Leigh - 26 days ago
Python Question

Split in Python and strip whitespace

I am learning Python and am currently working on reading in a file, splitting the lines, and then printing specific elements. I am having trouble splitting multiple times, though. The file I am working on has many lines that look like this:

c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754


I am trying to split it first on tabs and newlines ("\t\n") and then split the resulting elements on |. I have tried .split and .strip and am not having much luck. I figured that if I worked on a single line first I could get the idea down, and then turn it into a loop that reads the file:

blast_out = ("c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754")
fields = blast_out.strip(' \t\r\n').split()
subFields = fields.split("|")
print(fields)
print(subFields)


print(fields) outputs:

['c0_g1_i1|m.1', 'gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO', '100.00', '372', '0', '0', '1', '372', '1', '372', '0.0', '754']


print(subFields) generates an error:

subFields = fields.split('|')
AttributeError: 'list' object has no attribute 'split'


This is what I did to try to strip the whitespace and tabs and then split on |, but it doesn't work. Eventually, my desired output from this single string would be:

c0_g1_i1 m.1 Q9HGP0.1 100.0

Answer

You have a list of separate strings now, and a list has no split() method; only the strings inside it do. It looks as if the input format encodes nested lists: the outer level is delimited by whitespace, the inner by | characters.

You can split the outer string, then split each resulting element again in a list comprehension:

[item.split('|') for item in blast_out.split()]

Note that the str.strip() call is entirely redundant: str.split() with no argument (or None as the first argument) already removes leading and trailing whitespace.
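A quick way to convince yourself of this is to compare the two calls directly. The line below is a shortened, made-up variant of your sample with extra whitespace at both ends; the last call shows that splitting on an explicit '\t' does not get the same clean-up, which is an extra point beyond what you asked.

```python
# Shortened sample line with stray leading/trailing whitespace (illustrative only).
line = "  \tc0_g1_i1|m.1\tgi|74665200|sp|Q9HGP0.1|PVG4_SCHPO\t100.00\r\n"

with_strip = line.strip(' \t\r\n').split()
without_strip = line.split()

# Both give the same three fields; the strip() adds nothing.
print(with_strip == without_strip)
print(without_strip)

# With an explicit separator, surrounding whitespace is NOT removed.
tab_only = line.split('\t')
print(tab_only)
```

So strip() only matters if you pass a specific separator such as '\t' to split().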

If you expected a flat list, you'd add another loop to the comprehension:

[value for item in blast_out.split() for value in item.split('|')]

The former would be preferable if the number of items in the inner lists is variable; it is easier to find the first or last element of a nested list than to figure out in a flat list where each whitespace-delimited section starts or ends.
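To make that concrete, here is a small sketch with two shortened lines. The second line is invented for illustration (its second field has three |-separated parts instead of five), so treat it as an assumption about what other rows in your file might look like.

```python
# First line is a shortened version of the sample; the second is hypothetical.
lines = [
    "c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00",
    "c0_g2_i1|m.2 sp|P12345.1|NAME_HYPO 98.50",
]

last_parts = []    # last '|' part of the second field, via the nested form
flat_lengths = []  # total number of values in the flat form

for line in lines:
    nested = [item.split('|') for item in line.split()]
    flat = [value for item in line.split() for value in item.split('|')]
    # Nested: the same index works regardless of how long the inner list is.
    last_parts.append((nested[1][-1], nested[2][0]))
    # Flat: the position of the third field shifts with the inner length.
    flat_lengths.append(len(flat))

print(last_parts)
print(flat_lengths)
```

With the nested form, nested[2][0] is the third whitespace-delimited field on both lines, whereas in the flat form it sits at index 7 on the first line and index 5 on the second.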

The final values for your example can then be extracted with one of the following two expressions, depending on which variant you picked (result here is the list that variant produced):

(result[0][0], result[0][1], result[1][3], result[2][0])

or

(result[0], result[1], result[5], result[7])

Demo:

>>> blast_out = "c0_g1_i1|m.1    gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO      100.00  372     0       0       1       372     1       372     0.0       754"
>>> [item.split('|') for item in blast_out.split()]
[['c0_g1_i1', 'm.1'], ['gi', '74665200', 'sp', 'Q9HGP0.1', 'PVG4_SCHPO'], ['100.00'], ['372'], ['0'], ['0'], ['1'], ['372'], ['1'], ['372'], ['0.0'], ['754']]
>>> (_[0][0], _[0][1], _[1][3], _[2][0])
('c0_g1_i1', 'm.1', 'Q9HGP0.1', '100.00')
>>> [value for item in blast_out.split() for value in item.split('|')]
['c0_g1_i1', 'm.1', 'gi', '74665200', 'sp', 'Q9HGP0.1', 'PVG4_SCHPO', '100.00', '372', '0', '0', '1', '372', '1', '372', '0.0', '754']
>>> (_[0], _[1], _[5], _[7])
('c0_g1_i1', 'm.1', 'Q9HGP0.1', '100.00')