pietv8x - 1 year ago
Python Question

Regex avoid "rest of string" split results

I have this code to split a complicated CSV file into chunks. The hard part is that commas may also appear inside sections delimited by "", and those commas must not be split on. The regex I am using to find commas outside the "" sections works fine:

comma_re = re.compile(r',(?=([^"]*""[^"]*"")*[^"]*$)')


import re

test = 'Test1,Test2,"",Test3,Test4"",Test5'
comma_re = re.compile(r',(?=([^"]*""[^"]*"")*[^"]*$)')

print(comma_re.split(test))


Actual output:

['Test1', 'Test2,"",Test3,Test4""', 'Test2', '"",Test3,Test4""', '"",Test3,Test4""', None, 'Test5']

Expected output:

['Test1', 'Test2', '"",Test3,Test4""', 'Test5']

How can I avoid the useless split results?

Thanks in advance!

Stupid me, I didn't even know there was a built-in csv module; I went on to use that instead. Thanks for your efforts!

Answer Source

The following works for the example you gave, although it will not work if the input deviates from that format.

import re

# "text" rather than "input", to avoid shadowing the built-in input()
text = 'Test1,Test2,"",Test3,Test4"",Test5'
output = re.split(r'(?<!"),(?![^",]+")|,(?=[^"]*$)', text)

# ['Test1', 'Test2', '"",Test3,Test4""', 'Test5']
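The extra entries in the question's output are explained by how re.split treats capturing groups: the text matched by each group is also placed in the result list (None when the group did not take part in the match). As a minimal sketch, assuming the question's pattern is otherwise kept as-is, making the group non-capturing with (?:...) removes them:

```python
import re

test = 'Test1,Test2,"",Test3,Test4"",Test5'

# Same pattern as in the question, but with (?:...) so that re.split
# does not insert the group's text into the result list.
comma_re = re.compile(r',(?=(?:[^"]*""[^"]*"")*[^"]*$)')

parts = comma_re.split(test)
print(parts)
# ['Test1', 'Test2', '"",Test3,Test4""', 'Test5']
```

This only changes what re.split returns; the pattern still splits on the same commas as before.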


You should really be using a CSV parser for this. If you can't for some reason, do some manual string processing: go through the string character by character and split whenever you see a comma, unless you are currently inside a quoted string. Something like the following:

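text = 'Test1,Test2,"",Test3,Test4"",Test5'

inside_quoted = False
output = []
last_index = 0
i = 0

while i < len(text):
    if text[i] == ',' and not inside_quoted:
        # An unquoted comma ends the current field.
        output.append(text[last_index:i])
        last_index = i + 1
    elif text[i:i + 2] == '""':
        # A "" pair opens or closes a quoted section; skip both characters.
        inside_quoted = not inside_quoted
        i += 1
    i += 1

# Whatever follows the last split point is the final field
# (an empty string if the input ends with a comma).
output.append(text[last_index:])

# output == ['Test1', 'Test2', '"",Test3,Test4""', 'Test5']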
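
For comparison, here is a minimal sketch of the CSV-parser route using the standard csv module. It assumes the real file quotes fields with single double-quote characters, as in standard CSV. The sample line below is adapted accordingly and is not the question's exact input, because the default dialect would not treat the doubled "" delimiters as field quotes.

```python
import csv

# Assumed input: the question's fields, but quoted the standard CSV way.
line = 'Test1,Test2,",Test3,Test4",Test5'

# csv.reader accepts any iterable of lines, so a one-element list is enough.
reader = csv.reader([line])
fields = next(reader)

print(fields)
# ['Test1', 'Test2', ',Test3,Test4', 'Test5']
```

Note that the parser strips the surrounding quotes from the quoted field, whereas the regex and manual approaches above leave them in place.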