pietv8x - 1 year ago
Python Question

Regex avoid "rest of string" split results

I have this code to split a complicated CSV file into chunks. The hard part is that commas may also appear inside sections delimited by "", and those commas must not be split on. The regex I am using to find commas outside such sections works fine:

comma_re = re.compile(r',(?=([^"]*""[^"]*"")*[^"]*$)')


import re

test = 'Test1,Test2,"",Test3,Test4"",Test5'
comma_re = re.compile(r',(?=([^"]*""[^"]*"")*[^"]*$)')

print(comma_re.split(test))


Actual output:

['Test1', 'Test2,"",Test3,Test4""', 'Test2', '"",Test3,Test4""', '"",Test3,Test4""', None, 'Test5']

Expected output:

['Test1', 'Test2', '"",Test3,Test4""', 'Test5']

How can I avoid the useless split results?

Thanks in advance!

Stupid me, I didn't even know there was a built-in csv module. I have continued with that instead. Thanks for your efforts!


This works for the example you gave, but it will break if the input deviates from that format.

import re

text = 'Test1,Test2,"",Test3,Test4"",Test5'
output = re.split(r'(?<!"),(?![^",]+")|,(?=[^"]*$)', text)

print(output)
# ['Test1', 'Test2', '"",Test3,Test4""', 'Test5']

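As a side note on where the extra entries in the question's output come from: `re.split` adds the text of every capturing group to its result, with `None` for a group that did not take part in the match. A minimal sketch, assuming the only change wanted is to keep the question's pattern and make its group non-capturing with `(?:...)`; the variable names are illustrative only:

```python
import re

test = 'Test1,Test2,"",Test3,Test4"",Test5'

# The question's pattern: (...) is a capturing group, so re.split() inserts
# its contents (or None) into the result after every split point.
capturing = re.compile(r',(?=([^"]*""[^"]*"")*[^"]*$)')

# Same pattern, but (?:...) groups without capturing.
non_capturing = re.compile(r',(?=(?:[^"]*""[^"]*"")*[^"]*$)')

with_groups = capturing.split(test)
without_groups = non_capturing.split(test)

print(with_groups)     # includes the captured text and a None entry
print(without_groups)
# ['Test1', 'Test2', '"",Test3,Test4""', 'Test5']
```

The lookahead logic is unchanged, so this carries the same limitations as the original regex on input that does not follow the sample's format.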

You should really be using a CSV parser for this. If you can't for some reason, do the string processing manually: go through the input character by character and split whenever you see a comma, unless you are currently inside a quoted string. Something like the following:

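text = 'Test1,Test2,"",Test3,Test4"",Test5'

inside_quoted = False  # True while between an opening and a closing ""
output = []
last_index = 0

for i in range(len(text)):
    if text[i] == ',' and not inside_quoted:
        # Unquoted comma: close the current field.
        output.append(text[last_index:i])
        last_index = i + 1
    elif text[i:i + 2] == '""':
        # A "" pair opens or closes a quoted section.
        inside_quoted = not inside_quoted

# Whatever follows the last split point is the final field
# (an empty string if the input ends with an unquoted comma).
output.append(text[last_index:])

print(output)
# ['Test1', 'Test2', '"",Test3,Test4""', 'Test5']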
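
As for the CSV parser route recommended at the start of this answer, here is a minimal sketch using Python's standard `csv` module. It assumes standard CSV quoting, where a field containing commas is wrapped in single `"` characters; the sample line is made up for illustration and is not the doubled-`""` string from the question:

```python
import csv

# Assumed input: standard CSV quoting, where a field containing commas is
# wrapped in single " characters (not the doubled "" of the question's sample).
line = 'Test1,Test2,"Test3,Test4",Test5'

# csv.reader accepts any iterable of lines; a one-element list is enough here.
fields = next(csv.reader([line]))

print(fields)
# ['Test1', 'Test2', 'Test3,Test4', 'Test5']
```

Note that the parser strips the surrounding quotes from the quoted field. Because the `""` delimiters in the question's sample are not standard quoting, that exact string may not split the same way the regex does.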