dilip dilip - 6 months ago 9
Python Question

Split on space but not if there is a colon followed by space or if there is a space in quotes

I have a string like this

str = 'name: phil age : 23 range: 33, 45 address: "main ave US"'


to be tokenized as

['name: phil', 'age : 23', 'range: 33, 45', 'address: "main ave US"']

Answer

This approach is brittle, and the regex may break if the input format changes even slightly:

Sample string 1

>>> import re
>>> str = 'name: phil age : 23 range: 33, 45 address: "main ave US"' 
>>> re.findall(r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))', str)
['name: phil ', 'age : 23 ', 'range: 33, 45 ', 'address: "main ave US"']

Sample string 2

>>> str = 'name: phil age : 23 range: 33, 45 address: "main ave US" abcd : xyz' 
>>> re.findall(r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))', str)
['name: phil ', 'age : 23 ', 'range: 33, 45 ', 'address: "main ave US"', 'abcd : xyz']

Sample string 3

>>> str = 'name: phil age : 23 range: 33, 45'
>>> re.findall(r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))', str)
['name: phil ', 'age : 23 ', 'range: 33, 45']

To trim the leading and trailing spaces of each match you can use this:

>>> list(map(lambda x:x.strip(), re.findall(r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))', str)))
['name: phil', 'age : 23', 'range: 33, 45']
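
If key/value pairs are more useful than flat strings, the stripped matches can be split on their first colon. This is only a sketch built on the same regex: the helper name parse_pairs and the constant PATTERN are made up for illustration, and it assumes keys never repeat and values never contain a colon.

```python
import re

# Same regex as in the answer; PATTERN is a hypothetical name for it.
PATTERN = r'\w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))'

def parse_pairs(text):
    """Hypothetical helper: turn each 'key: value' match into a dict entry."""
    pairs = {}
    for token in re.findall(PATTERN, text):
        # partition splits on the first colon only
        key, _, value = token.partition(':')
        # drop surrounding whitespace, then any enclosing double quotes
        pairs[key.strip()] = value.strip().strip('"')
    return pairs

text = 'name: phil age : 23 range: 33, 45 address: "main ave US"'
print(parse_pairs(text))
# {'name': 'phil', 'age': '23', 'range': '33, 45', 'address': 'main ave US'}
```

All values come back as strings, so converting '23' to an int or splitting '33, 45' into a list is left to the caller.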

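The regex used is: \w+\s*:\s*(?:"[^"]*"|.*?(?=\w+\s*:\s*|$))

It matches a key, a colon, and then either a double-quoted value or the shortest run of text up to the next key or the end of the string.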
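
To make the pieces easier to follow, here is the same pattern written with re.VERBOSE so each part can carry a comment. The name PAIR is just an illustrative choice, and the second input is a made-up example of the brittleness mentioned at the top: a value that itself contains a colon gets cut apart.

```python
import re

# The answer's regex, spread out with re.VERBOSE (whitespace and comments
# inside the pattern are ignored by the regex engine).
PAIR = re.compile(r'''
    \w+ \s* : \s*               # key, optional spaces, colon, optional spaces
    (?:
        "[^"]*"                 # either a double-quoted value (spaces allowed)
      |
        .*?                     # or the shortest possible run of characters
        (?=\w+ \s* : \s* | $)   # that is followed by the next key or the end
    )
''', re.VERBOSE)

text = 'name: phil age : 23 range: 33, 45 address: "main ave US"'
print(PAIR.findall(text))
# ['name: phil ', 'age : 23 ', 'range: 33, 45 ', 'address: "main ave US"']

# A value with a colon in it is mistaken for a new key:
print(PAIR.findall('time: 10:30 name: phil'))
# ['time: ', '10:30 ', 'name: phil']
```

Quoting such values in the input (time: "10:30") avoids the problem, since the quoted alternative is tried first.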