hiveship - 17 days ago
Python Question

Python split string and keep delimiters as a word

I am trying to split a string using multiple delimiters, and I need to keep the delimiters as words.
The delimiters I am using are all the punctuation marks and the space.

For example, the string:

Je suis, FOU et toi ?!


Should produce:

'Je'
'suis'
','
'FOU'
'et'
'toi'
'?'
'!'


I wrote:

import re

class Parser:
    def __init__(self):
        """Empty constructor"""

    def read(self, file_name):
        from string import punctuation
        with open(file_name, 'r') as file:
            for line in file:
                for word in line.split():
                    r = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
                    print(r.split(word))


But the result I got is:

['Je']
['suis', '']
['FOU']
['et']
['toi']
['', '']


The split seems to be correct, but the result list does not contain the delimiters.

Answer

You need to put your expression into a capturing group for re.split() to preserve the delimiters. I would not split on whitespace first; you can always remove whitespace-only strings later. If you want each punctuation character as a separate item, apply the + quantifier to the \s whitespace class only:

import re
from string import punctuation

# do this just once, not in a loop
pattern = re.compile(r'(\s+|[{}])'.format(re.escape(punctuation)))

# for each line
parts = [part for part in pattern.split(line) if part.strip()]

The list comprehension removes empty strings and anything that consists only of whitespace:

>>> import re
>>> from string import punctuation
>>> line = 'Je suis, FOU et toi ?!'
>>> pattern = re.compile(r'(\s+|[{}])'.format(re.escape(punctuation)))
>>> pattern.split(line)
['Je', ' ', 'suis', ',', '', ' ', 'FOU', ' ', 'et', ' ', 'toi', ' ', '', '?', '', '!', '']
>>> [part for part in pattern.split(line) if part.strip()]
['Je', 'suis', ',', 'FOU', 'et', 'toi', '?', '!']
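
To see why the group matters, here is a small comparison. It is only a sketch, and the variable names are chosen for illustration: the same pattern is compiled with and without the capturing parentheses. re.split() includes the text matched by a capturing group in its result and discards the separators otherwise.

```python
import re
from string import punctuation

line = 'Je suis, FOU et toi ?!'
escaped = re.escape(punctuation)

# Without a capturing group the separators are thrown away.
no_group = re.compile(r'\s+|[{}]'.format(escaped))
# With a capturing group re.split() keeps each separator in the result.
with_group = re.compile(r'(\s+|[{}])'.format(escaped))

# Filter out empty and whitespace-only strings in both cases.
dropped = [part for part in no_group.split(line) if part.strip()]
kept = [part for part in with_group.split(line) if part.strip()]

print(dropped)  # ['Je', 'suis', 'FOU', 'et', 'toi']
print(kept)     # ['Je', 'suis', ',', 'FOU', 'et', 'toi', '?', '!']
```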

Rather than split, you can also use re.findall() to find all words and individual punctuation characters:

pattern = re.compile(r'\w+|[{}]'.format(re.escape(punctuation)))

parts = pattern.findall(line)

This has the advantage that you don't need to filter out whitespace:

>>> pattern = re.compile(r'\w+|[{}]'.format(re.escape(punctuation)))
>>> pattern.findall(line)
['Je', 'suis', ',', 'FOU', 'et', 'toi', '?', '!']
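
Applied to the question's line-by-line loop, the findall() variant could be wrapped roughly as below. This is a sketch under assumptions: tokenize_lines is a hypothetical helper name, and it accepts any iterable of lines (an open file object works) so that the pattern is compiled once rather than inside the loop. One caveat: a character that is neither matched by \w nor listed in string.punctuation (which covers ASCII only) is silently skipped by findall(), whereas the split() version would leave it attached to the neighbouring word.

```python
import re
from string import punctuation

# Compiled once at import time, not per line or per word.
TOKEN = re.compile(r'\w+|[{}]'.format(re.escape(punctuation)))

def tokenize_lines(lines):
    """Yield one list of tokens per input line (hypothetical helper)."""
    for line in lines:
        yield TOKEN.findall(line)

# Any iterable of strings works; with a real file, pass the open file object.
# '\u20ac' is the euro sign: not a word character and not in string.punctuation.
for tokens in tokenize_lines(['Je suis, FOU et toi ?!\n', '10\u20ac ok\n']):
    print(tokens)
```

With a file, the call would look like `with open(file_name) as f: for tokens in tokenize_lines(f): ...`, which keeps the file handling outside the tokenizing logic.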