user6952520 - 27 days ago
Python Question

Splitting in Python on everything BUT a particular set of cases

I am not very good with regex, and it confuses me every time it comes up. Instead of writing a possibly incorrect regex string, I want to split a string a different way.

Let's say I have a string "hello, my name is Joseph! Haha, hello!" and I want to split it whenever I encounter a non-alphanumeric character. So then, in this case, I would obtain:

"hello"
"my"
"name"
"is"
"Joseph"
"Haha"
"hello"

Is there a way to do this without a regex string? As in: split whenever character != alphanumeric?

(Yes, I do realize it is probably not a smart thing to do to not correct my regex deficiency!)

Answer

Personally, I think it is appropriate to use simple and straightforward regexes for such simple tasks.

Compare an itertools solution and a re solution:

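import itertools
import re

s = "hello, my name is Joseph! Haha, hello!"
# itertools: group consecutive characters by str.isalnum, then keep every other group
print(["".join(x) for _, x in itertools.groupby(s, key=str.isalnum)][0::2])
# re: collect every run of word characters
print(re.findall(r"\w+", s))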


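I'd vote for the regex here. The pattern \w+ matches one or more word characters (letters, digits, and underscores), and re.findall returns all non-overlapping matches as a list. Note that \w also matches the underscore, which str.isalnum does not count as alphanumeric, so the two solutions differ on input that contains underscores.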
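If underscores should also act as separators, one option is to exclude them from the character class. The following is a minimal sketch, not part of the original answer; the sample string is made up for illustration.

```python
import re

# Hypothetical sample input containing underscores.
s = "snake_case and under_score, ok!"

# \w+ treats the underscore as a word character, so these tokens stay joined.
word_tokens = re.findall(r"\w+", s)

# [^\W_]+ matches word characters except the underscore.
alnum_tokens = re.findall(r"[^\W_]+", s)

print(word_tokens)   # ['snake_case', 'and', 'under_score', 'ok']
print(alnum_tokens)  # ['snake', 'case', 'and', 'under', 'score', 'ok']
```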

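itertools.groupby splits the string into consecutive runs of characters that share the same key value, here the result of str.isalnum. The runs therefore alternate between alphanumeric and non-alphanumeric chunks, and [0::2] keeps every other run starting from the first, which drops the separator chunks in this concrete case. If the string starts with a non-alphanumeric character, the slice keeps the separators instead of the words, so this approach is fragile; the regex solution is safer and easier.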
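For anyone who still wants to avoid regex, the groupby approach can be made independent of how the string starts by filtering on the key instead of slicing. This is a sketch rather than the original answer's code, and the helper name split_alnum is hypothetical.

```python
import itertools


def split_alnum(text):
    """Return the maximal runs of alphanumeric characters in text."""
    return [
        "".join(group)
        for is_alnum, group in itertools.groupby(text, key=str.isalnum)
        if is_alnum  # keep only the alphanumeric runs, wherever they fall
    ]


print(split_alnum("hello, my name is Joseph! Haha, hello!"))
# ['hello', 'my', 'name', 'is', 'Joseph', 'Haha', 'hello']
print(split_alnum("...hello, world"))
# ['hello', 'world']
```

Because the filter checks the key of each group, a leading separator no longer shifts which chunks are kept.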