user1928896 - 8 months ago
Python Question

How to split unicode strings character by character in python?

My website supports a number of Indian languages, and the user can switch language dynamically. When a user enters a string, I have to split it into its individual characters, so I'm looking for a common function that works for English and a select set of Indian languages. I have searched several sites, but there appears to be no common way to handle this requirement. There are language-specific implementations (for example, the Open-Tamil package implements get_letters for Tamil), but I could not find a general way to split or iterate over the characters of a unicode string that takes graphemes into consideration.

One of the many methods that I've tried:

name = u'தமிழ்'
print name
for i in list(name):
    print i

#expected output
தமிழ்
த
மி
ழ்

#actual output
தமிழ்
த
ம
ி
ழ
்

#Here is another example, using another Indian language
name = u'हिंदी'
print name
for i in list(name):
    print i

#expected output
हिंदी
हिं
दी

#actual output
हिंदी
ह
ि
ं
द
ी

Answer Source

The way to solve this is to group each character in the Unicode "L" (letter) category with the "M" (mark) category characters that follow it:

>>> import regex
>>> regex.findall(ur'\p{L}\p{M}*', name)
[u'\u0ba4', u'\u0bae\u0bbf', u'\u0bb4\u0bcd']
>>> for c in regex.findall(ur'\p{L}\p{M}*', name):
...   print c
... 
த
மி
ழ்
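
To see why grouping by category works, it can help to look at the general category of each code point in the string. The snippet below is only an illustrative sketch, written for Python 3 (where str is already unicode) and using the standard unicodedata module; the helper name describe is made up for this example.

```python
import unicodedata

def describe(text):
    """Return (code point, general category, name) for each code point in text."""
    return [
        ("U+%04X" % ord(ch),
         unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# Letters report a category starting with "L"; vowel signs and the
# virama report a category starting with "M".
for row in describe("தமிழ்"):
    print(row)
```

Every "M" code point here belongs visually to the letter before it, which is the grouping the pattern above performs.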

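Here regex is the third-party regex module, not the standard library's re module, which does not support \p{...} property classes.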
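
If installing a third-party module is not an option, roughly the same grouping can be sketched with the standard library alone. This is an assumption-laden alternative rather than the method from the answer: it is written for Python 3, the function name split_clusters is hypothetical, and unlike the pattern above it keeps non-letter characters (digits, spaces, punctuation) as clusters of their own instead of skipping them.

```python
import unicodedata

def split_clusters(text):
    """Group each code point with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # Categories Mn, Mc and Me all start with "M" (combining marks).
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

print(split_clusters("தமிழ்"))   # ['த', 'மி', 'ழ்']
print(split_clusters("हिंदी"))   # ['हिं', 'दी']
```

Like the letter-plus-marks pattern, this does not merge conjuncts joined by a virama: a sequence such as क्ष comes out as two clusters, so it may not match every script's notion of a "letter".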