NightFury13 NightFury13 - 5 months ago 27
Python Question

Python: convert space (and other special chars like . , ) to their corresponding unicode representations

I am trying to convert an Arabic phrase into its corresponding unicode representation string and it works fine for the Arabic text.

>>> a = ' مساء الخير'
>>> a.strip().decode('utf-8').encode('unicode-escape')
'\\u0645\\u0633\\u0627\\u0621 \\u0627\\u0644\\u062e\\u064a\\u0631'

However, I also want the space character to be converted to its unicode representation ('\u0020'). I am observing similar behaviour with other characters like '.', ',', etc. Ultimately, I want to obtain the unicode values of each of the characters in my string as a list. Simply splitting the current string on the delimiter '\u' gives an incorrect split, because the space character stays attached to the preceding unicode representation:

>>> a.strip().decode('utf-8').encode('unicode-escape').split('\\u')
['', '0645', '0633', '0627', '0621 ', '0627', '0644', '062e', '064a', '0631']

E.g., I want [ ... '0621', '0020' ...] instead of the current [ ... '0621 ' ...]


It is fine to strip the leading space if you do not need it, but if you want to keep the others, it is simpler to build a list of unicode characters from the string and process each character individually:

[ '%04x' % (ord(i),) for i in a.strip().decode('utf8') ]

or, if you prefer to use format (which is now the preferred style):

[ '{0:04x}'.format(ord(i)) for i in a.strip().decode('utf8') ]

Both yield the expected result:

['0645', '0633', '0627', '0621', '0020', '0627', '0644', '062e', '064a', '0631']
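
For anyone on Python 3, here is a minimal sketch of the same idea. It assumes the input is already a str (which is unicode text in Python 3, so no decode('utf8') step is needed); the helper name code_points is just made up for illustration.

```python
def code_points(text):
    """Return each character's code point as lowercase hex, zero-padded to at least 4 digits."""
    # strip() drops leading/trailing whitespace only; inner spaces are kept
    # and show up as '0020' like any other character.
    return [format(ord(ch), '04x') for ch in text.strip()]


phrase = ' مساء الخير'
print(code_points(phrase))
# ['0645', '0633', '0627', '0621', '0020', '0627', '0644', '062e', '064a', '0631']
```

Note that '04x' is a minimum width, so characters outside the Basic Multilingual Plane simply come out as five or six hex digits instead of four.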