supermario - 28 days ago
Python Question

How to get the Unicode representation of Arabic strings in Django?

I'm wondering how to get the Unicode representation of Arabic strings like

سلام
in Python?

The result should be
\u0633\u0644\u0627\u0645


I need this so that I can compare text retrieved from a MySQL database with data stored in a Redis cache.

Answer

Assuming you have an actual Unicode string, you can do this (Python 2):

# -*- coding: utf-8 -*-
s = u'سلام'
print s.encode('unicode-escape')    

output

\u0633\u0644\u0627\u0645

The # -*- coding: utf-8 -*- directive only tells the interpreter that the source file is UTF-8 encoded; it has no bearing on how the script itself handles Unicode.
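
The snippet above uses Python 2 syntax. In Python 3 every str is already Unicode and encode() returns bytes, so a rough equivalent might look like the sketch below. The variable name escaped is only illustrative.

```python
# Python 3 sketch: str is already Unicode, so no u'' prefix is needed.
s = 'سلام'

# encode() returns bytes; the escapes are pure ASCII, so decoding them
# as ASCII yields an ordinary str made of backslash escapes.
escaped = s.encode('unicode-escape').decode('ascii')
print(escaped)
```

The result is a 24-character ASCII string, which can be stored or compared without depending on the terminal's encoding.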


If your script is reading that Arabic string from a UTF-8 encoded source, the bytes will look like this:

\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85

You can convert that to Unicode like this:

data = '\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
s = data.decode('utf8')
print s
print s.encode('unicode-escape')  

output

سلام
\u0633\u0644\u0627\u0645

Of course, you do need to make sure that your terminal is set up to handle Unicode properly.
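
For the comparison described in the question, it may be simpler to normalise both values to Unicode text and compare those, rather than comparing escaped forms. The following is a minimal Python 3 sketch, assuming the cache client hands back raw UTF-8 bytes and the database driver hands back decoded text; whether that holds depends on how each client is configured. The helper to_text and the two sample values are hypothetical stand-ins, not part of any library.

```python
def to_text(value, encoding='utf-8'):
    """Return value as a Unicode str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Stand-ins for real lookups (assumption): a cache that returns raw bytes
# and a database driver that returns already-decoded text.
from_cache = b'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
from_db = '\u0633\u0644\u0627\u0645'

# Compare only after both sides are the same type.
same = to_text(from_cache) == to_text(from_db)
print(same)
```

Decoding once at the boundary and comparing str values avoids the case where a bytes object and a str holding the same text compare as unequal.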

Note that

'\u0633\u0644\u0627\u0645'

is a plain (byte) string containing 24 bytes, whereas

u'\u0633\u0644\u0627\u0645'

is a Unicode string containing 4 Unicode characters.
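
The same distinction can be checked in Python 3, where a raw bytes literal plays the role of the plain Python 2 string. This is a small sketch; the variable names are illustrative.

```python
# Raw bytes literal: each character is stored as a backslash, 'u' and
# four hex digits, so nothing is interpreted as an escape.
escaped_bytes = rb'\u0633\u0644\u0627\u0645'

# Ordinary str literal: each \uXXXX escape becomes one character.
text = '\u0633\u0644\u0627\u0645'

print(len(escaped_bytes))
print(len(text))

# The unicode-escape codec turns the escaped bytes back into text.
restored = escaped_bytes.decode('unicode-escape')
print(restored == text)
```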

You may find this article helpful: Pragmatic Unicode, written by Stack Overflow veteran Ned Batchelder.
