Waheeb Al-Abyadh - 3 months ago
Python Question

Handling Arabic text in Python

I have the following Python 2.7 code:

mydoclist = ['جوليا تحبني اكثر من ليندا','جين تحبني اكثر من جوليا','احمد يحب كرة السلة اكثر من كرة الطاولة']

from collections import Counter

for doc in mydoclist:
    tf = Counter()
    for word in doc.split():
        tf[word] += 1
    print tf.items()


I got the following output:

[(u'\u062a\u062d\u0628\u0646\u064a', 1), (u'\u0645\u0646', 1), (u'\u062c\u0648\u0644\u064a\u0627', 1), (u'\u0644\u064a\u0646\u062f\u0627', 1), (u'\u0627\u0643\u062b\u0631', 1)]
[('\xd8\xac\xd9\x8a\xd9\x86', 1), ('\xd9\x85\xd9\x86', 1), ('\xd8\xac\xd9\x88\xd9\x84\xd9\x8a\xd8\xa7', 1), ('\xd8\xaa\xd8\xad\xd8\xa8\xd9\x86\xd9\x8a', 1), ('\xd8\xa7\xd9\x83\xd8\xab\xd8\xb1', 1)]
[('\xd8\xa7\xd9\x83\xd8\xab\xd8\xb1', 1), ('\xd8\xa7\xd8\xad\xd9\x85\xd8\xaf', 1), ('\xd9\x8a\xd8\xad\xd8\xa8', 1), ('\xd8\xa7\xd9\x84\xd8\xb7\xd8\xa7\xd9\x88\xd9\x84\xd8\xa9', 1), ('\xd9\x83\xd8\xb1\xd8\xa9', 2), ('\xd8\xa7\xd9\x84\xd8\xb3\xd9\x84\xd8\xa9', 1), ('\xd9\x85\xd9\x86', 1)]


Why can't I see the Arabic words? I want to see the Arabic words themselves instead of the escape codes that appear in the output. Thanks.

Answer

When Python prints a list, it passes every item through repr, which produces the "\u..." (or "\x...") escapes you are seeing. Have a look at the tutorial section on Unicode strings or, better, the Unicode HOWTO; they helped me a lot. For source code containing non-ASCII characters you should declare an encoding (most likely "utf-8"). You also probably want to mark strings containing such characters as unicode (u"..." instead of "..."):

# -*- coding: utf-8 -*-

from collections import Counter


mydoclist = [u'جوليا تحبني اكثر من ليندا',u'جين تحبني اكثر من جوليا',u'احمد يحب كرة السلة اكثر من كرة الطاولة']

for doc in mydoclist:
    tf = Counter()
    for word in doc.split():
        tf[word] += 1
    # Join the items into one unicode string so print shows the text, not its repr
    print u", ".join(u"(%i: %s)" % (n, s) for (s, n) in tf.items())

This works for me.
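If you cannot change the literals (for example, the strings arrive as UTF-8 bytes from a file), the same idea can be sketched as below: decode to text first, then build one string to print instead of printing the list. This is only a sketch, not the one right way to do it. The helper names to_text and format_counts are made up for illustration, and it assumes the input bytes are UTF-8 and that your terminal can display Arabic. The __future__ imports are there so the same code runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

from collections import Counter


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings (hypothetical helper).

    Assumption: byte strings are UTF-8 encoded unless told otherwise.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def format_counts(doc):
    """Count the words in doc and format the counts as one text string."""
    tf = Counter(to_text(doc).split())
    # Sort by word so the order of the output is stable.
    return ", ".join("(%d: %s)" % (n, word) for word, n in sorted(tf.items()))


word = to_text(b"\xd9\x85\xd9\x86")  # UTF-8 bytes taken from the question's output
print(repr([word]))  # on Python 2 the list repr shows \u escapes
print(word)          # printing the text itself shows the Arabic word
print(format_counts("احمد يحب كرة السلة اكثر من كرة الطاولة"))
```

Sorting the items is optional; I only added it so the words come out in the same order on every run.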