Eman Eman - 3 months ago
Python Question

regexp_tokenize and Arabic text

I'm using regexp_tokenize to return tokens from an Arabic text without any punctuation marks:

import re, string, sys
from nltk.tokenize import regexp_tokenize

def PreProcess_text(Input):
    # Split on Arabic/Latin punctuation and any trailing whitespace,
    # returning the text between the separators (gaps=True).
    tokens = regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
    return tokens

H = raw_input('H:')
Cleand = PreProcess_text(H)
print '\n'.join(Cleand)


It works fine, but the output is garbled when I print Arabic text.

The output for the text ايمان،سعد is:

?يم

?
?
?


But if the text is in English, even with Arabic punctuation marks, it prints the right result.

The output for the text hi،eman is:

hi
eman

Answer

When you use raw_input in Python 2, you get a byte string, so each Arabic character is encoded as several UTF-8 bytes and the character class in your regex ends up matching inside those byte sequences. You need to decode the input into a Unicode string first:

H = raw_input('H:').decode('utf8')

And you may keep your regex:

tokens=regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
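A minimal sketch of the decode-then-tokenize idea, written in Python 3 syntax so it runs standalone, and using the standard re module in place of NLTK's regexp_tokenize (which wraps re internally) so NLTK is not required. The sample strings are assumptions, not the asker's exact input:

```python
# -*- coding: utf-8 -*-
import re

# Same separator pattern as in the question:
# Arabic/Latin punctuation plus any trailing whitespace.
PUNCT = r'[،؟!.؛]\s*'

def tokenize(text):
    # Equivalent of regexp_tokenize(..., gaps=True): split on the
    # separators and drop any empty strings left at the edges.
    return [t for t in re.split(PUNCT, text) if t]

# Simulating the Python 2 situation: raw_input returns UTF-8 bytes.
raw = 'ايمان،سعد'.encode('utf8')   # hypothetical byte input

# Decoding first lets the character class match whole characters,
# not individual bytes of a multi-byte sequence.
decoded = raw.decode('utf8')
print(tokenize(decoded))   # → ['ايمان', 'سعد']
```

In Python 2 the same decode step applies to the value returned by raw_input; once the input is Unicode, the original regex works unchanged.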