Eman Eman - 9 months ago 82
Python Question

regexp_tokenize and Arabic text

I'm using regexp_tokenize to return tokens from an Arabic text without any punctuation marks:

import re,string,sys
from nltk.tokenize import regexp_tokenize

def PreProcess_text(Input):
    tokens = regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
    return tokens

H = raw_input('H:')
Cleaned = PreProcess_text(H)
print '\n'.join(Cleaned)

It works fine, but the problem appears when I try to print the tokens. For Arabic input the printed output is garbled, but if the text is in English, even with Arabic punctuation marks, it prints the right result.

When you use raw_input, the input comes back as a byte string (bytes in your console's encoding), not as a Unicode string.

You need to decode it into a Unicode string first, e.g. (assuming a UTF-8 terminal):

H = raw_input('H:').decode('utf-8')
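To see why decoding matters, here is a minimal sketch using only the standard re module (the Arabic sample sentence is made up for illustration):

```python
import re

text = u"مرحبا، كيف حالك؟"   # sample with Arabic comma and question mark
data = text.encode("utf-8")   # what raw_input hands you: raw bytes

# A byte-level character class contains the individual UTF-8 bytes of
# '،' and '؟'; those bytes also occur inside ordinary Arabic letters,
# so the split lands in the middle of characters and garbles the text.
byte_tokens = re.split(u"[،؟]".encode("utf-8") + b"\\s*", data)

# On the decoded Unicode string the same class matches whole characters,
# so the split happens only at the actual punctuation marks.
uni_tokens = [t for t in re.split(u"[،؟]\\s*", text) if t]
print(uni_tokens)
```

Running this shows many more (broken) byte tokens than the two clean Unicode tokens.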
Then make the regex a Unicode pattern as well, so the Arabic punctuation marks are matched as whole characters rather than as individual bytes:

tokens = regexp_tokenize(Input, ur'[،؟!.؛]\s*', gaps=True)
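For reference, a self-contained Python 3 sketch of the whole pipeline: input() already returns a Unicode str there, so no decoding is needed, and re.split stands in for regexp_tokenize with gaps=True (which, by default, discards empty tokens the same way):

```python
import re

# Arabic and Latin sentence-final punctuation, as in the question's regex.
PUNCT = re.compile(r"[،؟!.؛]\s*")

def preprocess_text(text):
    # Split on the punctuation and drop empty tokens, mirroring
    # regexp_tokenize(text, pattern, gaps=True).
    return [tok for tok in PUNCT.split(text) if tok]

h = "مرحبا، كيف حالك؟"   # sample input; in the real script this comes from input()
print("\n".join(preprocess_text(h)))
```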