Eman Eman - 1 year ago
Python Question

regexp_tokenize and Arabic text

I'm using regexp_tokenize from NLTK to return tokens from an Arabic text without any punctuation marks:

import re,string,sys
from nltk.tokenize import regexp_tokenize

def PreProcess_text(Input):
    tokens = regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
    return tokens

H = raw_input('H:')
Cleand = PreProcess_text(H)
print '\n'.join(Cleand)

It works fine, but the problem appears when I try to print the tokens.

For Arabic input, the printed tokens come out as raw byte escape sequences instead of readable Arabic text,
but if the text is in English, even with Arabic punctuation marks, it prints the right result.

Answer Source

When you use raw_input, the input comes back as a byte string, so the Arabic characters are stored as multi-byte sequences rather than single characters.

You need to convert it into a Unicode string with decode (assuming a UTF-8 terminal):

H = raw_input('H:').decode('utf-8')
And you may keep your regex:

tokens=regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
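Putting it together, here is a minimal self-contained sketch of the fix. It uses re.split, which behaves like regexp_tokenize with gaps=True, so it runs without NLTK; the sample sentence and variable names are made up for illustration:

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample input; in the original script this came from raw_input.
raw = 'كيف حالك؟ أنا بخير، شكرا.'

# In Python 2, raw_input returns bytes, which must be decoded first
# (assuming a UTF-8 terminal). In Python 3, input() already returns str.
if isinstance(raw, bytes):
    raw = raw.decode('utf-8')

# Split on Arabic/Latin punctuation plus any trailing whitespace --
# the same pattern the question passes to regexp_tokenize(gaps=True).
tokens = [t for t in re.split(r'[،؟!.؛]\s*', raw) if t]

print('\n'.join(tokens))
```

With the decode in place, the regex engine sees real Unicode characters, so the Arabic punctuation class matches correctly instead of operating on individual UTF-8 bytes.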