Shahinism Shahinism - 4 months ago 50
Python Question

String.maketrans for English and Persian numbers

I have a function like this:

persian_numbers = '۱۲۳۴۵۶۷۸۹۰'
english_numbers = '1234567890'
arabic_numbers = '١٢٣٤٥٦٧٨٩٠'

english_trans = string.maketrans(english_numbers, persian_numbers)
arabic_trans = string.maketrans(arabic_numbers, persian_numbers)

text.translate(english_trans)
text.translate(arabic_trans)


I want it to translate all Arabic and English numbers to Persian. But Python says:

english_translate = string.maketrans(english_numbers, persian_numbers)
ValueError: maketrans arguments must have same length


I tried to encode strings with Unicode
utf-8
but I always got some errors! Sometimes the problem is Arabic string instead! Do you know a better solution for this job?

EDIT:



It seems the problem is Unicode characters length in ASCII. An Arabic number like '۱' is two character -- that I find out with
ord()
. And the length problem starts from here :-(

Answer

Unicode objects can interpret these digits (arabic and persian) as actual digits - no need to translate them by using character substitution.

EDIT - I came out with a way to make your replacement using Python2 regular expressions:

# coding: utf-8

import re

# Attention: while the characters for the strings bellow are 
# dislplayed indentically, inside they are represented
# by distinct unicode codepoints

persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
arabic_numbers  = u'١٢٣٤٥٦٧٨٩٠'
english_numbers = u'1234567890'


persian_regexp = u"(%s)" %  u"|".join(persian_numbers)
arabic_regexp = u"(%s)" % u"|".join(arabic_numbers)

def _sub(match_object, digits):
    return english_numbers[digits.find(match_object.group(0))]

def _sub_arabic(match_object):
    return _sub(match_object, arabic_numbers)

def _sub_persian(match_object):
    return _sub(match_object, persian_numbers)


def replace_arabic(text):
    return re.sub(arabic_regexp, _sub_arabic, text)

def replace_persian(text):
    return re.sub(arabic_regexp, _sub_persian, text)

Attempt that the "text" parameter must be unicode itself.

(also this code could be shortened by using lambdas and combining some expressions in a single line, but there is no point in doing so, but for loosing readability)

It should work to you up to here, but please read on the original answer I had posted

-- original answer

So, if you instantiate your variables as unicode (prepending an u to the quote char), they are correctly understood in Python:

>>> persian_numbers = u'۱۲۳۴۵۶۷۸۹۰'
>>> english_numbers = u'1234567890'
>>> arabic_numbers  = u'١٢٣٤٥٦٧٨٩٠'
>>> 
>>> print int(persian_numbers)
1234567890
>>> print int(english_numbers)
1234567890
>>> print int(arabic_numbers)
1234567890
>>> persian_numbers.isdigit()
True
>>> 

By the way, the "maketrans" method does not exist for unicode objects (in Python2 - see the comments).

It is very important to understand the basics about unicode - for everyone, even people writing English only programs who think they will never deal with any char out of the 26 latin letters. When writing code that will deal with different chars it is vital - the program can't possibly work without you knowing what you are doing except by chance.

A very good article to read is http://www.joelonsoftware.com/articles/Unicode.html - please read it now. You can keep in mind, while reading it, that Python allows one to translate unicode characters to a string in any "physical" encoding by using the "encode" method of unicode objects.

>>> arabic_numbers  = u'١٢٣٤٥٦٧٨٩٠'
>>> len(arabic_numbers)
10
>>> enc_arabic = arabic_numbers.encode("utf-8")
>>> print enc_arabic
١٢٣٤٥٦٧٨٩٠
>>> len(enc_arabic)
20
>>> int(enc_arabic)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\xd9\xa1\xd9\xa2\xd9\xa3\xd9\xa4\xd9\xa5\xd9\xa6\xd9\xa7\xd9\xa8\xd9\xa9\xd9\xa0'

Thus, the characters loose their sense as "single entities" and as digits when encoding - the encoded object (str type in Python 2.x) is justa strrng of bytes - which nonetheless is needed when sending these characters to any output from the program - be it console, GUI Window, database, html code, etc...