gerrit - 1 month ago
Python Question

Why is `'↊'.isnumeric()` false?

According to the Official Unicode Consortium code chart, all of these are numeric:

⅐ ⅑ ⅒ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞ ⅟
Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ Ⅼ Ⅽ Ⅾ Ⅿ
ⅰ ⅱ ⅲ ⅳ ⅴ ⅵ ⅶ ⅷ ⅸ ⅹ ⅺ ⅻ ⅼ ⅽ ⅾ ⅿ
ↀ ↁ ↂ Ↄ ↄ ↅ ↆ ↇ ↈ ↉ ↊ ↋


However, when I ask Python to tell me which ones are numeric, they all are, except for four:

In [252]: print([k for k in "⅐⅑⅒⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞⅟ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩⅪⅫⅬⅭⅮⅯⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹⅺⅻⅼⅽⅾⅿↀↁↂↃↄↅↆↇↈ↉↊↋" if not k.isnumeric()])
['Ↄ', 'ↄ', '↊', '↋']


Those are:


  • Ↄ Roman Numeral Reversed One Hundred

  • ↄ Latin Small Letter Reversed C

  • ↊ Turned Digit Two

  • ↋ Turned Digit Three



Why does Python consider those to be not numeric?

Answer

str.isnumeric is documented to be true for "all characters that have the Unicode numeric value property".

The canonical reference for that property is the Unicode Character Database. The information we need can be dug out of http://www.unicode.org/Public/9.0.0/ucd/UnicodeData.txt (warning: 1.5MB text file), which is the latest version at time of writing (late 2016). It's a little tricky to read (the format is documented in UAX#44). First, here is its entry for a character that is numeric, U+3023 HANGZHOU NUMERAL THREE (〣):

3023;HANGZHOU NUMERAL THREE;Nl;0;L;;;;3;N;;;;;

The ninth semicolon-separated field (field 8 in the zero-based numbering that UAX#44 uses) is the "numeric value" property; in this case, its value is 3, consistent with the name of the character. Python's str.isnumeric is true for a character if and only if this field is nonempty. The value can be queried directly with unicodedata.numeric.
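To make the field layout concrete, here is a minimal sketch that splits the entry quoted above on semicolons and compares the result with what the standard-library unicodedata module reports. The variable names (entry, fields, numeric_field) are just for illustration; the numeric value sits at index 8 of the split list.

```python
import unicodedata

# Entry for U+3023, copied verbatim from UnicodeData.txt (Unicode 9.0.0).
entry = "3023;HANGZHOU NUMERAL THREE;Nl;0;L;;;;3;N;;;;;"
fields = entry.split(";")

code_point = int(fields[0], 16)  # index 0: code point, in hexadecimal
numeric_field = fields[8]        # index 8: the "numeric value" property

ch = chr(code_point)
print(repr(numeric_field))       # raw text of the field
print(unicodedata.numeric(ch))   # the same value as Python sees it (a float)
print(ch.isnumeric())            # True because the field is nonempty
```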

The third semicolon-separated field is a two-character code giving the "general category"; in this case, "Nl". Most, but not all, of the characters with a numeric value are in one of the "number" categories (the first letter of the category code is an N). The exceptions all appear to be anonymous hanzi.
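As a rough illustration of the category point, the sketch below prints the general category and numeric value for a few sample characters. The sample set is my own choice, not taken from the question: three characters in "number" categories plus 三 (U+4E09), a hanzi that has a numeric value but is categorised as a letter.

```python
import unicodedata

# Sample characters chosen for illustration only.
samples = "〣Ⅻ⅓三"

# Map each character to (general category, numeric value).
info = {ch: (unicodedata.category(ch), unicodedata.numeric(ch)) for ch in samples}

for ch, (category, value) in info.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: category={category}, numeric={value}")
```

The first three report a category starting with N, while 三 reports Lo even though str.isnumeric is true for it.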

Now, the characters you are asking about:

2183;ROMAN NUMERAL REVERSED ONE HUNDRED;Lu;0;L ;;;;;N;;;    ;2184;
2184;LATIN SMALL LETTER REVERSED C     ;Ll;0;L ;;;;;N;;;2183;    ;2183
218A;TURNED DIGIT TWO                  ;So;0;ON;;;;;N;;;    ;    ;
218B;TURNED DIGIT THREE                ;So;0;ON;;;;;N;;;    ;    ;

These characters do not have a numeric value assigned, so Python's behavior is correct-as-documented.
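You can confirm this from Python itself. The sketch below assumes an interpreter whose Unicode tables already include U+218A and U+218B (they are fairly recent additions); the default arguments to unicodedata.name and unicodedata.numeric keep it from raising where a name or a value is missing.

```python
import unicodedata

# The four characters from the question, written as escapes for clarity.
suspects = "\u2183\u2184\u218a\u218b"

results = []
for ch in suspects:
    name = unicodedata.name(ch, "<no name in this Unicode version>")
    value = unicodedata.numeric(ch, None)  # None when no numeric value is assigned
    results.append((ch, unicodedata.category(ch), value, ch.isnumeric()))
    print(f"U+{ord(ch):04X} {name}: category={results[-1][1]}, "
          f"numeric={value}, isnumeric={results[-1][3]}")
```

Each line should show numeric=None and isnumeric=False, matching the empty numeric fields in the entries above.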

Note: per https://docs.python.org/3.6/whatsnew/3.6.html, Python will only be updated to Unicode 9.0.0 in the 3.6 release; however, AFAICT these characters have not changed in quite some time.
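If you want to know which version of the Unicode Character Database your own interpreter was built against, the unicodedata module exposes it as unidata_version. A quick check, with variable names of my own choosing:

```python
import sys
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
ucd_version = unicodedata.unidata_version
python_version = ".".join(str(part) for part in sys.version_info[:3])

print(f"Python {python_version} uses Unicode {ucd_version}")
```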
