Andrew Fount - 1 month ago
Python Question

Detect same words using different alphabets?

Python treats the words

МАМА
and
MAMA
differently because one of them is written in the Latin alphabet and the other in Cyrillic.

How can I make Python treat them as the same string?

I only care about letters that look the same in both alphabets ("allomorphs", more commonly called homoglyphs).

Answer

Transliteration is not going to help (it will turn Cyrillic P into Latin R). At first glance, the Unicode compatibility forms (NFKD or NFKC) look hopeful, but they leave U+041C (CYRILLIC CAPITAL LETTER EM) as U+041C instead of mapping it to U+004D (LATIN CAPITAL LETTER M), so that won't work either.
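
A quick way to check the normalization claim yourself is the standard-library unicodedata module. This is only a small sketch; the variable names are mine, and escapes are used so the two alphabets cannot be mixed up in the source.

```python
import unicodedata

cyrillic_em = "\u041c"  # CYRILLIC CAPITAL LETTER EM
latin_m = "\u004d"      # LATIN CAPITAL LETTER M

print(unicodedata.name(cyrillic_em))
print(unicodedata.name(latin_m))

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, cyrillic_em)
    # Report whether the letter came back unchanged, and whether it
    # became the Latin letter (it should not, for any of the forms).
    print(form, normalized == cyrillic_em, normalized == latin_m)
```

Every form should report that the Cyrillic letter comes back unchanged, which is why normalization alone cannot make the two spellings compare equal.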

The only solution is to build your own table of these look-alike letters (homoglyphs), and translate all strings into a canonical form before comparing.
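
A minimal sketch of that approach, using str.maketrans and str.translate. The table and the helper names (CYRILLIC_TO_LATIN, canonical, same_word) are my own illustration, not an existing library. The table is deliberately incomplete: it covers only some capital letters, and you would need to extend it (lowercase, other scripts) for real data.

```python
# Hypothetical, incomplete table: Cyrillic capital -> Latin look-alike.
# Escapes are used so the two alphabets cannot be confused in the source.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0412": "B",  # CYRILLIC CAPITAL LETTER VE
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u041a": "K",  # CYRILLIC CAPITAL LETTER KA
    "\u041c": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u041d": "H",  # CYRILLIC CAPITAL LETTER EN
    "\u041e": "O",  # CYRILLIC CAPITAL LETTER O
    "\u0420": "P",  # CYRILLIC CAPITAL LETTER ER
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u0422": "T",  # CYRILLIC CAPITAL LETTER TE
    "\u0425": "X",  # CYRILLIC CAPITAL LETTER HA
})


def canonical(text):
    """Replace the known Cyrillic look-alikes with their Latin twins."""
    return text.translate(CYRILLIC_TO_LATIN)


def same_word(a, b):
    """Compare two strings after mapping both to the canonical form."""
    return canonical(a) == canonical(b)


cyrillic_mama = "\u041c\u0410\u041c\u0410"  # looks like MAMA, but is Cyrillic
print(cyrillic_mama == "MAMA")           # False: different code points
print(same_word(cyrillic_mama, "MAMA"))  # True: equal after translation
```

Mapping everything to Latin is an arbitrary choice; mapping to Cyrillic would work just as well, as long as both strings go through the same table before the comparison.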

Note: When I wrote "Cyrillic P", I cheated and typed the Latin look-alike; I don't have an easy way to enter Cyrillic.