AlexLordThorsen - 2 months ago
Python Question

Why is 'é' and 'é' encoding to different bytes?

Question



Why is the same character encoding to different bytes in different parts of my code base?

Context



I have a unit test that generates a temporary file tree and then checks to make sure my scan actually finds the file in question.

def test_unicode_file_name():
    test_regex = "é"
    file_tree = {"files": ["é"]} # File created with python.open()
    with TempTree(file_tree) as tmp_tree:
        import pdb; pdb.set_trace()
        result = tasks.find_files(test_regex, root_path=tmp_tree.root_path)
        expected = [os.path.join(tmp_tree.root_path, "é")]
        assert result == expected


Function that's failing



for dir_entry in scandir(current_path):
    if dir_entry.is_dir():
        dirs_to_search.append(dir_entry.path)

    if dir_entry.is_file():
        testing = dir_entry.name
        if filename_regex.match(testing):
            results.append(dir_entry.path)


PDB Session



When I started digging into things I found that the test character (copied from my unit test) and the character in dir_entry.name encoded to different bytes.

(Pdb) testing
'é'
(Pdb) 'é'
'é'
(Pdb) testing == 'é'
False
(Pdb) testing in 'é'
False
(Pdb) type(testing)
<class 'str'>
(Pdb) type('é')
<class 'str'>
(Pdb) repr(testing)
"'é'"
(Pdb) repr('é')
"'é'"
(Pdb) 'é'.encode("utf-8")
b'\xc3\xa9'
(Pdb) testing.encode("utf-8")
b'e\xcc\x81'

Answer

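Your operating system (macOS, at a guess) has converted the filename 'é' to Unicode Normalization Form D (NFD), decomposing it into an unaccented 'e' followed by a combining acute accent. The two strings render identically but contain different code points, which is why they compare unequal and encode to different bytes. You can see this clearly with a quick session in the Python interpreter: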

>>> import unicodedata
>>> e1 = b'\xc3\xa9'.decode()
>>> e2 = b'e\xcc\x81'.decode()
>>> [unicodedata.name(c) for c in e1]
['LATIN SMALL LETTER E WITH ACUTE']
>>> [unicodedata.name(c) for c in e2]
['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
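
The same difference can be checked without touching the file system. This is a minimal sketch; the variable names `composed` and `decomposed` are only illustrative, and the strings are written with escapes so the two forms cannot be confused when copied:

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (one code point, NFC form)
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT (two code points, NFD form)

# The strings look the same when printed but are not equal.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# Normalizing both sides to the same form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```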

To ensure that you're comparing like with like, you can convert the filename given by dir_entry.name back to Unicode Normalization Form C (NFC) before testing it against your regex:

import unicodedata

for dir_entry in scandir(current_path):
    if dir_entry.is_dir():
        dirs_to_search.append(dir_entry.path)

    if dir_entry.is_file():
        testing = unicodedata.normalize('NFC', dir_entry.name)
        if filename_regex.match(testing):
            results.append(dir_entry.path)
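
If the pattern itself might arrive in decomposed form (for example, pasted from a file listing), it may be safer to normalize both sides. The following is a rough, self-contained sketch rather than the asker's actual `find_files`: the names `nfc` and `find_files_normalized` are hypothetical, and it assumes Python 3.6+ with `os.scandir` from the standard library.

```python
import os
import re
import unicodedata


def nfc(text):
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def find_files_normalized(pattern, root_path):
    """Walk root_path and return paths whose file name matches pattern.

    Both the pattern and each file name are converted to NFC first, so a
    precomposed and a decomposed spelling of the same name compare equal.
    """
    # Assumption: normalizing the pattern text is acceptable for simple
    # patterns; a pattern that relies on combining marks inside a
    # character class could change meaning.
    filename_regex = re.compile(nfc(pattern))
    results = []
    dirs_to_search = [root_path]

    while dirs_to_search:
        current_path = dirs_to_search.pop()
        with os.scandir(current_path) as entries:
            for dir_entry in entries:
                if dir_entry.is_dir(follow_symlinks=False):
                    dirs_to_search.append(dir_entry.path)
                elif dir_entry.is_file():
                    if filename_regex.match(nfc(dir_entry.name)):
                        # The path is returned exactly as stored on disk.
                        results.append(dir_entry.path)
    return results
```

Note that the returned paths keep whatever form the file system reports, so a test that compares them against a path built from a literal 'é' would likely also need to normalize both sides before asserting equality.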