AlexLordThorsen - 6 months ago
Python Question

Why do 'é' and 'é' encode to different bytes?

Question



Why is the same character encoding to different bytes in different parts of my code base?

Context



I have a unit test that generates a temporary file tree and then checks to make sure my scan actually finds the file in question.

def test_unicode_file_name():
    test_regex = "é"
    file_tree = {"files": ["é"]}  # File created with python.open()
    with TempTree(file_tree) as tmp_tree:
        import pdb; pdb.set_trace()
        result = tasks.find_files(test_regex, root_path=tmp_tree.root_path)
        expected = [os.path.join(tmp_tree.root_path, "é")]
        assert result == expected


Function that's failing



for dir_entry in scandir(current_path):
    if dir_entry.is_dir():
        dirs_to_search.append(dir_entry.path)

    if dir_entry.is_file():
        testing = dir_entry.name
        if filename_regex.match(testing):
            results.append(dir_entry.path)


PDB Session



When I started digging into things, I found that the test character (copied from my unit test) and the character in dir_entry.name encoded to different bytes.

(Pdb) testing
'é'
(Pdb) 'é'
'é'
(Pdb) testing == 'é'
False
(Pdb) testing in 'é'
False
(Pdb) type(testing)
<class 'str'>
(Pdb) type('é')
<class 'str'>
(Pdb) repr(testing)
"'é'"
(Pdb) repr('é')
"'é'"
(Pdb) 'é'.encode("utf-8")
b'\xc3\xa9'
(Pdb) testing.encode("utf-8")
b'e\xcc\x81'

Answer

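Your operating system (macOS, at a guess) has converted the filename 'é' to Unicode Normalization Form D (NFD), decomposing it into an unaccented 'e' followed by a combining acute accent. The string literal in your test is in the composed form (NFC), which is a single code point, so the two strings look identical but contain different code points. You can see this clearly with a quick session in the Python interpreter: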

>>> import unicodedata
>>> e1 = b'\xc3\xa9'.decode()
>>> e2 = b'e\xcc\x81'.decode()
>>> [unicodedata.name(c) for c in e1]
['LATIN SMALL LETTER E WITH ACUTE']
>>> [unicodedata.name(c) for c in e2]
['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
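The two spellings can be converted into each other with unicodedata.normalize. The following is a minimal sketch rather than part of the original code; the variable names composed and decomposed are illustrative only, and the characters are written as escapes so that the form of each string is unambiguous.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (NFC form)
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT (NFD form)

# The strings render identically but differ in code points and length.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# Normalizing either string to a common form makes them compare equal.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```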

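To ensure that you're comparing like with like, you can convert the filename given by dir_entry.name back to Normalization Form C (NFC) before testing it against your regex. This assumes the regex pattern itself is in NFC, as the literal in your test appears to be: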

import unicodedata

for dir_entry in scandir(current_path):
    if dir_entry.is_dir():
        dirs_to_search.append(dir_entry.path)

    if dir_entry.is_file():
        testing = unicodedata.normalize('NFC', dir_entry.name)
        if filename_regex.match(testing):
            results.append(dir_entry.path)
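If the pattern might also arrive in decomposed form (for example, if it was read from a file name), it may be safer to normalize both sides. The sketch below is one possible way to do that; nfc and matching_names are hypothetical helpers, not part of your code base or the standard library. Normalizing the pattern text is a simplification that works for literal patterns like the one in the test, but it will not compose a combining mark that follows a regex construct such as a character class.

```python
import re
import unicodedata


def nfc(text):
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def matching_names(pattern, names):
    """Hypothetical helper: return names whose NFC form matches the NFC pattern."""
    regex = re.compile(nfc(pattern))
    return [name for name in names if regex.match(nfc(name))]


# The pattern is written decomposed; the names mix both forms.
names = ["\u00e9", "e\u0301", "plain.txt"]
found = matching_names("e\u0301", names)
print(len(found))  # 2: both spellings of 'é' match
```

The original, un-normalized names are returned, so they can still be joined to a directory path and opened as they exist on disk.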