Arjen Meek - 1 month ago
Python Question

Processing non-UTF-8 POSIX filenames using Python pathlib?

I'm trying to use the pathlib module, part of the standard library since Python 3.4, to find and manipulate file paths. Treating paths in an object-oriented way is an improvement over the os.path-style functions, but I'm having trouble with some more exotic filenames on POSIX filesystems, specifically files whose names contain bytes that cannot be decoded as UTF-8:

>>> pathlib.PosixPath(b'\xe9')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/pathlib.py", line 969, in __new__
    self = cls._from_parts(args, init=False)
  File "/usr/lib/python3.5/pathlib.py", line 651, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/usr/lib/python3.5/pathlib.py", line 643, in _parse_args
    % type(a))
TypeError: argument should be a path or str object, not <class 'bytes'>

>>> b'\xe9'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: unexpected end of data


The problem is that such files can exist on a POSIX filesystem, and I'd like my application to handle any filesystem-valid filename rather than raise errors or behave unpredictably.

I can get a PosixPath object for such a file by calling .iterdir() on its parent directory. But I have yet to find a way to build one from a full path supplied as a bytes object, which is hard to avoid when loading paths from a source that supports every filesystem-valid byte value (such as a database or a file containing NUL-separated paths).

Is there a way to do this that I'm not aware of? Or, if it's really not possible: is this by design, or could it be considered a deficiency in the standard library that might warrant a bug report?

I did find a related bug report, but that issue concerned documentation incorrectly mentioning that arguments of class 'bytes' were allowed.

wim
Answer

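I think you can get what you want by decoding the bytes with os.fsdecode first. On POSIX it uses the filesystem encoding with the surrogateescape error handler, so undecodable bytes survive as lone surrogates in the resulting str: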

import os
from pathlib import PosixPath

PosixPath(os.fsdecode(b'\xe9'))
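
To check that nothing is lost along the way, the conversion can be run in both directions. The following is a minimal sketch, assuming a POSIX system and Python 3.6 or later; path_from_bytes is a hypothetical helper name, not part of pathlib, and PurePosixPath is used so the example does not touch the filesystem.

```python
import os
from pathlib import PurePosixPath


def path_from_bytes(raw):
    """Hypothetical helper: build a path object from a raw bytes filename."""
    # os.fsdecode maps undecodable bytes to lone surrogates (surrogateescape)
    return PurePosixPath(os.fsdecode(raw))


raw = b'dir/\xe9.txt'
p = path_from_bytes(raw)

# bytes(p) encodes str(p) with os.fsencode, which reverses the escaping
round_tripped = bytes(p)
assert round_tripped == raw
assert os.fsencode(p) == raw
```

Note that pathlib normalises what it is given (for example, it collapses repeated slashes and drops a trailing slash), so the round trip is only byte-exact for paths that are already in normalised form.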

Demo:

>>> import os, pathlib
>>> b = b'\xe9'
>>> p = pathlib.Path(os.fsdecode(b))
>>> p.exists()
False
>>> with open(b, mode='w') as f:
...     f.write('wacky filename')
...
14
>>> p.exists()
True
>>> p.read_bytes()
b'wacky filename'
>>> os.listdir(b'.')
[b'\xe9']
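
For the case mentioned in the question, where paths arrive as NUL-separated bytes, the same approach might look like the sketch below. It assumes a POSIX system; paths_from_nul_separated and the sample data are made up for illustration.

```python
import os
from pathlib import PurePosixPath


def paths_from_nul_separated(data):
    """Hypothetical helper: split NUL-separated bytes into path objects."""
    # Skip empty chunks, e.g. the one produced by a trailing NUL terminator
    return [
        PurePosixPath(os.fsdecode(chunk))
        for chunk in data.split(b'\0')
        if chunk
    ]


# Illustrative input, as might be produced by `find -print0`
data = b'/tmp/\xe9\0/tmp/plain.txt\0'
paths = paths_from_nul_separated(data)

# Each path converts back to exactly the bytes it was built from
assert [bytes(p) for p in paths] == [b'/tmp/\xe9', b'/tmp/plain.txt']
```

Swapping PurePosixPath for pathlib.Path gives objects that can be opened or stat'ed. Be aware that a str containing lone surrogates generally cannot be printed or encoded as strict UTF-8, so for display it is probably safer to show repr(p) or bytes(p).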