Mark Galeck Mark Galeck - 3 months ago 10
Bash Question

How are Linux shells and filesystem Unicode-aware?

I understand that Linux filesystem stores file names as byte sequences which is meant to be Unicode-encoding-independent.

But, encodings other than UTF-8 or Enhanced UTF-8 may very well use 0 byte as part of a multibyte representation of a Unicode character that can appear in a file name. And everywhere in Linux filesystem C code you terminate strings with 0 byte. So how does Linux filesystem support Unicode? Does it assume all applications that create filenames use UTF-8 only? But that is not true, is it?

Similarly, the shells (such as bash) use

*
in patterns to match any number of filename characters. I can see in shell C code that it simply uses the ASCII byte for
*
and goes byte-by-byte to delimit the match. Fine for UTF-8 encoded names, because it has the property that if you take the byte representation of a string, then match some bytes from the start with
*
, and match the rest with another string, then the bytes at the beginning in fact matched a string of whole characters, not just bytes.

But other encodings do not have that property, do they? So again, do shells assume UTF-8?

Answer

There is no encoding of filenames in Linux (in ext family of filesystems at any rate). Filenames are sequences of bytes, not characters. It is up to application programs to interpret these bytes as UTF-8 or anything else. The filesystem doesn't care.

POSIX stipulates that the shell obeys the locale environment vsriables such as LC_CTYPE when performing pattern matches. Thus, pattern-matching code that just compares bytes regardless of the encoding would not be compatible with your hypothetical encoding, or with any stateful encoding. But this doesn't seem to matter much as such encodings are not commonly supported by existing locales. UTF-8 on the other hand seems to be supported well: in my experiments bash correctly matches the ? character with a single Unicode character, rather than a single byte, in the filename (given an UTF-8 locale) as prescribed by POSIX.

Comments