Thomja - 6 months ago
Python Question

Save file with Russian letters in the file name

I have a Python script that takes information from a webpage and saves it to a text file. The name of this text file changes from time to time, and it sometimes contains Cyrillic letters and sometimes Korean letters.

The problem is that if I try to save the file with the name "бореиская", the name looks very weird when I view it in Windows.

I'm guessing I need to change the encoding somewhere. The name is passed to the open() function:

server = "бореиская"
file = open("eu_" + server + ".lua", "w")


Earlier in the script, I take the server variable from an array that already contains all the names.

But as previously mentioned, in Windows, the names appear with some very weird characters.

Answer

tl;dr

Always use Unicode strings for file names and paths. E.g.:

import io
import os

io.open(u"myfile€.txt")
os.listdir(u"mycrazydirß")

In your case:

server = u"бореиская"
file = open(u"eu_" + server + ".lua", "w")

I assume server will come from another location, so you will need to ensure that it's decoded to a Unicode string correctly. See io.open().
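As a rough sketch of that decoding step, assuming the name arrives as a UTF-8 byte string (the real source encoding may differ, so adjust it to match). The helpers to_unicode and save_server_file are hypothetical names made up for this example, not part of any library:

```python
# -*- coding: utf-8 -*-
# The coding line above only matters on Python 2.
import io
import os
import tempfile


def to_unicode(name, encoding="utf-8"):
    """Return name as a Unicode string, decoding it if it is a byte string."""
    # Assumption: byte strings are UTF-8 unless the caller says otherwise.
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name


def save_server_file(directory, server, contents):
    """Write contents to eu_<server>.lua inside directory and return the path."""
    path = os.path.join(directory, u"eu_" + to_unicode(server) + u".lua")
    # io.open takes a Unicode path and an explicit encoding for the file body.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(contents)
    return path


# Stand-in for a name that was read as bytes from somewhere else.
raw_name = u"бореиская".encode("utf-8")
saved_path = save_server_file(tempfile.mkdtemp(), raw_name, u"-- data\n")
```

The point is to decode once, as early as possible, and to pass only Unicode strings to io.open() from then on.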

Explanation

Windows

Windows stores filenames as UTF-16. The Windows I/O API and Python hide this detail, but they require Unicode strings; otherwise, a byte string has to be encoded in the correct 8-bit code page.
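The "weird characters" are most likely the bytes of one encoding being read as another. A small illustration, assuming the name was encoded as UTF-8 and then interpreted as cp1251 (the code page on your machine may well be a different one):

```python
# -*- coding: utf-8 -*-
# Illustration only: cp1251 is an assumed code page, not necessarily yours.
name = u"бореиская"

utf8_bytes = name.encode("utf-8")      # each Cyrillic letter becomes 2 bytes
garbled = utf8_bytes.decode("cp1251")  # the same bytes read with the wrong code page

# 9 letters turn into 18 bytes, which show up as 18 unrelated characters.
print(len(name), len(utf8_bytes), len(garbled))

# Undoing the wrong step recovers the original name.
restored = garbled.encode("cp1251").decode("utf-8")
```

Passing a Unicode string avoids the problem entirely, because no 8-bit code page is involved.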

Linux

Filenames can be made from any byte string, in any encoding, as long as the name is not "." or ".." and contains no "/" or NUL byte. Because each user can have their own encoding, you can't guarantee that the encoding one user used is the same as another's. The locale configures each user's environment, and the user's terminal encoding needs to match it for names to display consistently.

The best that can be hoped for is that a user hasn't changed their locale and that all applications use the same locale. For example, the default locale may be en_GB.UTF-8, meaning the encoding of files and filenames should be UTF-8.

When Python is given a Unicode filename, it uses the user's locale to encode it (and to decode the names it reads back). A byte string that is already encoded is passed directly to the kernel, meaning you may get lucky with UTF-8 encoded filenames.
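To see which encoding Python picked up, sys.getfilesystemencoding() reports it. The sketch below uses a hypothetical helper, filename_bytes, to show how the same Unicode name turns into different bytes under two encodings (UTF-8 and KOI8-R are just example choices):

```python
import sys


def filename_bytes(name, encoding=None):
    """Encode a Unicode filename roughly the way it would reach the kernel."""
    # Assumption: fall back to Python's filesystem encoding when none is given.
    encoding = encoding or sys.getfilesystemencoding()
    return name.encode(encoding)


# The encoding Python derived from the environment, e.g. from the locale.
print(sys.getfilesystemencoding())

letter = u"\u0431"  # CYRILLIC SMALL LETTER BE
as_utf8 = filename_bytes(letter, "utf-8")
as_koi8 = filename_bytes(letter, "koi8-r")

# Same letter, different bytes on disk depending on the encoding in use.
print(repr(as_utf8), repr(as_koi8))
```

Two users with different locales would therefore store, and later display, the same name differently.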

OS X

OS X's filenames are always UTF-8 encoded, regardless of the user's locale. Therefore, a filename should be a Unicode string, but it may be encoded in the user's locale and will be translated. As most users' locales are *.UTF-8, this means you can actually pass either a UTF-8 encoded string or a Unicode string.

Roundup

For best cross-platform compatibility, always use Unicode strings, as in most cases they will be translated to the correct encoding. Linux has the most ambiguity, because some applications may ignore the default locale, or a user may have changed their locale to a non-UTF-8 one.