I have this Python script that takes the info of a webpage and then saves this info to a text file. But the name of this text file changes from time to time and it can changes to Cyrillic letters sometimes, and some times Korean letters.
The problem is that say I'm trying to save the file with the name "бореиская" then the name will appear very weird when I'm viewing it in Windows.
I'm guessing I need to change some encoding at some places. But the name is being sent to the
server = "бореиская"
file = open("eu_" + server + ".lua", "w")
Always use Unicode strings for file names and paths. E.g.:
In your case:
server = u"бореиская" file = open(u"eu_" + server + ".lua", "w")
server will come from another location, so you will need to ensure that it's decoded to a Unicode string correctly. See
Windows stores filenames using UTF-16. The Windows i/o API and Python hides this detail but requires Unicode strings, else a string will have to use the correct 8bit codepage.
Filenames can be made from any byte string, in any encoding, as long as it's not ASCII "." or "..". As each system user can have their own encoding, you really can't guarantee the encoding one user used is the same as another. The
locale is used to configure each user's environment. The user's terminal encoding also needs to match the encoding for consistency.
The best that can be hoped is that a user hasn't changed their locale and all applications are using the same locale. For example, the default locale may be:
en_GB.UTF-8, meaning the encoding of files and filenames should be UTF-8.
When Python encounters a Unicode filename, it will use the user's locale to decode/encode filenames. An encoded string will be passed directly to the kernel, meaning you may get lucky with using "UTF-8" filenames.
OS X's filenames are always UTF-8 encoded, regardless of the user's locale. Therefore, a filename should be a Unicode string, but may be encoded in the user's locale and will be translated. As most user's locales are
*.UTF-8, this means you can actually pass a UTF-8 encoded string or a Unicode string.
For best cross-platform compatibility, always use Unicode strings as in most cases they will be translated to the correct encoding. It's really just Linux that has the most ambiguity, as some applications may choose to ignore the default locale or a user may have changed their locale to a non-UTF-8 version.