David David - 9 months ago 25
Python Question

Processing non-english text

I have a Python file that reads a file given by the user, processes it, and asks questions in flash card format. The program works fine with an English txt file, but I encounter errors when trying to process a French file.

When I first encountered the error, I was running the program from the Windows command prompt. When I gave it the French file, I immediately got an error. After digging around, I found that it might have something to do with the fact that I was using the cmd window, so I tried IDLE instead. I didn't get any errors there, but I would get weird characters in the output.

Upon further research, I found some documentation that says to pass encoding='insert encoding type' to the open() call in my code. After running the program again in IDLE, this seemed to reduce the problem, but I would still get some weird characters. When running it in cmd, it wouldn't break IMMEDIATELY, but it eventually would when it encountered an unknown character.

My question: what do I implement to ensure the program can handle ALL of the characters in the file (given any language), and why do IDLE and the command prompt handle the file differently?

EDIT: I forgot to mention that I ended up using utf-8 which gave the results I described.


It's a common question. It seems you're running under cmd, whose console works with a legacy code page rather than Unicode, so the error occurs when your output is translated into the encoding cmd uses. Unicode has a far wider character set than that code page, so any character with no equivalent in it raises an error.
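The failure can be reproduced without a console, since it is just an encoding step. Here is a minimal sketch, assuming the console code page is cp437 (a common cmd default on US-English Windows; yours may well differ). The helper name to_console_bytes is made up for illustration.

```python
def to_console_bytes(text, console_encoding="cp437", errors="strict"):
    """Encode text the way it must be encoded before reaching a byte-oriented console."""
    return text.encode(console_encoding, errors=errors)


word = "cœur"  # 'œ' (U+0153) has no equivalent in cp437

try:
    to_console_bytes(word)
    failed = False
except UnicodeEncodeError:
    # This is the kind of error that strict encoding raises
    failed = True

# errors="replace" substitutes '?' for unencodable characters instead of raising
fallback = to_console_bytes(word, errors="replace")
```

On Python 3.7 and later, calling sys.stdout.reconfigure(errors="replace") should apply the same fallback to print() itself, provided sys.stdout is a regular text stream (it may not be inside IDLE).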

IDLE is built on top of tkinter's Text widget, which handles Python's Unicode strings well.

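And finally, when you open a file without specifying an encoding, the open function assumes it's in the platform default (per locale.getpreferredencoding(False)), which on Windows is typically a legacy code page such as cp1252. Note that sys.getdefaultencoding() is something different: in Python 3 it's always 'utf-8'. So if your file's encoding differs from the platform default, you should state it explicitly in the encoding keyword argument to open. In Python 2 you could alternatively call something like sys.setdefaultencoding('utf-8') (after reload(sys)), but be careful: it may break poorly designed library code, and it doesn't exist at all in Python 3.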
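To see why the encoding argument matters, here is a small sketch that writes a French word to a temporary file as UTF-8 and reads it back twice. The helper name read_text and the sample word are made up for illustration, and cp1252 only stands in for a typical Windows default.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file, decoding it with an explicit encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


word = "café"

# Write the sample as UTF-8 so the file's real encoding is known
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(word)
    path = tmp.name

try:
    good = read_text(path, encoding="utf-8")      # matches the file's encoding
    garbled = read_text(path, encoding="cp1252")  # wrong guess: no error, just mojibake
finally:
    os.remove(path)
```

The wrong guess does not necessarily fail. Here it silently turns the two UTF-8 bytes of 'é' into two separate characters, which is the likely source of the "weird characters" seen in IDLE.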