I have a list of German words and I want to eliminate all nouns, so I check whether the first letter is uppercase or lowercase. This works for all words except those that begin with an umlaut, e.g.:
# -*- coding: utf-8 -*-
dictionary = open('dictionary/de.dict', 'r')
for line in dictionary:
    if line == "Ä": # This does not work
        print "Ä found"
The UTF-8-encoded byte string "Ä" consists of two bytes:
>>> "Ä"
'\xc3\x84'
The unicode string u"Ä" is only one character.
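The difference can be checked directly. The following is a minimal sketch, not part of the original answer; the variable names `text` and `raw` are only illustrative, and the code is written so that it should behave the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

text = u"Ä"                     # unicode string: one code point
raw = text.encode("utf-8")      # the same letter as UTF-8 bytes

print(len(text))                # number of characters
print(len(raw))                 # number of bytes
print(repr(raw))                # the two bytes 0xC3 0x84

# Decoding the bytes gives back the original one-character string,
# and case checks on the unicode string work as expected.
print(raw.decode("utf-8") == text)
print(text.isupper())
```

Case methods such as isupper() are only reliable for umlauts once the bytes have been decoded into a unicode string.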
You have to decode the file's contents correctly, so that you work with unicode strings instead of raw bytes. So if your dictionary is encoded in UTF-8, use:
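import io

dictionary = io.open('dictionary/de.dict', encoding='utf8')
for line in dictionary:
    # io.open yields unicode lines, so isupper() handles umlauts.
    # Test only the first letter: nouns are capitalized, not all caps.
    if line[:1].isupper():
        print "Uppercase word", line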
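To get from detecting capitalized words to actually removing the nouns, something like the following might work. It is only a sketch under assumptions: the helper name `drop_capitalized` and the sample words are made up for illustration, and the list stands in for the decoded lines that io.open would yield from the dictionary file. Like the approach in the question, it treats every capitalized entry as a noun.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def drop_capitalized(words):
    """Return the words whose first character is not uppercase.

    Hypothetical helper: expects unicode strings, e.g. lines read
    with io.open(..., encoding='utf8').
    """
    kept = []
    for word in words:
        word = word.strip()              # remove the trailing newline
        if word and not word[0].isupper():
            kept.append(word)
    return kept


# Made-up sample standing in for the decoded lines of the dictionary.
sample = [u"Ärger\n", u"ändern\n", u"Haus\n", u"laufen\n", u"über\n"]
result = drop_capitalized(sample)
print(result)
```

With a real file, the same function could be called as drop_capitalized(io.open('dictionary/de.dict', encoding='utf8')), since iterating over the file object yields one unicode line at a time.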