Shristi Baral - 2 months ago
Python Question

Make list of unicode words that are in a file

My code is

import codecs

f = codecs.open(r'C:\Users\Admin\Desktop\nepali.txt', 'r', 'UTF-8')
nepali = f.read().split()
for i in nepali:
    print i


This displays the words in the file:

यो
किताब
टेबुल
मा

यो
एक
किताब
हो
केटा


But when I try to create a list of the words with this code:

file=codecs.open(r'C:\Users\Admin\Desktop\nepali.txt', 'r', 'UTF-8')
nepali = list(file.read().split())
print nepali


The output is now displayed like this:

[u'\ufeff\u092f\u094b', u'\u0915\u093f\u0924\u093e\u092c', u'\u091f\u0947\u092c\u0941\u0932', u'\u092e\u093e', u'\u091b', u'\u092f\u094b', u'\u090f\u0915', u'\u0915\u093f\u0924\u093e\u092c', u'\u0939\u094b']


The output should look like:

[यो, किताब, टेबुल, मा, छ, यो, एक, किताब, हो]

Answer

You are looking at the output of the repr() function, which containers always use to display their contents. That output is meant for debugging, not for end-user display: any non-printable or non-ASCII codepoint is represented by an escape sequence. Depending on the codepoint, that is either a single-character escape like \t or \n, or an escape with 2, 4, or 8 hex digits, like \xe5, \u2603 or \U0001f4e2.
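To see that a container really embeds the repr() of each element, here is a minimal sketch using a string with a tab character. The variable names are illustrative only. It is written to run on both Python 2 and Python 3; note that Python 3's repr() leaves printable non-ASCII characters as they are, so the \uXXXX escapes in the question are specific to Python 2.

```python
# A string containing a real tab character (non-printable).
word = u'a\tb'

# str() of a list is the text that printing the list would display.
shown = str([word])

# repr() of the element on its own.
element_repr = repr(word)

# The list display contains the element's repr(), not the raw string:
# the tab shows up as the two characters backslash + t.
contains_repr = element_repr in shown
tab_escaped = u'\\t' in shown and u'\t' not in shown
```

Both `contains_repr` and `tab_escaped` come out true, which is the same mechanism that turns the Nepali words into escape sequences when the whole list is printed under Python 2.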

You'll have to produce the output manually:

print u'[{}]'.format(u', '.join(nepali))

This produces a unicode string formatted to look like a list object, without using repr(): the strings are joined with ', ' (comma and space) and the result is wrapped in square brackets.

Demo:

>>> nepali = [u'\ufeff\u092f\u094b', u'\u0915\u093f\u0924\u093e\u092c', u'\u091f\u0947\u092c\u0941\u0932', u'\u092e\u093e', u'\u091b', u'\u092f\u094b', u'\u090f\u0915', u'\u0915\u093f\u0924\u093e\u092c', u'\u0939\u094b',]
>>> print u'[{}]'.format(u', '.join(nepali))
[यो, किताब, टेबुल, मा, छ, यो, एक, किताब, हो]
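The first word in the list still starts with \ufeff, which prints invisibly. That codepoint is a byte order mark, which suggests (this is an assumption, the question does not say so) that the file was saved as UTF-8 with a BOM. A hedged sketch of one way to drop it is to read with the utf-8-sig codec instead of UTF-8. The temporary file and its name sample.txt are made up for illustration, and io.open is used here so the sketch runs on Python 2 and 3; codecs.open accepts the same codec name.

```python
import codecs
import io
import os
import tempfile

# Create a small UTF-8 file that begins with a BOM (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with io.open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + u'\u092f\u094b \u090f\u0915'.encode('utf-8'))

# Plain UTF-8 keeps the BOM as part of the first word.
with io.open(path, 'r', encoding='utf-8') as f:
    with_bom = f.read().split()

# utf-8-sig strips a leading BOM if one is present.
with io.open(path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read().split()
```

Here `with_bom[0]` begins with u'\ufeff' while `without_bom[0]` does not, so comparisons such as `word == u'\u092f\u094b'` only succeed for the second list.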

However, if you want to show this to an end-user, why use the square brackets at all?
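As a rough sketch of what dropping the brackets could look like, the same join can be used on its own with whatever separator suits the display. The variable names below are illustrative, and the shortened word list is sample data taken from the question.

```python
nepali = [u'\u092f\u094b', u'\u090f\u0915', u'\u0915\u093f\u0924\u093e\u092c']

# List-like display, as in the answer above.
bracketed = u'[{}]'.format(u', '.join(nepali))

# The same words without brackets, comma-separated.
plain = u', '.join(nepali)

# One word per line, similar to the loop in the question.
one_per_line = u'\n'.join(nepali)
```

Each result is an ordinary unicode string, so printing it shows the words themselves instead of escape sequences.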