sun eye - 1 year ago 78
Python Question

# Calculate the total number of unique words in the first column of a list

I have a file that consists of many Persian sentences. Each line contains a sentence, then a tab, then a Persian word, then another tab, and then an English word. I only need the number of unique words in the sentences (the words after the tabs should not be counted). To do that I read the file into a list, so each entry of the list has three items: the sentence, a Persian word, and an English word. Now I can access the sentences. The problem is that the code I wrote returns the number of unique words of each line separately. For example, if the file has 100 lines it prints 100 numbers, each on a new line. But I want a single number that shows the total number of unique words across all sentences. How can I change the code?

```python
from hazm import *

def WordsProbs(file):
    corpus = []  # one [sentence, Persian word, English word] entry per line
    with open(file, encoding="utf-8") as f1:
        normalizer = Normalizer()
        for line in f1:
            tmp = line.strip().split("\t")
            tmp[0] = normalizer.normalize(tmp[0])
            corpus.append(tmp)
    for row in corpus:
        # Counts unique words per sentence, so one number is printed per line.
        UniqueWords = len(set(row[0].split()))
        print(UniqueWords)
```

The sample data:

باد بارش برف وزش باد، کولاک یخبندان سطح wind

Assuming tmp[0] contains the sentence from each line, the words of every sentence can be collected in a single set, so the total can be counted without building a corpus.

```python
from hazm import Normalizer

def WordsProbs(file):
    words = set()  # unique words across all sentences
    normalizer = Normalizer()
    with open(file, encoding="utf-8") as f1:
        for line in f1:
            tmp = line.strip().split("\t")
            # Normalize the sentence string first, then split it into words.
            words.update(normalizer.normalize(tmp[0]).split())
    print(len(words), "unique words")
```

I can't test it because on my machine, the English word "wind" shows up in the first column after cutting and pasting your sample data.
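For checking the counting logic without hazm installed, here is a minimal sketch of the same idea in plain Python. The function name `count_unique_words`, the optional `normalize` argument, and the two sample lines are my own assumptions for illustration, not part of your file. It also shows why adding up the per-line counts is not the same as counting unique words overall: a word that appears in two sentences is counted twice in the sum but only once in the set.

```python
def count_unique_words(lines, normalize=None):
    """Count distinct words in the first tab-separated field of all lines."""
    words = set()
    for line in lines:
        sentence = line.strip().split("\t")[0]
        if normalize is not None:
            # Optional hook, e.g. a normalizer's normalize method.
            sentence = normalize(sentence)
        words.update(sentence.split())
    return len(words)


# Hypothetical sample lines: sentence, TAB, Persian word, TAB, English word.
sample = [
    "باد بارش برف\tباد\twind",
    "برف وزش باد\tبرف\tsnow",
]

total = count_unique_words(sample)
per_line_sum = sum(len(set(l.split("\t")[0].split())) for l in sample)

print(total)         # 4 distinct words across both sentences
print(per_line_sum)  # 6, because the shared words are counted once per line
```

With hazm available, passing `Normalizer().normalize` as `normalize` should give the same behaviour as the function above. Note that splitting on whitespace treats "باد،" (with the comma attached) and "باد" as two different words, so a tokenizer such as hazm's `word_tokenize` may be a better fit if punctuation matters.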
