sun eye - 2 months ago
Python Question

Calculate the total number of unique words in the first column of a list

I have a file that consists of many Persian sentences. Each line contains a sentence, then a tab, then a Persian word, then another tab, and then an English word. I only need the number of unique words in the sentences (the words after the tabs should not be counted). To do that, I read the file into a list, so each element of the list holds three items: the sentence, a Persian word, and an English word. I can now access the sentences. The problem is that the code I wrote returns the number of unique words for each line separately. For example, if the file has 100 lines, it prints 100 numbers, each on a new line. But I want all of that combined into just one number that shows the total number of unique words. How can I change the code?

from hazm import *

def WordsProbs(file):
    with open(file, encoding="utf-8") as f1:
        normalizer = Normalizer()
        for line in f1:
            tmp = line.strip().split("\t")
            tmp[0] = normalizer.normalize(tmp[0])
            corpus.append(tmp)
    for row in corpus:
        UniqueWords = len(set(row[0].split()))
        print(UniqueWords)


The sample data:

باد بارش برف وزش باد، کولاک یخبندان سطح wind

Answer

Assuming tmp[0] contains the sentence from each line, the individual words in the sentence can be counted without building a corpus.

from hazm import Normalizer

def WordsProbs(file):
    words = set()  # unique words collected across all sentences
    normalizer = Normalizer()
    with open(file, encoding="utf-8") as f1:
        for line in f1:
            tmp = line.strip().split("\t")
            # normalize() takes a string, so normalize the sentence first, then split it
            words.update(normalizer.normalize(tmp[0]).split())
    print(len(words), "unique words")

I can't test it because on my machine, the English word "wind" shows up in the first column after cutting and pasting your sample data.
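For a version that can be checked without hazm installed, here is a minimal sketch of the same idea: keep one set for the whole file and add each sentence's words to it. The function name `count_unique_words`, the optional `normalize` argument, and the two sample rows are made up for illustration; with hazm you would presumably pass `Normalizer().normalize` as `normalize`.

```python
import io


def count_unique_words(lines, normalize=None):
    """Count distinct whitespace-separated words in the first tab-separated column."""
    words = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        sentence = line.split("\t")[0]  # ignore everything after the first tab
        if normalize is not None:
            sentence = normalize(sentence)  # hypothetical hook, e.g. hazm's normalizer
        words.update(sentence.split())
    return len(words)


# Made-up rows in the question's layout: sentence <tab> Persian word <tab> English word
sample = io.StringIO(
    "باد بارش برف باد\tباد\twind\n"
    "برف وزش باد\tبرف\tsnow\n"
)
total = count_unique_words(sample)
print(total, "unique words")
```

Note that splitting on whitespace leaves punctuation attached, so a token such as "باد،" (with the comma) would count separately from "باد" unless the text is tokenized or the punctuation is stripped first.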