I have a file that consists of many Persian sentences. Each line contains a sentence, then a tab, then a Persian word, another tab, and then an English word. I need to count only the unique words of the sentences (the words after the tabs should not be included in the calculation). To do that, I split each line of the file, so I have a list of lines where each line has three fields: the sentence, a Persian word, and an English word. Now I can access the sentences. The problem is that the code I wrote returns the number of unique words of each line separately. For example, if the file has 100 lines, it prints 100 numbers, each on a new line. But I want the sum of all those numbers, so that I get just one number showing the total number of unique words. How can I change the code?
from hazm import *

def WordsProbs(file):
    with open(file, encoding="utf-8") as f1:
        normalizer = Normalizer()
        for line in f1:
            tmp = line.strip().split("\t")
            tmp = normalizer.normalize(tmp)
            for row in corpus:
                UniqueWords = len(set(row.split()))
Assuming tmp holds the tab-separated fields of each line (so tmp[0] is the sentence), the individual words of every sentence can be collected into one set without building a corpus. Note that Normalizer.normalize expects a string, so the sentence should be normalized first and split into words afterwards:
from hazm import *

def WordsProbs(file):
    words = set()
    with open(file, encoding="utf-8") as f1:
        normalizer = Normalizer()
        for line in f1:
            tmp = line.strip().split("\t")
            # tmp[0] is the sentence; normalize it, then add its words to the set
            words.update(normalizer.normalize(tmp[0]).split())
    print(len(words), "unique words")
I can't test it because on my machine, the English word "wind" shows up in the first column after cutting and pasting your sample data.
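For what it's worth, the counting logic itself can be checked without hazm at all. Below is a minimal standard-library sketch of the same idea (no normalization step; the function name count_unique_sentence_words is just illustrative):

```python
def count_unique_sentence_words(lines):
    """Count distinct whitespace-separated tokens appearing in the
    first tab-separated column (the sentence) across all lines."""
    words = set()
    for line in lines:
        sentence = line.strip().split("\t")[0]  # drop the two word columns
        words.update(sentence.split())          # one shared set for the whole file
    return len(words)

sample = [
    "the quick fox\tword1\twordA",
    "the lazy dog\tword2\twordB",
]
print(count_unique_sentence_words(sample))  # 5 ("the" is counted once)
```

The key difference from your original code is that the set lives outside the loop: one set accumulates words from every line, and len() is taken once at the end, so you get a single total instead of one number per line.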