David Chan Chi Fung David Chan Chi Fung - 1 month ago 9
Python Question

Python: speed up my looping by avoiding repeating the same comparison

I am working on an algorithm that checks similarity between English words.

Having defined a function called 'similarity', I go though the whole list of words to check for similar words, if the similarity between two words is very high (=1) the algorithm will change one of the two words to the other one.

Here is the logic:

list_of_word = [word1, word2, word3, word4]


assume that there is a very high similarity between word1 and word4.

Result:

list_of_word = [word1, word2, word3, word1]


normally, I just need to loop it, the steps taken will be like:


  1. take word1 out, comparing to word1, word2, word3, word4

  2. take word2 out, comparing to word1, word2, word3, word4

  3. take word3 out, comparing to word1, word2, word3, word4

  4. take word4 out, comparing to word1, word2, word3, word4



However, there are some useless and repeated actions. For example, I don't have to compare word1 and word2 more than once.

The problem is that I have to go through 1 million words and it might take many days to run.

Any advice?

Here is the code I am using at the moment:

from nltk.corpus import wordnet as wn
from itertools import product

def similarity(wordx,wordy):
sem1, sem2= wn.synsets(wordx), wn.synsets(wordy)
maxscore = 0
for i,j in list(product(*[sem1,sem2])):
score = i.path_similarity(j) # Wu-Palmer Similarity
maxscore = score if maxscore < score else maxscore
return maxscore

def group_high_similarity(target_list,tc):
result = target_list[:]
num_word = len(result)
for word in result:
wordx = word
i = 0
while i<len(result):
wordy = result[i]
value = similarity(wordx,wordy)
if value >= tc:
result[i] = wordx
if wordy != wordx :print wordy+"---> "+ wordx
i += 1
return result

Answer

Assume all your word list are not repetitive (means you already put them into set)

IMHO, you can apply set theory math in similarity.

If A similar to B, while X also similar to B, it means A also similar to X.

So you have a set of words ["car", "bus", "cat", "dog", "pen", "duck", "motorcycle"]

As in similarity attribute of "motor vehicle, if "car" similar to "bus", and "car" similar to "motorcycle". Thus "bus" also similar to "motorcycle". So you can see, you don't need to compare all similar word that have been found. So after the "car" similarity compare is done, it already taken away ["car", "bus", "motorcycle"]. "bus", "motorcycle" don't need to use for comparison again.

you only remains ["cat", "dog", "pen", "duck"] ,etc.

What you need to do next is keeping an index where this similar located. Perhaps a second pass to check the distance score.

(update) IMPORTANT NOTE : In natural language, same verb and noun can have multiple meaning, e.g. chicken may means coward. E.g. you may miss combined words, proverb,etc. E.g. chicken out has nothing to do with chicken going out; In vitro is a verb that you can't split them up. Above method are pretty aggressive. But you need to start from somewhere, then incrementally add more features to perfect them.