I am working on an algorithm that checks similarity between English words.
Having defined a function called `similarity`, I go through the whole list of words to check for similar words. If the similarity between two words is very high (at or above a threshold `tc`, close to 1), the algorithm changes one of the two words to the other one.
Here is the logic:
list_of_word = [word1, word2, word3, word4]
# if word4 is highly similar to word1, it becomes:
list_of_word = [word1, word2, word3, word1]
from itertools import product
from nltk.corpus import wordnet as wn

def similarity(wordx, wordy):
    sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)
    maxscore = 0
    for i, j in product(sem1, sem2):
        score = i.path_similarity(j)  # shortest-path similarity (Wu-Palmer would be wup_similarity)
        if score is not None and score > maxscore:
            maxscore = score
    return maxscore
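One pitfall worth noting: `path_similarity` returns `None` when no path connects two synsets, so the max-over-pairs loop needs a guard. A minimal sketch of that pattern with stand-in scores (the synset lists and score table below are invented for illustration, so it runs without WordNet):

```python
from itertools import product

# Stand-in similarity table; real code would call synset.path_similarity,
# which may return None when no path connects the two synsets.
scores = {("a", "x"): 0.5, ("a", "y"): None, ("b", "x"): 0.9, ("b", "y"): 0.1}

def pairwise_max(sem1, sem2, score):
    """Maximum score over all synset pairs, treating None as no evidence."""
    best = 0.0
    for i, j in product(sem1, sem2):
        s = score(i, j)
        if s is not None and s > best:
            best = s
    return best

print(pairwise_max(["a", "b"], ["x", "y"], lambda i, j: scores[(i, j)]))  # 0.9
```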
result = target_list[:]
num_word = len(result)
for idx_x in range(num_word):
    wordx = result[idx_x]
    for idx_y in range(idx_x + 1, num_word):
        wordy = result[idx_y]
        value = similarity(wordx, wordy)
        if value >= tc:
            if wordy != wordx:
                print(wordy + " ---> " + wordx)
            result[idx_y] = wordx
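Wrapping this loop in a function makes it easier to test in isolation. Here is a self-contained sketch; `toy_similarity` is a stand-in for the WordNet-based `similarity` above, and the word pairs it scores are invented for the demo:

```python
def merge_similar(words, similarity, tc=1.0):
    """Replace each word with the earliest word it is similar to."""
    result = words[:]
    for idx_x in range(len(result)):
        wordx = result[idx_x]
        for idx_y in range(idx_x + 1, len(result)):
            wordy = result[idx_y]
            if wordy != wordx and similarity(wordx, wordy) >= tc:
                print(wordy + " ---> " + wordx)
                result[idx_y] = wordx
    return result

# Invented scores for the demo; real code would use the WordNet function.
def toy_similarity(a, b):
    pairs = {frozenset(("car", "automobile")), frozenset(("dog", "hound"))}
    return 1.0 if frozenset((a, b)) in pairs else 0.0

print(merge_similar(["car", "automobile", "dog", "hound", "pen"], toy_similarity))
# ['car', 'car', 'dog', 'dog', 'pen']
```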
Assume your word list has no duplicates (i.e. you have already put the words into a set).
IMHO, you can apply set-theory reasoning to similarity.
If A is similar to B, and X is also similar to B, then A is similar to X.
Say you have a set of words ["car", "bus", "cat", "dog", "pen", "duck", "motorcycle"].
Under the similarity attribute of "motor vehicle": if "car" is similar to "bus", and "car" is similar to "motorcycle", then "bus" is also similar to "motorcycle". So you don't need to compare every similar word that has already been found. Once the comparison pass for "car" is done, it has already taken away ["car", "bus", "motorcycle"]; "bus" and "motorcycle" never need to be compared again.
Only ["cat", "dog", "pen", "duck"] remains, etc.
What you need to do next is keep an index of where each similar word is located. Perhaps run a second pass to check the distance scores.
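The skip-already-clustered idea above can be sketched as a greedy pass: pick the next unclustered word, claim every still-unclustered word similar to it, record their indices, and never compare the claimed words again. `toy_similarity` and the word pairs it scores are invented for illustration:

```python
def cluster_words(words, similarity, tc=1.0):
    """Greedy clustering: each new word claims all still-unclustered
    words similar to it, so claimed words are never compared again."""
    clusters = {}      # representative word -> indices of its members
    claimed = set()
    for i, wordx in enumerate(words):
        if i in claimed:
            continue
        members = [i]
        for j in range(i + 1, len(words)):
            if j not in claimed and similarity(wordx, words[j]) >= tc:
                members.append(j)
                claimed.add(j)
        clusters[wordx] = members
    return clusters

# Invented scores for the demo; real code would use a WordNet measure.
def toy_similarity(a, b):
    similar = {frozenset(("car", "bus")), frozenset(("car", "motorcycle"))}
    return 1.0 if frozenset((a, b)) in similar else 0.0

words = ["car", "bus", "cat", "dog", "pen", "duck", "motorcycle"]
print(cluster_words(words, toy_similarity))
# {'car': [0, 1, 6], 'cat': [2], 'dog': [3], 'pen': [4], 'duck': [5]}
```

Note that "bus" and "motorcycle" are never compared directly: the sketch relies on the transitivity assumption, which is exactly what a second scoring pass would double-check.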
(update) IMPORTANT NOTE: In natural language, the same verb or noun can have multiple meanings, e.g. "chicken" may mean coward. You may also miss compound words, idioms, etc.: "chicken out" has nothing to do with a chicken going out, and "in vitro" is a phrase you can't split up. The above method is pretty aggressive, but you need to start from somewhere and then incrementally add features to perfect it.