safwan safwan - 6 days ago 5
Python Question

how to compare each element in a list with each element in another list?

i want to compare a list of extracted promo codes with a list of correct promo codes.

if the promo code in the extracted_list which is being compared with the promo codes in the correct_promo_code list does not find an exact match then it means that promo code has errors. in order to find the correct promo code from the correct_promo_codes list, i need to find the promo code with least edit distance (levenshtein distance) with the one being compared (from the extracted_list).

code till now:-

import csv

with open("all_correct_promo.csv","rb") as file1:
reader1 = csv.reader(file1)
correctPromoList = list(reader1)
#print correctPromoList

with open("all_extracted_promo.csv","rb") as file2:
reader2 = csv.reader(file2)
extractedPromoList = list(reader2)
#print extractedPromoList

incorrectPromo = []
count = 0
for extracted in extractedPromoList:
if(extracted not in correctPromoList):
incorrectPromo.append(extracted)
else:
count = count + 1
#print incorrectPromo

for promos in incorrectPromo:
print promos

Answer

According to nltk docs

nltk.metrics.distance.edit_distance(s1, s2, transpositions=False)

calculates the Levenshtein edit-distance between two strings. The edit distance is the number of characters that need to be substituted, inserted, or deleted, to transform s1 into s2. For example, transforming “rain” to “shine” requires three steps, consisting of two substitutions and one insertion: “rain” -> “sain” -> “shin” -> “shine”. These operations could have been done in other orders, but at least three steps are needed.

Coming to your Code, I think some changes in the bottom half will capture the edit distance -

from nltk.metrics import distance # slow to load

extractedPromoList = ['abc','acd','abd'] # csv of extracted promo codes dummy
correctPromoList = ['abc','aba','xbz','abz','abx'] # csv to real promo codes dummy

def find_min_edit(str_,list_):
    nearest_correct_promos = []
    distances = {}
    min_dist = 100 # arbitrary large assignment
    for correct_promo in list_:
        dist = distance.edit_distance(extracted,correct_promo,True) # compute Levenshtein distance
        distances[correct_promo] = dist # store each score for real promo codes
        if dist<min_dist:
            min_dist = dist # store min distance
    # extract all real promo codes with minimum Levenshtein distance
    nearest_correct_promos.append(','.join([i[0] for i in distances.items() if i[1]==min_dist])) 
    return ','.join(nearest_correct_promos) # return a comma separated string of nearest real promo codes

incorrectPromo = {}
count = 0
for extracted in extractedPromoList:
    print 'Computing %dth promo code...' % count
    incorrectPromo[extracted] =  find_min_edit(extracted,correctPromoList) # get comma separated str of real promo codes nearest to extracted
    count+=1
print incorrectPromo

Output

Computing 0th promo code...
Computing 1th promo code...
Computing 2th promo code...
{'abc': 'abc', 'abd': 'abx,aba,abz,abc', 'acd': 'abx,aba,abz,abc'}
Comments