Kvothe Kvothe - 2 months ago 28
Python Question

Python Fuzzy Matching (FuzzyWuzzy) - Keep only Best Match

I'm trying to fuzzy match two csv files, each containing one column of names, that are similar but not the same.

My code so far is as follows:

import pandas as pd
from pandas import DataFrame
from fuzzywuzzy import process
import csv

save_file = open('fuzzy_match_results.csv', 'w')
writer = csv.writer(save_file, lineterminator = '\n')

def parse_csv(path):

with open(path,'r') as f:
reader = csv.reader(f, delimiter=',')
for row in reader:
yield row


if __name__ == "__main__":
## Create lookup dictionary by parsing the products csv
data = {}
for row in parse_csv('names_1.csv'):
data[row[0]] = row[0]

## For each row in the lookup compute the partial ratio
for row in parse_csv("names_2.csv"):
#print(process.extract(row,data, limit = 100))
for found, score, matchrow in process.extract(row, data, limit=100):
if score >= 60:
print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
Digi_Results = [row, score, found]
writer.writerow(Digi_Results)


save_file.close()


The output is as follows:

Name11 , 90 , Name25
Name11 , 85 , Name24
Name11 , 65 , Name29


The script works fine. The output is as expected.
But what I am looking for is only the best match.

Name11 , 90 , Name25
Name12 , 95 , Name21
Name13 , 98 , Name22


So I need to somehow drop the duplicated names in column 1, based on the highest value in column 2.
It should be fairly straightforward, but I can't seem to figure it out.
Any help would be appreciated.

Answer

fuzzywuzzy's process.extract() returns the list in reverse sorted order , with the best match coming first.

so to find just the best match, you can set the limit argument as 1 , so that it only returns the best match, and if that is greater than 60 , you can write it to the csv, like you are doing now.

Example -

## For each row in the lookup compute the partial ratio
for row in parse_csv("names_2.csv"):

    for found, score, matchrow in process.extract(row, data, limit=1):
        if score >= 60:
            print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
            Digi_Results = [row, score, found]
            writer.writerow(Digi_Results)
Comments