pocketfullofcheese pocketfullofcheese - 9 months ago 614
Python Question

is it possible to do fuzzy match merge with python pandas?

I have two DataFrames which I want to merge based on a column. However, due to alternate spellings, different number of spaces, absence/presence of diacritical marks, I would like to be able to merge as long as they are similar to one another.

Any similarity algorithm will do (soundex, Levenshtein, difflib's).

Say one DataFrame has the following data:

df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])

one 1
two 2
three 3
four 4
five 5

df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

one a
too b
three c
fours d
five e

Then I want to get the resulting DataFrame

number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e


Similar to @locojay suggestion, you can apply difflib's get_closest_matches to df2's index and then apply a join:

In [23]: import difflib 

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e


If these were columns, in the same vein you could apply to the column then merge:

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])