
Slow fuzzy matching between two DataFrames

I have DataFrame A (df_cam) with cli id and origin:

cli id | origin
------------------------------------
123 | 1234 M-MKT XYZklm 05/2016


And DataFrame B (df_dict) with shortcut and campaign:

shortcut | campaign
------------------------------------
M-MKT | Mobile Marketing Outbound


I know that, for example, a client with origin 1234 M-MKT XYZklm 05/2016 is actually from campaign Mobile Marketing Outbound, because the origin contains the keyword M-MKT.

Note that the shortcut is a general keyword on which the algorithm should base its decision. The origin could just as well be M-Marketing, MMKT or Mob-MKT. I created the list of shortcuts manually by analyzing all the origins in the first place. I am also using a regex to clean the origin before it is fed to the program, as sketched below.
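A minimal sketch of that cleaning step (the same letters-only pattern as in the answer below):

import re

# Keep letters only: '1234 M-MKT XYZklm 05/2016' -> 'MMKTXYZklm'
clean_origin = re.sub('[^A-Za-z]+', '', '1234 M-MKT XYZklm 05/2016')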


I'd like to match the customer origin with the campaign through the shortcut and attach a score to see the difference, as shown below:

cli id | shortcut | origin | campaign | Score
---------------------------------------------------------------------------------
123 | M-MKT | 1234 M-MKT XYZklm 05/2016 | Mobile Marketing Outbound | 0.93


Below is my program, which works but is really slow. DataFrame A has ~400,000 rows and DataFrame B has ~40 rows.

Is there a way I can make it faster?

from fuzzywuzzy import fuzz

list_values = df_dict['Shortcut'].values.tolist()

def TopFuzzMatch(tokenA, dict_, position, value):
    """
    Calculates similarity between two tokens and returns the TOP match and score
    -----------------------------------------------------------------------
    tokenA: similarity to this token will be calculated
    dict_: list with shortcuts
    position: whether I want the first, second, third... TOP position
    value: 0=similarity score, 1=associated shortcut
    -----------------------------------------------------------------------
    """
    sim = [(fuzz.token_sort_ratio(x, tokenA), x) for x in dict_]
    sim.sort(key=lambda tup: tup[0], reverse=True)
    return sim[position][value]

# Note: each apply() recomputes the full ranking, so the scoring runs twice per row
df_cam['1st_choice_short'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'], list_values, 0, 1), axis=1)
df_cam['1st_choice_sim'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'], list_values, 0, 0), axis=1)


Note that I'd also like to calculate the 2nd and 3rd best matches to evaluate the accuracy.
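A sketch of how that could look with process.extract, which returns the top matches in one pass instead of re-scoring the list for every position (limit=3 is my assumption for first/second/third):

from fuzzywuzzy import fuzz, process

# Returns [(shortcut, score), ...] for the three best matches, best first
process.extract('1234 M-MKT XYZklm 05/2016', list_values,
                scorer=fuzz.token_sort_ratio, limit=3)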

EDIT

I have found the process.extractOne method, but the speed remains the same. So my code looks like this now:

from fuzzywuzzy import fuzz, process

def TopFuzzMatch(token, dict_, value):
    # extractOne scans all choices and returns the single best (match, score) tuple
    score = process.extractOne(token, dict_, scorer=fuzz.token_sort_ratio)
    return score[value]
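Called the same way as before (my assumption of the call site, matching the columns above); every row still scans all ~40 shortcuts, which is presumably why extractOne alone doesn't help:

# extractOne returns (best_shortcut, score), so value 0 is the shortcut, 1 the score
df_cam['1st_choice_short'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'], list_values, 0), axis=1)
df_cam['1st_choice_sim'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'], list_values, 1), axis=1)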

Answer

I found a solution: after I clean the origin column with a regex (removing numbers and special characters), only a few hundred distinct values repeat, so I run the fuzzy matching just on those, which significantly improves the run time.

import re

import numpy as np
import pandas as pd
from fuzzywuzzy import process

def TopFuzzMatch(df_cam, df_dict):
    """
    Calculates similarity between two tokens and returns the TOP match
    The idea is to do it only over distinct values in the given DF (takes ages otherwise)
    -----------------------------------------------------------------------
    df_cam: DataFrame with client id and origin
    df_dict: DataFrame with abbreviation which is matched with the description I need
    -----------------------------------------------------------------------
    """
    #Clean special characters and numbers
    df_cam['clean_camp'] = df_cam.apply(lambda x: re.sub('[^A-Za-z]+', '', x['origin']), axis=1)

    #Get unique values and calculate similarity
    uq_origin = np.unique(df_cam['clean_camp'].values.ravel())
    top_match = [process.extractOne(x, df_dict['Shortcut'])[0] for x in uq_origin]

    #To DataFrame
    df_match = pd.DataFrame({'unique': uq_origin})
    df_match['top_match'] = top_match

    #Merge
    df_cam = pd.merge(df_cam, df_match, how = 'left', left_on = 'clean_camp', right_on = 'unique')
    df_cam = pd.merge(df_cam, df_dict, how = 'left', left_on = 'top_match', right_on = 'Shortcut')

    return df_cam

df_out = TopFuzzMatch(df_cam, df_dict)
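The same deduplication idea extends to keeping the score and the 2nd/3rd candidates as well; an untested sketch (the column names here are my assumptions):

# process.extract returns [(shortcut, score), ...] ordered best-first
uq_origin = np.unique(df_cam['clean_camp'].values.ravel())

records = []
for x in uq_origin:
    m = process.extract(x, df_dict['Shortcut'].tolist(), limit=3)
    records.append((x, m[0][0], m[0][1], m[1][0], m[2][0]))

df_match = pd.DataFrame(records,
                        columns=['unique', 'top_match', 'top_score', '2nd_match', '3rd_match'])
df_cam = pd.merge(df_cam, df_match, how='left', left_on='clean_camp', right_on='unique')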