HonzaB - 1 year ago
Python Question

Slow fuzzy matching between two DataFrames

I have DataFrame A with cli id and origin:

cli id | origin
123 | 1234 M-MKT XYZklm 05/2016

And DataFrame B with shortcut and campaign:

shortcut | campaign
M-MKT | Mobile Marketing Outbound

I know that, for example, a client with origin
1234 M-MKT XYZklm 05/2016
is actually from campaign
Mobile Marketing Outbound
because it contains the key word
M-MKT

Note that the shortcut is a general key word on which the algorithm should base its decision. The origin can take other forms as well. I created the list of shortcuts manually by analyzing all the origins in the first place. I am also using regex to clean the origin before it is fed to the program.

I'd like to match each customer origin with a campaign through the shortcut, and attach a score so I can see the difference, as shown below:

cli id | shortcut | origin | campaign | Score
123 | M-MKT | 1234 M-MKT XYZklm 05/2016 | Mobile Marketing Outbound | 0.93
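For reference, the two input frames can be reconstructed like this (a minimal sketch; the column names cli_origin and Shortcut are taken from the code further down, while campaign is my assumption):

```python
import pandas as pd

# Minimal reconstruction of the two frames described above
df_cam = pd.DataFrame({
    'cli_id': [123],
    'cli_origin': ['1234 M-MKT XYZklm 05/2016'],
})
df_dict = pd.DataFrame({
    'Shortcut': ['M-MKT'],
    'campaign': ['Mobile Marketing Outbound'],
})
```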

Below is my program, which works but is really slow. DataFrame A has ~400,000 rows and DataFrame B has ~40 rows.

Is there a way I can make it faster?

from fuzzywuzzy import fuzz

list_values = df_dict['Shortcut'].values.tolist()

def TopFuzzMatch(tokenA, dict_, position, value):
    """Calculates similarity between two tokens and returns the TOP match and score.

    tokenA: similarity to this token will be calculated
    dict_: list with shortcuts
    position: whether I want the first, second, third... TOP position
    value: 0=similarity score, 1=associated shortcut
    """
    sim = [(fuzz.token_sort_ratio(x, tokenA), x) for x in dict_]
    sim.sort(key=lambda tup: tup[0], reverse=True)
    return sim[position][value]

df_cam['1st_choice_short'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,1), axis=1 )
df_cam['1st_choice_sim'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,0), axis=1 )

Note that I'd like to calculate also 2nd and 3rd best match to evaluate the accuracy.
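For the 2nd and 3rd best matches, fuzzywuzzy's process.extract(..., limit=3) returns the top matches in a single pass, so each row needs one scoring sweep instead of six. As a dependency-free illustration of the same idea, here is a sketch with a difflib stand-in for token_sort_ratio (the helper names are my own):

```python
import difflib

def token_sort_ratio(a, b):
    """Rough stand-in for fuzz.token_sort_ratio: sort tokens, then compare."""
    sa = ' '.join(sorted(a.lower().split()))
    sb = ' '.join(sorted(b.lower().split()))
    return int(round(100 * difflib.SequenceMatcher(None, sa, sb).ratio()))

def top_n_matches(token, shortcuts, n=3):
    """Score every shortcut once and return the n best (shortcut, score) pairs."""
    scored = sorted(((token_sort_ratio(token, s), s) for s in shortcuts), reverse=True)
    return [(s, score) for score, s in scored[:n]]
```

With this, one apply per row can fill all six columns (1st/2nd/3rd shortcut and score) instead of re-scoring the entire shortcut list once per column.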


I have found the process.extractOne method, but the speed remains the same.
So my code looks like this now:

from fuzzywuzzy import fuzz, process

def TopFuzzMatch(token, dict_, value):
    score = process.extractOne(token, dict_, scorer=fuzz.token_sort_ratio)
    return score[value]

Answer

I found a solution: after I clean the origin column with regex (no numbers or special characters), only a few hundred distinct values repeat, so I run the fuzzy matching just on those, which significantly improves the run time.

import re
import numpy as np
import pandas as pd
from fuzzywuzzy import process

def TopFuzzMatch(df_cam, df_dict):
    """Calculates similarity between two tokens and returns the TOP match.

    The idea is to do it only over the distinct values in the given DF
    (takes ages otherwise).
    df_cam: DataFrame with client id and origin
    df_dict: DataFrame with the abbreviation that is matched to the description I need
    """
    # Clean special characters and numbers
    df_cam['clean_camp'] = df_cam.apply(lambda x: re.sub('[^A-Za-z]+', '', x['origin']), axis=1)

    # Get unique values and calculate similarity
    uq_origin = np.unique(df_cam['clean_camp'].values.ravel())
    top_match = [process.extractOne(x, df_dict['Shortcut'])[0] for x in uq_origin]

    # To DataFrame
    df_match = pd.DataFrame({'unique': uq_origin})
    df_match['top_match'] = top_match

    df_cam = pd.merge(df_cam, df_match, how='left', left_on='clean_camp', right_on='unique')
    df_cam = pd.merge(df_cam, df_dict, how='left', left_on='top_match', right_on='Shortcut')

    return df_cam

df_out = TopFuzzMatch(df_cam, df_dict)
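The answer keeps only the matched shortcut, but the Score column from the question can be carried along the same way, since process.extractOne also returns the score (at index [1]). Below is a self-contained sketch of the same score-the-uniques-then-merge pattern with a pluggable scorer; the function and variable names are mine, and the toy scorer only illustrates the mechanics:

```python
import pandas as pd

def match_via_uniques(df_cam, shortcuts, scorer):
    """Score each distinct cleaned origin once, then merge the results back,
    keeping both the best shortcut and its score."""
    rows = []
    for origin in df_cam['clean_camp'].drop_duplicates():
        score, best = max((scorer(origin, s), s) for s in shortcuts)
        rows.append({'unique': origin, 'top_match': best, 'Score': score / 100})
    df_match = pd.DataFrame(rows)
    return df_cam.merge(df_match, how='left', left_on='clean_camp', right_on='unique')

# Toy scorer standing in for fuzz.token_sort_ratio
def contains_score(origin, shortcut):
    return 100 if shortcut.replace('-', '') in origin else 0
```

Because the merge is on the cleaned origin, every duplicate row gets its shortcut and score for free, which is exactly where the speedup comes from.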