Luis Ramon Ramirez Rodriguez - 4 months ago
Python Question

Compare similarity between names

I have to make a cross-validation for some data based on names.

The problem I'm facing is that depending on the source, names have slight variations, for example:

L & L AIR CONDITIONING vs L & L AIR CONDITIONING Service

BEST ROOFING vs ROOFING INC


I have several thousand records, so doing it manually would be very time-consuming. I want to automate the process as much as possible.

Since there are additional words, lowercasing the names wouldn't be enough.

What are good algorithms to handle this?

Maybe I could calculate a similarity score that gives low weight to words like 'INC' or 'Service'.

Edit:

I tried the difflib library:

import difflib

difflib.SequenceMatcher(None, name_1.lower(), name_2.lower()).ratio()


I'm getting a decent result with it.
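
A minimal sketch of how the difflib approach could be combined with the low-weight idea, by stripping filler words before comparing. The names NOISE_WORDS, normalize and name_ratio are made up for illustration, and the word list is an assumption to tune against your own data.

```python
import difflib
import re

# Hypothetical list of low-information tokens; extend it for your own data.
NOISE_WORDS = {"inc", "llc", "ltd", "co", "corp", "service", "services"}

def normalize(name):
    """Lowercase, split into alphanumeric tokens, and drop noise words."""
    tokens = re.findall(r"\w+", name.lower())
    return " ".join(t for t in tokens if t not in NOISE_WORDS)

def name_ratio(name_1, name_2):
    """SequenceMatcher ratio on the normalized names (0.0 to 1.0)."""
    matcher = difflib.SequenceMatcher(None, normalize(name_1), normalize(name_2))
    return matcher.ratio()

print(name_ratio("L & L AIR CONDITIONING", "L & L AIR CONDITIONING Service"))
print(name_ratio("BEST ROOFING", "ROOFING INC"))
```

With this word list the first pair normalizes to the same string and scores 1.0, while the second pair still differs by the word "best" and scores lower.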

Answer

I would use cosine similarity for this. It treats each name as a vector of word counts and gives a score from 0 (no words in common) to 1 (the same words in the same proportions).

Here is some code for it (I remember getting it from Stack Overflow some months ago, but I couldn't find the link just now):

import re, math
from collections import Counter

WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    # Dot product over the words the two vectors share.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the two vector lengths (Euclidean norms).
    sum1 = sum(v**2 for v in vec1.values())
    sum2 = sum(v**2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has zero length; treat that as no similarity.
    if not denominator:
        return 0.0
    return float(numerator) / denominator

def text_to_vector(text):
    return Counter(WORD.findall(text))

def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())

    return get_cosine(a, b)

get_similarity('L & L AIR CONDITIONING', 'L & L AIR CONDITIONING Service') # returns 0.9258200997725514
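
The score above is 6 / sqrt(6 * 7): the shared words contribute 2*2 + 1 + 1 = 6 to the dot product, and the extra word 'service' only makes the second vector longer. To give words like 'INC' or 'Service' low weight, as the question suggests, one option is to scale their counts down before taking the cosine. This is only a sketch: LOW_WEIGHT, weighted_vector and weighted_cosine are hypothetical names, and the weights are assumptions.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

# Hypothetical weights; any token not listed here counts with weight 1.0.
LOW_WEIGHT = {"inc": 0.1, "service": 0.1, "llc": 0.1}

def weighted_vector(text):
    """Map each token to its count scaled by its weight."""
    counts = Counter(WORD.findall(text.strip().lower()))
    return {tok: n * LOW_WEIGHT.get(tok, 1.0) for tok, n in counts.items()}

def weighted_cosine(a, b):
    """Cosine similarity between the weighted token vectors of a and b."""
    vec1, vec2 = weighted_vector(a), weighted_vector(b)
    numerator = sum(vec1[t] * vec2[t] for t in vec1.keys() & vec2.keys())
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0

print(weighted_cosine("L & L AIR CONDITIONING", "L & L AIR CONDITIONING Service"))
```

With 'service' weighted at 0.1, the same pair scores about 0.999 instead of 0.926.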

Another version that I found useful, which I wrote myself, is slightly more NLP-based.

import re, math
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet as wn

# Needs the NLTK data: nltk.download('stopwords') and nltk.download('wordnet')
stop = set(stopwords.words('english'))

WORD = re.compile(r'\w+')
stemmer = PorterStemmer()

def get_cosine(vec1, vec2):
    # Dot product over the words the two vectors share.
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the two vector lengths (Euclidean norms).
    sum1 = sum(v**2 for v in vec1.values())
    sum2 = sum(v**2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # An empty vector has zero length; treat that as no similarity.
    if not denominator:
        return 0.0
    return float(numerator) / denominator

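def text_to_vector(text):
    words = WORD.findall(text)
    # Expand each word with its WordNet synonyms (lemma names of every synset).
    a = []
    for i in words:
        for ss in wn.synsets(i):
            a.extend(ss.lemma_names())
    # Keep the original words too, including ones WordNet does not know.
    for i in words:
        if i not in a:
            a.append(i)
    # Deduplicate, drop stop words, and stem; every count ends up as 1
    # unless two different words share a stem.
    a = set(a)
    w = [stemmer.stem(i) for i in a if i not in stop]
    return Counter(w)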

def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())

    return get_cosine(a, b)

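def get_char_wise_similarity(a, b):
    # Despite the name, this averages get_similarity over every pair of
    # terms (one from each input) rather than comparing characters.
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    s = []

    for i in a:
        for j in b:
            s.append(get_similarity(str(i), str(j)))
    if not s:  # one of the inputs produced no terms
        return 0.0
    return sum(s) / float(len(s))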

get_similarity('I am a good boy', 'I am a very disciplined guy')
# Returns 0.5491201525567068

You can call either get_similarity or get_char_wise_similarity to see which works better for your use case. I used both: the normal similarity to weed out the really close matches, then the character-wise similarity to weed out the close-enough ones. The remaining ones had to be dealt with manually.
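
The two-pass workflow described above could be sketched roughly like this. The triage function and both cutoffs are hypothetical, and difflib is used as a stand-in scorer so the example runs without NLTK; in practice you would pass get_similarity as strict_score and get_char_wise_similarity as loose_score.

```python
import difflib

def ratio(a, b):
    """Stand-in scorer; swap in get_similarity or get_char_wise_similarity."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage(pairs, strict_score=ratio, loose_score=ratio,
           strict_cutoff=0.9, loose_cutoff=0.7):
    """Split name pairs into matched, probable, and manual-review lists.

    The cutoffs are assumptions; tune them on a hand-labelled sample.
    """
    matched, probable, manual = [], [], []
    for a, b in pairs:
        if strict_score(a, b) >= strict_cutoff:
            matched.append((a, b))      # first pass: really close
        elif loose_score(a, b) >= loose_cutoff:
            probable.append((a, b))     # second pass: close enough
        else:
            manual.append((a, b))       # left for a human to check
    return matched, probable, manual

pairs = [
    ("BEST ROOFING", "Best Roofing"),
    ("L & L AIR CONDITIONING", "L & L AIR CONDITIONING Service"),
    ("BEST ROOFING", "L & L AIR CONDITIONING"),
]
matched, probable, manual = triage(pairs)
print(len(matched), len(probable), len(manual))
```

Keeping the scorers as parameters makes it easy to try different similarity functions and cutoffs on the same data before settling on one.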