Luis Ramon Ramirez Rodriguez - 6 months ago 41

Python Question

I have to cross-validate some data based on names.

The problem I'm facing is that depending on the source, names have slight variations, for example:

`L & L AIR CONDITIONING` vs `L & L AIR CONDITIONING Service`

`BEST ROOFING` vs `ROOFING INC`

I have several thousand records, so doing it manually would be very time-consuming. I want to automate the process as much as possible.

Because some names contain additional words, lowercasing them isn't enough.

Which algorithms are good for handling this?

Maybe calculate a similarity score that gives low weight to words like 'INC' or 'Service'?

Edit:

I tried the `difflib` library:

`difflib.SequenceMatcher(None,name_1.lower(),name_2.lower()).ratio()`

I'm getting a decent result with it.
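
For reference, here is a minimal sketch of that `difflib` approach applied to the two example pairs above. The helper name `name_ratio` is made up for illustration; the only library call is the `SequenceMatcher(...).ratio()` shown in the edit.

```python
import difflib


def name_ratio(name_1, name_2):
    """Case-insensitive character-level similarity in [0, 1]."""
    return difflib.SequenceMatcher(None, name_1.lower(), name_2.lower()).ratio()


pairs = [
    ("L & L AIR CONDITIONING", "L & L AIR CONDITIONING Service"),
    ("BEST ROOFING", "ROOFING INC"),
]
for a, b in pairs:
    # Higher means more similar; 1.0 means identical after lowercasing
    print(f"{a!r} vs {b!r}: {name_ratio(a, b):.3f}")
```

The first pair scores noticeably higher than the second, because one name is a prefix of the other while the second pair only shares the word "ROOFING".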

Answer

I would use cosine similarity for this. It gives you a score between 0 and 1 for how close two strings are, based on the words they share.

Here is some code for it (I remember getting it from Stack Overflow itself some months ago, but couldn't find the link now):

```
import math
import re
from collections import Counter

WORD = re.compile(r'\w+')


def get_cosine(vec1, vec2):
    # Dot product over the words that appear in both vectors
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Product of the two vector lengths
    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    return float(numerator) / denominator


def text_to_vector(text):
    # Word -> number of occurrences
    return Counter(WORD.findall(text))


def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    return get_cosine(a, b)


get_similarity('L & L AIR CONDITIONING', 'L & L AIR CONDITIONING Service')  # returns 0.9258200997725514
```
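
The question's idea of giving low weight to words like 'INC' or 'Service' can be layered on top of the same cosine formula. The following is only a sketch under assumptions: the `LOW_WEIGHT` table, its 0.1 values, and the function names are hypothetical and would need tuning against real data.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

# Hypothetical weights: generic business words count for much less
# than every other token, which keeps the default weight of 1.0.
LOW_WEIGHT = {"inc": 0.1, "llc": 0.1, "service": 0.1, "services": 0.1}


def weighted_vector(text):
    """Map each word to its count scaled by its weight."""
    counts = Counter(WORD.findall(text.strip().lower()))
    return {w: c * LOW_WEIGHT.get(w, 1.0) for w, c in counts.items()}


def weighted_cosine(a, b):
    """Cosine similarity between the weighted word vectors of a and b."""
    v1, v2 = weighted_vector(a), weighted_vector(b)
    numerator = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


print(weighted_cosine("L & L AIR CONDITIONING", "L & L AIR CONDITIONING Service"))
print(weighted_cosine("BEST ROOFING", "ROOFING INC"))
```

With these weights, an extra 'Service' or 'INC' barely lowers the score, so both example pairs rate higher than they do with plain word counts.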

Another version that I found useful, and wrote myself, is slightly more NLP-based: it expands each word with its WordNet synonyms, drops stop words, and stems what is left.

```
import math
import re
from collections import Counter

# Requires the NLTK 'stopwords' and 'wordnet' corpora (see nltk.download)
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem.porter import PorterStemmer

stop = set(stopwords.words('english'))
WORD = re.compile(r'\w+')
stemmer = PorterStemmer()


def get_cosine(vec1, vec2):
    # Dot product over the words that appear in both vectors
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    return float(numerator) / denominator


def text_to_vector(text):
    words = WORD.findall(text)
    a = []
    # Expand every word with the lemma names of all its WordNet synsets
    for i in words:
        for ss in wn.synsets(i):
            a.extend(ss.lemma_names())
    # Keep the original words that WordNet did not return
    for i in words:
        if i not in a:
            a.append(i)
    a = set(a)
    # Drop stop words, then stem what is left
    w = [stemmer.stem(i) for i in a if i not in stop]
    return Counter(w)


def get_similarity(a, b):
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    return get_cosine(a, b)


def get_char_wise_similarity(a, b):
    # Despite the name, this compares the two inputs token by token and
    # averages the similarity over all token pairs
    a = text_to_vector(a.strip().lower())
    b = text_to_vector(b.strip().lower())
    s = []
    for i in a:
        for j in b:
            s.append(get_similarity(str(i), str(j)))
    if not s:  # one of the inputs produced no tokens
        return 0
    return sum(s) / float(len(s))


get_similarity('I am a good boy', 'I am a very disciplined guy')
# Returns 0.5491201525567068
```

You can call either `get_similarity` or `get_char_wise_similarity` to see which works better for your use case. I used both: normal similarity to weed out the really close ones, then character-wise similarity to weed out the close-enough ones. The remaining ones had to be dealt with manually.
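
That two-pass workflow could be sketched roughly as follows. Everything here is an assumption made for illustration: the `triage` helper, the 0.9 and 0.6 thresholds, and the two stand-in scorers (used only so the sketch runs without NLTK). In practice you would pass in `get_similarity` and `get_char_wise_similarity` and pick thresholds by inspecting your own data.

```python
import difflib


def triage(pairs, first_pass, second_pass, high=0.9, low=0.6):
    """Split candidate pairs into (matched, probable, manual) buckets."""
    matched, probable, manual = [], [], []
    for a, b in pairs:
        if first_pass(a, b) >= high:
            matched.append((a, b))    # really close: accept automatically
        elif second_pass(a, b) >= low:
            probable.append((a, b))   # close enough: accept or spot-check
        else:
            manual.append((a, b))     # left for manual review
    return matched, probable, manual


# Stand-in scorers (hypothetical), used only to keep the sketch self-contained
def token_overlap(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def char_ratio(a, b):
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


pairs = [
    ("ACME PLUMBING", "acme plumbing"),
    ("L & L AIR CONDITIONING", "L & L AIR CONDITIONING Service"),
    ("BEST ROOFING", "SUNRISE BAKERY"),
]
matched, probable, manual = triage(pairs, token_overlap, char_ratio)
print(len(matched), len(probable), len(manual))
```

Keeping the scorers as parameters makes it easy to swap in different similarity functions without touching the bucketing logic.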