Huita Huita - 3 months ago 24

Python difflib: sequence similarity above cutoff point, but no result on get_close_matches()

So I'm using difflib to find the same streets written in different formats. Here's one pair that really bugs me: '1-й Лихачевский переулок' and 'Переулок Лихачевский 1-й'.

I calculate the sequence similarity like this:

s = difflib.SequenceMatcher(None, "1-й Лихачевский переулок", "Переулок Лихачевский 1-й")
s.ratio()

This gives me a result of 0.5416666666666666. Good enough, eh? But okay, the default cutoff for get_close_matches() is 0.6, so I do this:

difflib.get_close_matches('1-й Лихачевский переулок', 'Переулок Лихачевский 1-й', cutoff=0.5)

No results! In fact, there are no results even if I set cutoff to 0.1.

What am I missing?

Answer

The second argument to get_close_matches() is a sequence of strings to match against, not an individual string. So, e.g., pass a list:

>>> difflib.get_close_matches('1-й Лихачевский переулок', ['Переулок Лихачевский 1-й'], cutoff=0.5)
['Переулок Лихачевский 1-й']

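As is, you passed a string, which is treated as a sequence of individual characters. Each one-character candidate is then compared against the whole 24-character street name, so it can share at most one character with it and its ratio is at most 2*1/(1+24) = 0.08. That is why nothing comes back even with cutoff=0.1.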
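
To see the difference concretely, here is a small sketch that runs both calls side by side and also scores each single character against the full street name. The variable names (query, candidate, as_string, as_list, best_char_ratio) are made up for illustration. It uses only difflib.get_close_matches() and difflib.SequenceMatcher from the standard library.

```python
import difflib

query = '1-й Лихачевский переулок'
candidate = 'Переулок Лихачевский 1-й'

# A bare string is iterated character by character, so every
# "possibility" is a single character and none reaches the cutoff.
as_string = difflib.get_close_matches(query, candidate, cutoff=0.1)

# Wrapping the string in a list makes it one whole candidate.
as_list = difflib.get_close_matches(query, [candidate], cutoff=0.5)

# Best score any single character of the candidate gets against the query.
best_char_ratio = max(
    difflib.SequenceMatcher(None, ch, query).ratio() for ch in candidate
)

print(as_string)        # []
print(as_list)          # ['Переулок Лихачевский 1-й']
print(best_char_ratio)  # stays below the 0.1 cutoff
```

If the candidates come from somewhere that may hand you a single string, it is probably worth wrapping it in a list (or checking isinstance(x, str)) before calling get_close_matches().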