Geraint Ballinger Geraint Ballinger - 5 months ago 46
Python Question

Python fuzzy element checking

I have a Python 2.7 set object that includes the names of data categories, and I would like to be able to do some form of fuzzy element checking to see if part of a user given input is an element of the set.

Here is a toy example, to explain what I would like. Given the following set and user input:

SET = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
user_input = 'yellow ball'


I would like the program to print out something like the following:

'yellow_ball' not found, did you mean 'red_ball', or 'green_ball'?


So far I have the following:

import re

SET = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
user_input = 'yellow ball'

# all members of my set are lowercase and separated by an underscore
user_input_list = user_input.lower().split() # for use in fuzzy search
user_input = "_".join(user_input_list) # convert to yellow_ball for element check
regex = None
matches = []

if user_input not in SET:
# FUZZY ELEMENT CHECK
for item in user_input_list:
regex = re.compile(item)
for element in SET:
if regex.match(element):
matches.append(element)

if len(matches) > 0:
print '\'%s\' not found, did you mean %s' % (user_input, ", ".join(['\'' + x + '\'' for x in matches]))
else:
print '\'%s\' not found.' % user_input


Is there a more efficient way of doing this, perhaps that uses third party libraries?

Thanks for your help,
Geraint

Answer Source

If you're interested in 3rd party modules, there's a nice little module I like to use for this sort of thing called fuzzywuzzy, for fuzzy string matching in Python.

This module performs fuzzy string matching with just a couple of lines of code.

Here's an example of how you can use it:

>>> from fuzzywuzzy import process
>>> choices = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
>>> query = 'yellow ball'

We've set up our choices and input, now we can retrieve the closest matches.

>>> process.extract(query, choices)
[('red_ball', 53), ('green_ball', 48), ('red_cup', 13), ('green_cup', 10)]

This returns all choices in descending order of closeness of match. The distance between strings is computed using the Levenshtein Distance metric. You can extract the top n items and propose them as valid alternatives if the original input isn't present in the choices set.

If you want only the top match, just do this:

>>> process.extractOne(query, choices)
('red_ball', 53)

You can peruse more examples using fuzzy string matching here.