user1140126 user1140126 - 7 months ago 7
Python Question

How to find a substring using partial matching

bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
smallString = "I_AM_HERE"


Which efficient algorithm should I use to find a substring of the "bigString" that matches closely to the "smallString"

output = "I_AM_THERE"


The output may have few insertions and deletions when compared with small string.

Edit:
Found a good example, very close to my problem here: How to add variable error to regex fuzzy search. Python

Answer

You can use the almost-ready-to-be-everyones-regex package with fuzzy matching:

>>> import regex
>>> bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
>>> regex.search('(?:I_AM_HERE){e<=1}',bigString).group(0)
'I_AM_THERE'

Or:

>>> bigString = "AGAH_I_AM_HERE_RGHFKXXX_I_AM_THERE_XXX_I_AM_NOWHERE_EREXXMHHGRFS"
>>> print(regex.findall('I_AM_(?:HERE){e<=3}',bigString))
['I_AM_HERE', 'I_AM_THERE', 'I_AM_NOWHERE']

The new regex module will (hopefully) be part of Python3.4

If you have pip, just type pip install regex or pip3 install regex until Python 3.4 is out (with regex part of it...)


Answer to comment Is there a way to know the best out of the three in your second example? How to use BESTMATCH flag here?

Either use the best match flag (?b) to get the single best match:

print(regex.search(r'(?b)I_AM_(?:ERE){e<=3}', bigString).group(0))
# I_AM_THE

Or combine with difflib or take a levenshtein distance with a list of all acceptable matches to the first literal:

import regex

def levenshtein(s1,s2):
    if len(s1) > len(s2):
        s1,s2 = s2,s1
    distances = range(len(s1) + 1)
    for index2,char2 in enumerate(s2):
        newDistances = [index2+1]
        for index1,char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1])
            else:
                newDistances.append(1 + min((distances[index1],
                                             distances[index1+1],
                                             newDistances[-1])))
        distances = newDistances
    return distances[-1]

bigString = "AGAH_I_AM_NOWHERE_HERE_RGHFKXXX_I_AM_THERE_XXX_I_AM_HERE_EREXXMHHGRFS"
cl=[(levenshtein(s,'I_AM_HERE'),s) for s in regex.findall('I_AM_(?:HERE){e<=3}',bigString)]

print(cl)
print([t[1] for t in sorted(cl, key=lambda t: t[0])])

print(regex.search(r'(?e)I_AM_(?:ERE){e<=3}', bigString).group(0))

Prints:

[(3, 'I_AM_NOWHERE'), (1, 'I_AM_THERE'), (0, 'I_AM_HERE')]
['I_AM_HERE', 'I_AM_THERE', 'I_AM_NOWHERE']
Comments