neversaint - 4 months ago
Python Question

Find overlapping and non-overlapping part of two strings

I have two examples of pairs of strings:

YHFLSPYVY     # answer
   LSPYVYSPR  # prediction
+++******ooo


  YHFLSPYVS  # answer
VEYHFLSPY    # prediction
oo*******++


As shown above, I'd like to find the overlapping region (*) and the non-overlapping regions in the answer (+) and the prediction (o).

How can I do it in Python?

I'm stuck with this attempt:

import re
# This is example 1
ans = "YHFLSPYVY"
pred = "LSPYVYSPR"
matches = re.finditer(r'(?=(%s))' % re.escape(pred), ans)
print([m.start(1) for m in matches])
# []


The answer I hope to get for example 1 is:

plus_len = 3
star_len = 6
ooo_len = 3

Answer

It's easy with difflib.SequenceMatcher.find_longest_match:

from difflib import SequenceMatcher

def f(answer, prediction):
    # Find the longest block of characters shared by both strings.
    sm = SequenceMatcher(a=answer, b=prediction)
    match = sm.find_longest_match(0, len(answer), 0, len(prediction))
    star_len = match.size
    # Whatever lies outside the shared block is the non-overlapping part.
    plus_len = len(answer) - match.size
    ooo_len = len(prediction) - match.size
    return (plus_len, star_len, ooo_len)

f('YHFLSPYVY', 'LSPYVYSPR') # (3, 6, 3)
f('YHFLSPYVS', 'VEYHFLSPY') # (2, 7, 2)
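
Note that find_longest_match returns the longest shared block anywhere in the two strings, not necessarily one that sits at their ends. If the overlap is always expected to be the tail of one string meeting the head of the other, as in both examples from the question, something like the sketch below might be a closer fit. The helper names (end_overlap, overlap_lengths, overlap_markers) are made up for illustration, and the code assumes a plain suffix/prefix overlap with no mismatches inside it.

```python
def end_overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def overlap_lengths(answer, prediction):
    """Return (plus_len, star_len, ooo_len), trying both string orders."""
    size = max(end_overlap(answer, prediction), end_overlap(prediction, answer))
    return (len(answer) - size, size, len(prediction) - size)


def overlap_markers(answer, prediction):
    """Return the marker line from the question, e.g. '+++******ooo'."""
    fwd = end_overlap(answer, prediction)  # answer sits on the left
    rev = end_overlap(prediction, answer)  # prediction sits on the left
    if fwd >= rev:
        return '+' * (len(answer) - fwd) + '*' * fwd + 'o' * (len(prediction) - fwd)
    return 'o' * (len(prediction) - rev) + '*' * rev + '+' * (len(answer) - rev)


print(overlap_lengths('YHFLSPYVY', 'LSPYVYSPR'))  # (3, 6, 3)
print(overlap_markers('YHFLSPYVS', 'VEYHFLSPY'))  # oo*******++
```

For the two pairs in the question this gives the same lengths as the difflib version. The results differ only when the longest shared block is in the middle of a string rather than at an end.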