Frankie Frankie - 2 months ago 13
Python Question

Find all common contiguous substrings of two strings in Python

I have two strings and I want to find all the common words. For example,

s1 = 'Today is a good day, it is a good idea to have a walk.'

s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'


Consider matching s1 against s2, ignoring case.

'Today is' matches 'today is', but 'Today is a' does not appear anywhere in s2. Therefore, 'Today is' is one of the common runs of consecutive characters. Similarly, we have 'a good day', 'is', 'a good', 'have a walk'. So the common words are

common = ['today is', 'a good day', 'is', 'a good', 'have a walk']


Can we use a regular expression to do that?

Thank you very much.

Answer Source

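I took the code from "Find common substring between two strings" as a starting point.

I modified a few lines and added a few more. The modification is that answer now defaults to "NULL", which is returned when no common substring is found. Because "NULL" is four characters long and a match must be longer than the current answer to replace it, only common substrings of five or more characters are reported.

I also added a loop that keeps searching until the function returns "NULL" and stores each result in a list.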

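def longestSubstringFinder(string1, string2):
    # "NULL" is the "nothing found" sentinel; being 4 characters long,
    # it also means only matches of 5+ characters can replace it.
    answer = "NULL"
    len1, len2 = len(string1), len(string2)
    # i is the offset of string1 relative to string2: string1[i + j] is
    # compared with string2[j], so only alignments with i >= 0 are checked.
    for i in range(len1):
        match = ""
        for j in range(len2):
            if (i + j < len1 and string1[i + j] == string2[j]):
                match += string2[j]
            else:
                if (len(match) > len(answer)): answer = match
                match = ""
        # Record a match that runs to the end of string2
        if (len(match) > len(answer)): answer = match
    return answer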


mylist = []

def call():
    s1 = 'Today is a good day, it is a good idea to have a walk.'
    s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'
    # Compare case-insensitively
    s1 = s1.lower()
    s2 = s2.lower()
    x = longestSubstringFinder(s2, s1)
    # Keep extracting the longest remaining match until none is left
    while x != "NULL":
        print(x)
        mylist.append(x)
        # Blank out the match in s2 so it is not found again
        s2 = s2.replace(x, ' ')
        x = longestSubstringFinder(s2, s1)


call()
print('[%s]' % ','.join(map(str, mylist)))

Output

[ a good day, , have a walk,today is , good]
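One caveat about longestSubstringFinder: it only compares string1[i + j] with string2[j], so a match that starts earlier in string1 than in string2 is never seen. That does not matter for the example here, but it can for other inputs. Below is a minimal sketch of a dynamic-programming variant that checks every alignment. The name longest_common_substring is my own, not part of the referenced answer, and it returns an empty string instead of "NULL" when nothing matches.

```python
def longest_common_substring(a, b):
    """Return the longest substring shared by a and b ('' if there is none)."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match inside a
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one character earlier in both strings
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


s1 = 'Today is a good day, it is a good idea to have a walk.'.lower()
s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'.lower()
first = longest_common_substring(s2, s1)
print(repr(first))
```

To use it in the loop from call(), the stop condition would have to become a length check (for example, stop when the match is shorter than five characters) rather than a comparison with "NULL".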

Difference from your expected output

common = ['today is', 'a good day', 'is', 'a good', 'have a walk']

Your expectation of a second 'is' is wrong: as you can see, s2 contains only one 'is', and it is already used by 'today is'.
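Since the question asks about common words and about regular expressions, here is a rough word-level sketch of the same idea. It assumes words can be extracted with the pattern \w+ (so punctuation and spacing are dropped) and that each word of s2 may be used only once, as in the loop above. The helper names longest_common_run and common_phrases are made up for illustration.

```python
import re


def longest_common_run(a, b):
    """Return (start_in_a, length) of the longest run of items shared by lists a and b."""
    best_start, best_len = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_start = i - best_len
        prev = curr
    return best_start, best_len


def common_phrases(s1, s2):
    """Collect common word runs, longest first, using each word of s2 at most once."""
    words1 = re.findall(r"\w+", s1.lower())
    fragments = [re.findall(r"\w+", s2.lower())]
    phrases = []
    while True:
        # Find the fragment of s2 that holds the longest run also present in s1
        best_frag, best_start, best_len = None, 0, 0
        for frag in fragments:
            start, length = longest_common_run(frag, words1)
            if length > best_len:
                best_frag, best_start, best_len = frag, start, length
        if best_frag is None:
            return phrases
        phrases.append(" ".join(best_frag[best_start:best_start + best_len]))
        # Split the fragment around the match so its words are not reused
        fragments.remove(best_frag)
        fragments += [best_frag[:best_start], best_frag[best_start + best_len:]]


s1 = 'Today is a good day, it is a good idea to have a walk.'
s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'
result = common_phrases(s1, s2)
print(result)  # ['a good day', 'have a walk', 'today is', 'good']
```

Unlike the character-level version, this one has no minimum length, so single common words such as 'good' are reported too, and the stray spaces and commas disappear because the regular expression keeps only the words.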