 Elisha512 -4 years ago 136
Python Question

# How to find the longest common substring of multiple strings?

I am writing a python script where I have multiple strings.

For example:

``````x = "brownasdfoersjumps"
y = "foxsxzxasis12sa[[#brown"
z = "thissasbrownxc-34a@s;"
``````

In all these three strings, they have one sub string in common which is
`brown`
. I want to search it in a way that I want to create a dictionary as:

``````dict = {[commonly occuring substring] =>
[total number of occurrences in the strings provided]}
``````

What would be the best way of doing that? Considering that I will have more than 200 strings each time, what would be an easy/efficient way of doing it? Eli Korvigo

This is a relatively optimised naïve algorithm. You first transform each sequence into a set of all its ngrams. Then you intersect all sets and find the longest ngram in the intersection.

``````from functools import partial, reduce
from itertools import chain
from typing import Iterator

def ngram(seq: str, n: int) -> Iterator[str]:
return (seq[i: i+n] for i in range(0, len(seq)-n+1))

def allngram(seq: str) -> Iterator[str]:
lengths = range(len(seq))
ngrams = map(partial(ngram, seq), lengths)
return set(chain.from_iterable(ngrams))

sequences = ["brownasdfoersjumps",
"foxsxzxasis12sa[[#brown",
"thissasbrownxc-34a@s;"]

seqs_ngrams = map(allngram, sequences)
intersection = reduce(set.intersection, seqs_ngrams)
longest = max(intersection, key=len) # -> brown
``````

While this might get you through short sequences, this algorithm is extremely inefficient on long sequences. If your sequences are long, you can add a heuristic to limit the largest possible ngram length (i.e. the longest possible common substring). One obvious value for such a heuristic may be the shortest sequence's length.

``````def allngram(seq: str, minn=1, maxn=None) -> Iterator[str]:
lengths = range(minn, maxn) if maxn else range(minn, len(seq))
ngrams = map(partial(ngram, seq), lengths)
return set(chain.from_iterable(ngrams))

sequences = ["brownasdfoersjumps",
"foxsxzxasis12sa[[#brown",
"thissasbrownxc-34a@s;"]

maxn = min(map(len, sequences))
seqs_ngrams = map(partial(allngram, maxn=maxn), sequences)
intersection = reduce(set.intersection, seqs_ngrams)
longest = max(intersection, key=len)  # -> brown
``````

This may still take too long (or make your machine run out of RAM), so you might want to read about some optimal algorithms (see the link I left in my comment to your question).

Update

To count the number of strings wherein each ngram occurs

``````from collections import Counter
sequences = ["brownasdfoersjumps",
"foxsxzxasis12sa[[#brown",
"thissasbrownxc-34a@s;"]

seqs_ngrams = map(allngram, sequences)
counts = Counter(chain.from_iterable(seqs_ngrams))
``````

`Counter` is a subclass of `dict`, so its instances have similar interfaces:

``````print(counts)
Counter({'#': 1,
'#b': 1,
'#br': 1,
'#bro': 1,
'#brow': 1,
'#brown': 1,
'-': 1,
'-3': 1,
'-34': 1,
'-34a': 1,
'-34a@': 1,
'-34a@s': 1,
'-34a@s;': 1,
...
``````

You can filter the counts to leave substrings occurring in at least `n` strings: `{string: count for string, count in counts.items() if count >= n}`

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download