free_mind free_mind - 2 months ago 4x
Python Question

Check if portions from list A are in list B

I have two list, which we can call

. I need to check the items in list
and see if items in
starts with an item from
and then stop the check.

Example of content in A:


Example of content in B:


Currently I'm doing this:

def check_comps(self, comps):
for a in self.A:
for b in comps:
if b.startswith(a):
return a

Is there a better way to do this?


Your solution has the worst-case O(nm) time complexity, that is O(n^2) if n ~ m. You can easily reduce it to O(n log(n)) and even O(log(n)). Here is how.

Consider a list of words (your comps attrubute) and a target (your b)

words = ['abdc', 'abd', 'acb', 'abcabc', 'abc']
target = "abcd"

Observe, that by sorting the list of words in lexicographical order, you get a list of prefixes

prefixes = ['abc', 'abcabc', 'abd', 'abdc', 'acb']

It is degenerate, because prefixes[0] is a prefix of prefixes[1], hence everything that starts with prefixes[1] starts with prefixes[0] just as well. This is a bit problematic. Let's see why. Let's use the fast (binary) search to find the proper place of the target in the prefix list.

import bisect

bisect.bisect(prefixes, target)  #  -> 2

This is because the target and prefixes[1] share a prefix, but the target is longer, hence lexicographically it should go after. Hence, if there is a prefix of the target in the prefixes, it should be to the left of index 2. Obviously, the target doesn't start with prefixes[1] hence in the worst case we would have to search all the way to the left to find whether there is a prefix. Now observe, that if we transform these prefixes into a nondegenerate list, the only possible prefix of a target will always be to the left of the position returned by bisect.bisect. Let's reduce the list of prefixes and write a helper function that will check whether there is a prefix of a target.

from functools import reduce

def minimize_prefixes(prefixes):
    Note! `prefixes` must be sorted lexicographically !
    def accum_prefs(prefixes, prefix):
        if not prefix.startswith(prefixes[-1]):
            return prefixes.append(prefix) or prefixes
        return prefixes
    prefs_iter = iter(prefixes)
    return reduce(accum_prefs, prefs_iter, [next(prefs_iter)]) if prefixes else []

def hasprefix(minimized_prefixes, target):
    position = bisect.bisect(minimized_prefixes, target)
    return target.startswith(minimized_prefixes[position-1]) if position else False

Now let's see

min_prefixes = minimize_prefixes(prefixes)
print(min_prefixes)  # -> ['abc', 'abd', 'acb']
hasprefix(min_prefixes, target)  # -> True

Let's make a test that must fail:

min_prefs_fail = ["abcde"]
hasprefix(min_prefs_fail, target)  # -> False

This way you get O(n log(n)) search, which is asymptotically faster than your O(n^2) solution. Note! You can (and you really should) store the minimize_prefixes(sorted(comps)) prefix set as an attribute in your object, making any prefix search O(log (n)), which is even more faster than what you have now.