free_mind - 1 year ago 59
Python Question

# Check if portions from list A are in list B

I have two list, which we can call

`A`
and
`B`
. I need to check the items in list
`A`
and see if items in
`B`
starts with an item from
`A`
and then stop the check.

Example of content in A:

``````https://some/path
http://another/path
http://another.some/path
``````

Example of content in B:

``````http://another/path
http://this/wont/match/anything
``````

Currently I'm doing this:

``````def check_comps(self, comps):
for a in self.A:
for b in comps:
if b.startswith(a):
return a
``````

Is there a better way to do this?

Your solution has the worst-case O(nm) time complexity, that is O(n^2) if n ~ m. You can easily reduce it to O(n log(n)) and even O(log(n)). Here is how.

Consider a list of words (your `comps` attrubute) and a target (your `b`)

``````words = ['abdc', 'abd', 'acb', 'abcabc', 'abc']
target = "abcd"
``````

Observe, that by sorting the list of words in lexicographical order, you get a list of prefixes

``````prefixes = ['abc', 'abcabc', 'abd', 'abdc', 'acb']
``````

It is degenerate, because `prefixes[0]` is a prefix of `prefixes[1]`, hence everything that starts with `prefixes[1]` starts with `prefixes[0]` just as well. This is a bit problematic. Let's see why. Let's use the fast (binary) search to find the proper place of the target in the `prefix` list.

``````import bisect

bisect.bisect(prefixes, target)  #  -> 2
``````

This is because the `target` and `prefixes[1]` share a prefix, but the target is longer, hence lexicographically it should go after. Hence, if there is a prefix of the `target` in the `prefixes`, it should be to the left of index `2`. Obviously, the `target` doesn't start with `prefixes[1]` hence in the worst case we would have to search all the way to the left to find whether there is a prefix. Now observe, that if we transform these `prefixes` into a nondegenerate list, the only possible prefix of a target will always be to the left of the position returned by `bisect.bisect`. Let's reduce the list of prefixes and write a helper function that will check whether there is a prefix of a target.

``````from functools import reduce

def minimize_prefixes(prefixes):
"""
Note! `prefixes` must be sorted lexicographically !
"""
def accum_prefs(prefixes, prefix):
if not prefix.startswith(prefixes[-1]):
return prefixes.append(prefix) or prefixes
return prefixes
prefs_iter = iter(prefixes)
return reduce(accum_prefs, prefs_iter, [next(prefs_iter)]) if prefixes else []

def hasprefix(minimized_prefixes, target):
position = bisect.bisect(minimized_prefixes, target)
return target.startswith(minimized_prefixes[position-1]) if position else False
``````

Now let's see

``````min_prefixes = minimize_prefixes(prefixes)
print(min_prefixes)  # -> ['abc', 'abd', 'acb']
hasprefix(min_prefixes, target)  # -> True
``````

Let's make a test that must fail:

``````min_prefs_fail = ["abcde"]
hasprefix(min_prefs_fail, target)  # -> False
``````

This way you get O(n log(n)) search, which is asymptotically faster than your O(n^2) solution. Note! You can (and you really should) store the `minimize_prefixes(sorted(comps))` prefix set as an attribute in your object, making any prefix search O(log (n)), which is even more faster than what you have now.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download