free_mind - 1 year ago 58

Python Question

I have two list, which we can call

`A`

`B`

`A`

`B`

`A`

Example of content in A:

`https://some/path`

http://another/path

http://another.some/path

Example of content in B:

`http://another/path`

http://this/wont/match/anything

Currently I'm doing this:

`def check_comps(self, comps):`

for a in self.A:

for b in comps:

if b.startswith(a):

return a

Is there a better way to do this?

Answer Source

Your solution has the worst-case O(nm) time complexity, that is O(n^2) if n ~ m. You can easily reduce it to O(n log(n)) and even O(log(n)). Here is how.

Consider a list of words (your `comps`

attrubute) and a target (your `b`

)

```
words = ['abdc', 'abd', 'acb', 'abcabc', 'abc']
target = "abcd"
```

Observe, that by sorting the list of words in lexicographical order, you get a list of prefixes

```
prefixes = ['abc', 'abcabc', 'abd', 'abdc', 'acb']
```

It is degenerate, because `prefixes[0]`

is a prefix of `prefixes[1]`

, hence everything that starts with `prefixes[1]`

starts with `prefixes[0]`

just as well. This is a bit problematic. Let's see why. Let's use the fast (binary) search to find the proper place of the target in the `prefix`

list.

```
import bisect
bisect.bisect(prefixes, target) # -> 2
```

This is because the `target`

and `prefixes[1]`

share a prefix, but `target[3] > prefixes[1][3]`

, hence lexicographically it should go after. Hence, if there is a prefix of the `target`

in the `prefixes`

, it should be to the left of index `2`

. Obviously, the `target`

doesn't start with `prefixes[1]`

hence in the worst case we would have to search all the way to the left to find whether there is a prefix. Now observe, that if we transform these `prefixes`

into a nondegenerate list, the only possible prefix of a target will always be to the left of the position returned by `bisect.bisect`

. Let's reduce the list of prefixes and write a helper function that will check whether there is a prefix of a target.

```
from functools import reduce
def minimize_prefixes(prefixes):
"""
Note! `prefixes` must be sorted lexicographically !
"""
def accum_prefs(prefixes, prefix):
if not prefix.startswith(prefixes[-1]):
return prefixes.append(prefix) or prefixes
return prefixes
prefs_iter = iter(prefixes)
return reduce(accum_prefs, prefs_iter, [next(prefs_iter)]) if prefixes else []
def hasprefix(minimized_prefixes, target):
position = bisect.bisect(minimized_prefixes, target)
return target.startswith(minimized_prefixes[position-1]) if position else False
```

Now let's see

```
min_prefixes = minimize_prefixes(prefixes)
print(min_prefixes) # -> ['abc', 'abd', 'acb']
hasprefix(min_prefixes, target) # -> True
```

Let's make a test that must fail:

```
min_prefs_fail = ["abcde"]
hasprefix(min_prefs_fail, target) # -> False
```

This way you get O(n log(n)) search, which is asymptotically faster than your O(n^2) solution. Note! You can (and you really should) store the `minimize_prefixes(sorted(comps))`

prefix set as an attribute in your object, making any prefix search O(log (n)), which is even more faster than what you have now.