fmark fmark - 24 days ago 15
Python Question

Test if lists share any items in python

I want to check if any of the items in one list are present in another list. I can do it simply with the code below, but I suspect there might be a library function to do this. If not, is there a more pythonic method of achieving the same result.

In [78]: a = [1, 2, 3, 4, 5]

In [79]: b = [8, 7, 6]

In [80]: c = [8, 7, 6, 5]

In [81]: def lists_overlap(a, b):
....: for i in a:
....: if i in b:
....: return True
....: return False
....:

In [82]: lists_overlap(a, b)
Out[82]: False

In [83]: lists_overlap(a, c)
Out[83]: True

In [84]: def lists_overlap2(a, b):
....: return len(set(a).intersection(set(b))) > 0
....:

Answer

Short answer: use set(a).isdisjoint(b), it's generally the fastest.

There are four common ways to test if two lists a and b share any items. The first option is to convert both to sets and check their intersection, as such:

bool(set(a) & set(b))

Because sets are stored using a hash table in Python, searching them is O(1) (see here for more information about complexity of operators in Python). Theoretically, this is O(n+m) on average for n and m objects in lists a and b. But 1) it must first create sets out of the lists, which can take a non-negligible amount of time, and 2) it supposes that hashing collisions are sparse among your data.

The second way to do it is using a generator expression performing iteration on the lists, such as:

any(i in a for i in b)

This allows to search in-place, so no new memory is allocated for intermediary variables. It also bails out on the first find. But the in operator is always O(n) on lists (see here).

Another proposed option is an hybridto iterate through one of the list, convert the other one in a set and test for membership on this set, like so:

a = set(a); any(i in a for i in b)

A fourth approach is to take advantage of the isdisjoint() method of the (frozen)sets (see here), for example:

set(a).isdisjoint(b)

If the elements you search are near the beginning of an array (e.g. it is sorted), the generator expression is favored, as the sets intersection method have to allocate new memory for the intermediary variables:

from timeit import timeit
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=list(range(1000))", number=100000)
26.077727576019242
>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=list(range(1000))", number=100000)
0.16220548999262974

Here's a graph of the execution time for this example in function of list size:

Element sharing test execution time when shared at the beginning

Note that both axes are logarithmic. This represents the best case for the generator expression. As can be seen, the isdisjoint() method is better for very small list sizes, whereas the generator expression is better for larger list sizes.

On the other hand, as the search begins with the beginning for the hybrid and generator expression, if the shared element are systematically at the end of the array (or both lists does not share any values), the disjoint and set intersection approaches are then way faster than the generator expression and the hybrid approach.

>>> timeit('any(i in a for i in b)', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
13.739536046981812
>>> timeit('bool(set(a) & set(b))', setup="a=list(range(1000));b=[x+998 for x in range(999,0,-1)]", number=1000))
0.08102107048034668

Element sharing test execution time when shared at the end

It is interesting to note that the generator expression is way slower for bigger list sizes. This is only for 1000 repetitions, instead of the 100000 for the previous figure. This setup also approximates well when when no elements are shared, and is the best case for the disjoint and set intersection approaches.

Here are two analysis using random numbers (instead of rigging the setup to favor one technique or another):

Element sharing test execution time for randomly generated data with high chance of sharing Element sharing test execution time for randomly generated data with high chance of sharing

High chance of sharing: elements are randomly taken from [1, 2*len(a)]. Low chance of sharing: elements are randomly taken from [1, 1000*len(a)].

Up to now, this analysis supposed both lists are of the same size. In case of two lists of different sizes, for example a is much smaller, isdisjoint() is always faster:

Element sharing test execution time on two differently sized lists when shared at the beginning Element sharing test execution time on two differently sized lists when shared at the end

Make sure that the a list is the smaller, otherwise the performance decreases. In this experiment, the a list size was set constant to 5.

In summary:

  • If the lists are very small (< 10 elements), set(a).isdisjoint(b) is always the fastest.
  • If the elements in the lists are sorted or have a regular structure that you can take advantage of, the generator expression any(i in a for i in b) is the fastest on large list sizes;
  • Test the set intersection with set(a).isdisjoint(b), which is always faster than bool(set(a) & set(b)).
  • The hybrid "iterate through list, test on set" a = set(a); any(i in a for i in b) is generally slower than other methods.
  • The generator expression and the hybrid are much slower than the two other approaches when it comes to lists without sharing elements.

In most cases, using the isdisjoint() method is the best approach as the generator expression will take much longer to execute, as it is very inefficient when no elements are shared.

Comments