Silenus Silenus - 21 days ago 5
Python Question

Filtering Lists of Tuples by Elements of Tuples

I am working in Python (2.7.9) and am trying to filter a list of tuples by a list of elements of those tuples. In particular, my objects have the following form:

tuples = [('a', ['a1', 'a2']), ('b',['b1', 'b2']), ('c',['c1', 'c2'])]
filter = ['a', 'c']


I am new to Python and the easiest way to filter the tuples that I could discover was with the following list comprehension:

tuples_filtered = [(x,y) for (x,y) in tuples if x in filter]


The resulting filtered list looks like:

tuples_filtered = [('a', ['a1', 'a2']), ('c',['c1', 'c2'])]


Unfortunately, this list comprehension seems to be very inefficient. I suspect this is because my list of tuples is much larger than my filter, the list of strings. In particular, the filter list contains 30,000 words and the list of tuples contains about 134,000 2-tuples.

The first elements of the 2-tuples are largely distinct, but there are a few instances of duplicate first elements (not sure how many, actually, but by comparison to the cardinality of the list it's not many).

My question: Is there a more efficient way to filter a list of tuples by a list of elements of those tuples?

(Apologies if this is off-topic or a dupe.)

Related question (which does not mention efficiency):

Filter a list of lists of tuples

Answer

In a comment you write:

The filter list contains 30,000 words and the list of tuples contains about 134,000 2-tuples.

in containment tests against a list takes O(N) linear time, which is slow when you do this 134k times. Each time you have to iterate over all those elements to find a match. Given that you are filtering, not all those first elements are going to be present in the 30k list, so you are executing up to 30k * 134k == 4 billion comparisons.

Use a set instead:

filter_set = set(filter)

Set containment tests are O(1) constant time; now you reduced your problem to 134k tests.

A much smaller component of time you can avoid spending is the tuple assignment; use indexing to extract just the one element you are testing with:

tuples_filtered = [tup for tup in tuples if tup[0] in filter_set]