Dance Party2 - 1 year ago 56
Python Question

# Python Compare Tokenized Lists

I need the fastest-possible solution to this problem as it will be applied to a huge data set:

Given this master list:

``````m=['abc','bcd','cde','def']
``````

...and this reference list of lists:

``````r=[['abc','def'],['bcd','cde'],['abc','def','bcd']]
``````

I'd like to compare each list within r to the master list (m) and generate a new list of lists. This new object will have a 1 for matches based on the order in m and 0 for non-matches. So the new object (list of lists) will always have the lists of the same length as m.
Here's what I would expect based on m and r above:

``````[[1,0,0,1],[0,1,1,0],[1,1,0,1]]
``````

Because the first element of r is
`['abc','def']`
and has a match
with the 1st and 4th elements of m, the result is then
`[1,0,0,1]`
.

Here's my approach so far (probably way too slow and is missing zeros):

``````output=[]
for i in r:
output.append([1 for x in m if x in i])
``````

resulting in:

``````[[1, 1], [1, 1], [1, 1, 1]]
``````

You can use a nested list comprehension like this:

``````>>> m = ['abc','bcd','cde','def']
>>> r = [['abc','def'],['bcd','cde'],['abc','def','bcd']]
>>> [[1 if mx in rx else 0 for mx in m] for rx in r]
[[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]]
``````

Also, you could shorten the `1 if ... else 0` using `int(...)`, and you can convert the sublists of `r` to `set`, so that the individual `mx in rx` lookups are faster.

``````>>> [[int(mx in rx) for mx in m] for rx in r]
[[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]]
>>> [[int(mx in rx) for mx in m] for rx in map(set, r)]
[[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]]
``````

While `int(...)` is a bit shorter than `1 if ... else 0`, it also seems to be slower, so you probably should not use that. Converting the sublists of `r` to `set` prior to the repeated lookup should speed things up for longer lists, but for you very short example lists, it's in fact slower than the naive approach.

``````>>> %timeit [[1 if mx in rx else 0 for mx in m] for rx in r]
100000 loops, best of 3: 4.74 µs per loop
>>> %timeit [[int(mx in rx) for mx in m] for rx in r]
100000 loops, best of 3: 8.07 µs per loop
>>> %timeit [[1 if mx in rx else 0 for mx in m] for rx in map(set, r)]
100000 loops, best of 3: 5.82 µs per loop
``````

For longer lists, using `set` becomes faster, as would be expected:

``````>>> m = [random.randint(1, 100) for _ in range(50)]
>>> r = [[random.randint(1,100) for _ in range(10)] for _ in range(20)]
>>> %timeit [[1 if mx in rx else 0 for mx in m] for rx in r]
1000 loops, best of 3: 412 µs per loop
>>> %timeit [[1 if mx in rx else 0 for mx in m] for rx in map(set, r)]
10000 loops, best of 3: 208 µs per loop
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download