jxn jxn -4 years ago 143
Python Question

Python using for loops to do a groupby is too slow, faster way?

I have a unique list of integers extracted from the first index of each tuple in

l2
.

I am trying to do something like a
groupby
in
l2
on the first index of the tuple (i.e each item in the unique list) so I can count the number of occurrences that the second index of the tuple in
l2
exist in
l3
. - please see the example.

To do this, I have a dictionary set up for each item in the unique list and it resets after each loop. The dict key is each value in
l3
.

My code works fine, it is just very slow when I have a ton of data because of the many loops.

Any way to make this more efficient and fast?

l1 = [1,2,3]
l2 = [(1,'a'),(3,'c'),(3,'b'),(2,'b'),(1,'a'),(3,'a')]
l3 = ['a','b']

d = defaultdict(int)
for i in l1:
d = d.fromkeys(d, 0) # reset dict values to 0
for t in l2:
if i==t[0]:
if t[1] in l3:
d[t[1]] +=1
print d


Example:

when i == 1:
d = {'a':2,'b':0}

Answer Source

Make l3 a set for fast membership testing. Put all l1-based counters into a dictionary; that way you don't need to loop over l1 at all and just use the t[0] value to select the right counter:

counts = {i: defaultdict(int) for i in l1}
s3 = set(l3)
for t0, t1 in l2:
    # only count if t[1] is included in l3, and t[0] is in l1
    if t1 not in s3 or t0 not in counts:
        continue
    counts[t0][t1] += 1

for d in counts.itervalues():
    print d

This removes two multipliers; len(l1) and len(l3), so what was once a O(NKM) loop is now a O(K) loop.

This does increase the memory requirements, as you now need to track len(l1) defaultdict objects. Allocating the memory for those objects up-front will take some time too.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download