jxn · 4 years ago · 143
Python Question

Python using for loops to do a groupby is too slow, faster way?

I have a unique list of integers, l1, extracted from the first index of each tuple in l2.

I am trying to do something like a groupby on the first index of the tuples (i.e. each item in the unique list) so I can count the number of occurrences where the second index of a tuple in l2 exists in l3 - please see the example.

To do this, I set up a dictionary for each item in the unique list, resetting it after each loop. The dict keys are the values in l3.
My code works fine, but it is very slow on large inputs because of the nested loops.

Is there a way to make this more efficient?

l1 = [1, 2, 3]
l2 = [(1, 'a'), (3, 'c'), (3, 'b'), (2, 'b'), (1, 'a'), (3, 'a')]
l3 = ['a', 'b']

for i in l1:
    d = dict.fromkeys(l3, 0)  # reset all counts to 0
    for t in l2:
        if i == t[0]:
            if t[1] in l3:
                d[t[1]] += 1
    print(d)


when i == 1:
d = {'a':2,'b':0}

Answer Source

Make l3 a set for fast membership testing. Put all l1-based counters into a dictionary; that way you don't need to loop over l1 at all and just use the t[0] value to select the right counter:

from collections import defaultdict

counts = {i: defaultdict(int) for i in l1}
s3 = set(l3)
for t0, t1 in l2:
    # only count if t1 is included in l3, and t0 is in l1
    if t1 not in s3 or t0 not in counts:
        continue
    counts[t0][t1] += 1

for d in counts.values():
    print(d)
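If you don't need a separate dict per group, a single collections.Counter keyed on (group, value) pairs gives the same counts in one pass. This is a sketch of an equivalent alternative, not part of the original answer:

```python
from collections import Counter

l1 = [1, 2, 3]
l2 = [(1, 'a'), (3, 'c'), (3, 'b'), (2, 'b'), (1, 'a'), (3, 'a')]
l3 = ['a', 'b']

s1, s3 = set(l1), set(l3)
# one Counter keyed by (group, value) pairs; no per-group dicts
pair_counts = Counter((t0, t1) for t0, t1 in l2 if t0 in s1 and t1 in s3)
# e.g. pair_counts[(1, 'a')] == 2
```

A Counter also returns 0 for pairs it never saw, so no pre-seeding of keys is needed.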

This removes two multipliers, len(l1) and len(l3): what was once an O(NKM) loop (with N = len(l1), K = len(l2) and M = len(l3)) is now an O(K) loop.

This does increase the memory requirements, as you now need to track len(l1) defaultdict objects. Allocating the memory for those objects up-front will take some time too.
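If the up-front allocation is a concern, one possible variant (a sketch, not from the original answer) allocates each group's counter lazily with a nested defaultdict, at the cost of needing a separate set for the l1 membership test:

```python
from collections import defaultdict

l1 = [1, 2, 3]
l2 = [(1, 'a'), (3, 'c'), (3, 'b'), (2, 'b'), (1, 'a'), (3, 'a')]
l3 = ['a', 'b']

s1, s3 = set(l1), set(l3)
# each group's counter is created only when its first match arrives;
# groups with no matching tuples are never allocated
counts = defaultdict(lambda: defaultdict(int))
for t0, t1 in l2:
    if t0 in s1 and t1 in s3:
        counts[t0][t1] += 1
```

Note that with this version, groups from l1 that never appear in l2 are simply absent from counts rather than present with zero counts.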
