jxn -4 years ago 143
Python Question

# Python using for loops to do a groupby is too slow, faster way?

I have a unique list of integers extracted from the first index of each tuple in

`l2`
.

I am trying to do something like a
`groupby`
in
`l2`
on the first index of the tuple (i.e each item in the unique list) so I can count the number of occurrences that the second index of the tuple in
`l2`
exist in
`l3`
. - please see the example.

To do this, I have a dictionary set up for each item in the unique list and it resets after each loop. The dict key is each value in
`l3`
.

My code works fine, it is just very slow when I have a ton of data because of the many loops.

Any way to make this more efficient and fast?

``````l1 = [1,2,3]
l2 = [(1,'a'),(3,'c'),(3,'b'),(2,'b'),(1,'a'),(3,'a')]
l3 = ['a','b']

d = defaultdict(int)
for i in l1:
d = d.fromkeys(d, 0) # reset dict values to 0
for t in l2:
if i==t[0]:
if t[1] in l3:
d[t[1]] +=1
print d
``````

Example:

``````when i == 1:
d = {'a':2,'b':0}
``````

Make `l3` a set for fast membership testing. Put all `l1`-based counters into a dictionary; that way you don't need to loop over `l1` at all and just use the `t[0]` value to select the right counter:

``````counts = {i: defaultdict(int) for i in l1}
s3 = set(l3)
for t0, t1 in l2:
# only count if t[1] is included in l3, and t[0] is in l1
if t1 not in s3 or t0 not in counts:
continue
counts[t0][t1] += 1

for d in counts.itervalues():
print d
``````

This removes two multipliers; `len(l1)` and `len(l3)`, so what was once a O(NKM) loop is now a O(K) loop.

This does increase the memory requirements, as you now need to track `len(l1)` `defaultdict` objects. Allocating the memory for those objects up-front will take some time too.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download