thegreatt thegreatt - 1 month ago 5
Python Question

Most efficient way to sum huge 2D NumPy array, grouped by ID column?

I have a massive data array (500k rows) that looks like:

id value score
1 20 20
1 10 30
1 15 0
2 12 4
2 3 8
2 56 9
3 6 18
...


As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.

I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like
SELECT sum(score) FROM table GROUP BY id


With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.

table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])


Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?

Answer

you can use bincount():

import numpy as np

ids = [1,1,1,2,2,2,3]
data = [20,30,0,4,8,9,18]

print np.bincount(ids, weights=data)

the output is [ 0. 50. 21. 18.], which means the sum of id==0 is 0, the sum of id==1 is 50.

Comments