knzhou - 8 months ago 36

Python Question

I have a set of numpy arrays. One of these is a list of "keys", and I'd like to rearrange the arrays into a dict of arrays keyed on that key. My current code is:

`for key, val1, val2 in itertools.izip(keys, vals1, vals2):`

dict1[key].append(val1)

dict2[key].append(val2)

This is pretty slow, since the arrays involved are millions of entries long, and this happens many times. Is it possible to rewrite this in vectorized form? The set of possible keys is known ahead of time, and there are ~10 distinct keys.

Answer

Some timings:

```
import numpy as np
import itertools
def john1024(keys, v1, v2):
d1 = {}; d2 = {};
for k in set(keys):
d1[k] = v1[k==keys]
d2[k] = v2[k==keys]
return d1,d2
def birico(keys, v1, v2):
order = keys.argsort()
keys_sorted = keys[order]
diff = np.ones(keys_sorted.shape, dtype=bool)
diff[1:] = keys_sorted[1:] != keys_sorted[:-1]
key_change = diff.nonzero()[0]
uniq_keys = keys_sorted[key_change]
v1_split = np.split(v1[order], key_change[1:])
d1 = dict(zip(uniq_keys, v1_split))
v2_split = np.split(v2[order], key_change[1:])
d2 = dict(zip(uniq_keys, v2_split))
return d1,d2
def knzhou(keys, v1, v2):
d1 = {k:[] for k in np.unique(keys)}
d2 = {k:[] for k in np.unique(keys)}
for key, val1, val2 in itertools.izip(keys, v1, v2):
d1[key].append(val1)
d2[key].append(val2)
return d1,d2
```

I used 10 keys, 20 million entries:

```
import timeit
keys = np.random.randint(0, 10, size=20000000) #10 keys, 20M entries
vals1 = np.random.random(keys.shape)
vals2 = np.random.random(keys.shape)
timeit.timeit("john1024(keys, vals1, vals2)", "from __main__ import john1024, keys, vals1, vals2", number=3)
11.121668815612793
timeit.timeit("birico(keys, vals1, vals2)", "from __main__ import birico, keys, vals1, vals2", number=3)
8.107877969741821
timeit.timeit("knzhou(keys, vals1, vals2)", "from __main__ import knzhou, keys, vals1, vals2", number=3)
51.76217794418335
```

So, we see than the sorting technique is a bit faster than letting Numpy find the indices corresponding to each key, but of course both are much much faster than looping in Python. Vectorization is great!

This is on Python 2.7.12, Numpy 1.9.2