titipat - 9 months ago 32

Python Question

I have dictionary of list as follows (it can be more than 1M elements, also assume dictionary is sorted by key)

`import scipy.sparse as sp`

d = {0: [0,1], 1: [1,2,3],

2: [3,4,5], 3: [4,5,6],

4: [5,6,7], 5: [7],

6: [7,8,9]}

I want to know what is the most efficient way (fastest way for large dictionary) to convert it into list of row and column index like:

`r_index = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6, 6]`

c_index = [0, 1, 1, 2, 3, 3, 4, 5, 4, 5, 6, 5, 6, 7, 7, 7, 8, 9]

Here are some solutions that I have so far:

- Using iteration

`row_ind = [k for k, v in d.iteritems() for _ in range(len(v))] # or d.items() in Python 3`

col_ind = [i for ids in d.values() for i in ids]

- Using pandas library

`import pandas as pd`

df = pd.DataFrame.from_dict(d, orient='index')

df = df.stack().reset_index()

row_ind = list(df['level_0'])

col_ind = list(df[0])

- Using itertools

`import itertools`

indices = [(x,y) for x, y in itertools.chain.from_iterable([itertools.product((k,), v) for k, v in d.items()])]

indices = np.array(indices)

row_ind = indices[:, 0]

col_ind = indices[:, 1]

I'm not sure which way is the fastest way to deal with this problem if I have a lot of elements in my dictionary. Thanks!

Answer

The first rule of thumb of optimization in python is, to make sure that your innermost loop is outsourced to some library function. This only applies for cpython - pypy is a completely different story. In your case using extend is giving some significant speedup.

```
import time
l = range(10000)
x = dict([(k, list(l)) for k in range(1000)])
def org(d):
row_ind = [k for k, v in d.items() for _ in range(len(v))]
col_ind = [i for ids in d.values() for i in ids]
def ext(d):
row_ind = [k for k, v in d.items() for _ in range(len(v))]
col_ind = []
for ids in d.values():
col_ind.extend(ids)
def ext_both(d):
row_ind = []
for k, v in d.items():
row_ind.extend([k] * len(v))
col_ind = []
for ids in d.values():
col_ind.extend(ids)
functions = [org, ext, ext_both]
for func in functions:
begin = time.time()
func(x)
elapsed = time.time() - begin
print(func.__name__ + ": " + str(elapsed))
```

Output when using python2:

```
org: 0.512559890747
ext: 0.340406894684
ext_both: 0.149670124054
```