titipat - 1 year ago 68

Python Question

I have dictionary of list as follows (it can be more than 1M elements, also assume dictionary is sorted by key)

`import scipy.sparse as sp`

d = {0: [0,1], 1: [1,2,3],

2: [3,4,5], 3: [4,5,6],

4: [5,6,7], 5: [7],

6: [7,8,9]}

I want to know what is the most efficient way (fastest way for large dictionary) to convert it into list of row and column index like:

`r_index = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6, 6]`

c_index = [0, 1, 1, 2, 3, 3, 4, 5, 4, 5, 6, 5, 6, 7, 7, 7, 8, 9]

Here are some solutions that I have so far:

- Using iteration

`row_ind = [k for k, v in d.iteritems() for _ in range(len(v))] # or d.items() in Python 3`

col_ind = [i for ids in d.values() for i in ids]

- Using pandas library

`import pandas as pd`

df = pd.DataFrame.from_dict(d, orient='index')

df = df.stack().reset_index()

row_ind = list(df['level_0'])

col_ind = list(df[0])

- Using itertools

`import itertools`

indices = [(x,y) for x, y in itertools.chain.from_iterable([itertools.product((k,), v) for k, v in d.items()])]

indices = np.array(indices)

row_ind = indices[:, 0]

col_ind = indices[:, 1]

I'm not sure which way is the fastest way to deal with this problem if I have a lot of elements in my dictionary. Thanks!

Recommended for you: Get network issues from **WhatsUp Gold**. **Not end users.**

Answer Source

The first rule of thumb of optimization in python is, to make sure that your innermost loop is outsourced to some library function. This only applies for cpython - pypy is a completely different story. In your case using extend is giving some significant speedup.

```
import time
l = range(10000)
x = dict([(k, list(l)) for k in range(1000)])
def org(d):
row_ind = [k for k, v in d.items() for _ in range(len(v))]
col_ind = [i for ids in d.values() for i in ids]
def ext(d):
row_ind = [k for k, v in d.items() for _ in range(len(v))]
col_ind = []
for ids in d.values():
col_ind.extend(ids)
def ext_both(d):
row_ind = []
for k, v in d.items():
row_ind.extend([k] * len(v))
col_ind = []
for ids in d.values():
col_ind.extend(ids)
functions = [org, ext, ext_both]
for func in functions:
begin = time.time()
func(x)
elapsed = time.time() - begin
print(func.__name__ + ": " + str(elapsed))
```

Output when using python2:

```
org: 0.512559890747
ext: 0.340406894684
ext_both: 0.149670124054
```

Recommended from our users: **Dynamic Network Monitoring from WhatsUp Gold from IPSwitch**. ** Free Download**