titipat titipat - 3 months ago 8
Python Question

Efficient way to convert dictionary of list to pair list of key and value

I have dictionary of list as follows (it can be more than 1M elements, also assume dictionary is sorted by key)

import scipy.sparse as sp
d = {0: [0,1], 1: [1,2,3],
2: [3,4,5], 3: [4,5,6],
4: [5,6,7], 5: [7],
6: [7,8,9]}


I want to know what is the most efficient way (fastest way for large dictionary) to convert it into list of row and column index like:

r_index = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6, 6]
c_index = [0, 1, 1, 2, 3, 3, 4, 5, 4, 5, 6, 5, 6, 7, 7, 7, 8, 9]


Here are some solutions that I have so far:


  1. Using iteration

    row_ind = [k for k, v in d.iteritems() for _ in range(len(v))] # or d.items() in Python 3
    col_ind = [i for ids in d.values() for i in ids]

  2. Using pandas library

    import pandas as pd
    df = pd.DataFrame.from_dict(d, orient='index')
    df = df.stack().reset_index()
    row_ind = list(df['level_0'])
    col_ind = list(df[0])

  3. Using itertools

    import itertools
    indices = [(x,y) for x, y in itertools.chain.from_iterable([itertools.product((k,), v) for k, v in d.items()])]
    indices = np.array(indices)
    row_ind = indices[:, 0]
    col_ind = indices[:, 1]



I'm not sure which way is the fastest way to deal with this problem if I have a lot of elements in my dictionary. Thanks!

Answer

The first rule of thumb of optimization in python is, to make sure that your innermost loop is outsourced to some library function. This only applies for cpython - pypy is a completely different story. In your case using extend is giving some significant speedup.

import time
l = range(10000)
x = dict([(k, list(l)) for k in range(1000)])

def org(d):
    row_ind = [k for k, v in d.items() for _ in range(len(v))]
    col_ind = [i for ids in d.values() for i in ids]

def ext(d):
    row_ind = [k for k, v in d.items() for _ in range(len(v))]
    col_ind = []
    for ids in d.values():
        col_ind.extend(ids)

def ext_both(d):
    row_ind = []
    for k, v in d.items():
        row_ind.extend([k] * len(v))
    col_ind = []
    for ids in d.values():
        col_ind.extend(ids)

functions = [org, ext, ext_both]
for func in functions:
    begin = time.time()
    func(x)
    elapsed = time.time() - begin
    print(func.__name__ + ": "  + str(elapsed))

Output when using python2:

org: 0.512559890747
ext: 0.340406894684
ext_both: 0.149670124054