P-M - 2 months ago 13

Python Question

I have a piece of code that loops through a set of nodes and counts the path length connecting the given node to each other node in my network. For each node my code returns me a list,

`b`

`local_path_length_hist = {}`

for ver in vertices:

dist = gt.shortest_distance(g, source=g.vertex(ver))

a = dist.a

#Delete some erroneous entries

b = a[a!=2147483647]

for dist in b:

if dist in local_path_length_hist:

local_path_length_hist[dist]+=1

else:

local_path_length_hist[dist]=1

This presumably is very crude coding as far as the dictionary update is concerned. Is there a better way of doing this? What is the most efficient way of creating this histogram?

Answer

Since `gt.shortest_distance`

returns an `ndarray`

, `numpy`

math is fastest:

```
max_dist = len(vertices) - 1
hist_length = max_dist + 2
no_path_dist = max_dist + 1
hist = np.zeros(hist_length)
for ver in vertices:
dist = gt.shortest_distance(g, source=g.vertex(ver))
hist += np.bincount(dist.a.clip(max=no_path_dist))
```

I use the `ndarray`

method `clip`

to bin the `2147483647`

values returned by `gt.shortest_distance`

at the last position of `hist`

. Without use of `clip`

, `hist's`

`size`

would have to be `2147483647 + 1`

on 64-bit Python, or `bincount`

would produce a `ValueError`

on 32-bit Python. So the last position of `hist`

will contain a count of all non-paths; you can ignore this value in your histogram analysis.

As the below timings indicate, using `numpy`

math to obtain a histogram is well over an order of magnitude faster than using either `defaultdicts`

or `counters`

(Python 3.4):

```
# vertices numpy defaultdict counter
9000 0.83639 38.48990 33.56569
25000 8.57003 314.24265 262.76025
50000 26.46427 1303.50843 1111.93898
```

My computer is too slow to test with `9 * (10**6)`

vertices, but relative timings seem pretty consistent for varying number of vertices (as we would expect).

*timing code*:

```
from collections import defaultdict, Counter
import numpy as np
from random import randint, choice
from timeit import repeat
# construct distance ndarray such that:
# a) 1/3 of values represent no path
# b) 2/3 of values are a random integer value [0, (num_vertices - 1)]
num_vertices = 50000
no_path_length = 2147483647
distances = []
for _ in range(num_vertices):
rand_dist = randint(0,(num_vertices-1))
distances.append(choice((no_path_length, rand_dist, rand_dist)))
dist_a = np.array(distances)
def use_numpy_math():
max_dist = num_vertices - 1
hist_length = max_dist + 2
no_path_dist = max_dist + 1
hist = np.zeros(hist_length, dtype=np.int)
for _ in range(num_vertices):
hist += np.bincount(dist_a.clip(max=no_path_dist))
def use_default_dict():
d = defaultdict(int)
for _ in range(num_vertices):
for dist in dist_a:
d[dist] += 1
def use_counter():
hist = Counter()
for _ in range(num_vertices):
hist.update(dist_a)
t1 = min(repeat(stmt='use_numpy_math()', setup='from __main__ import use_numpy_math',
repeat=3, number=1))
t2 = min(repeat(stmt='use_default_dict()', setup='from __main__ import use_default_dict',
repeat= 3, number=1))
t3 = min(repeat(stmt='use_counter()', setup='from __main__ import use_counter',
repeat= 3, number=1))
print('%0.5f, %0.5f. %0.5f' % (t1, t2, t3))
```