Katie · 1 year ago

Python heapq vs sorted speed for pre-sorted lists

I have a reasonably large number n = 10000 of sorted lists of length k = 100 each. Since merging two sorted lists takes linear time, I would imagine it's cheaper to recursively merge them with `heapq.merge()` in a tree of depth log(n), for O(nk log(n)) total work, than to sort the entire thing at once with `sorted()` in O(nk log(nk)) time.
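
Plugging in n = 10000 and k = 100, a rough back-of-envelope comparison count (one comparison per element touch per merge level, ignoring constants) looks like:

```python
import math

n, k = 10_000, 100

# Tree of pairwise merges: log2(n) levels, and each level touches all
# n*k elements once, so roughly n*k*log2(n) comparisons.
merge_tree = n * k * math.log2(n)

# One big comparison sort that ignores any pre-existing order:
full_sort = n * k * math.log2(n * k)

print(f"merge tree ~ {merge_tree:,.0f}, full sort ~ {full_sort:,.0f}")
```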

However, the `sorted()` approach seems to be 17-44x faster on my machine. Is the implementation of `sorted()` so much faster than `heapq.merge()` that it outstrips the asymptotic advantage of the classic merge?

```python
import itertools
import heapq

data = [range(n*8000, n*8000+10000, 100) for n in range(10000)]

# Approach 1
for val in heapq.merge(*data):
    test = val

# Approach 2
for val in sorted(itertools.chain(*data)):
    test = val
```
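
For reference, a scaled-down version of this benchmark (1000 lists instead of 10000, timed with `timeit`; absolute numbers will vary by machine) reproduces the gap:

```python
import heapq
import itertools
import timeit

# Scaled-down version of the data above: 1000 sorted lists of
# 100 elements each (the original question uses 10000 lists).
data = [list(range(n * 8000, n * 8000 + 10000, 100)) for n in range(1000)]

def merge_approach():
    for val in heapq.merge(*data):
        test = val

def sorted_approach():
    for val in sorted(itertools.chain(*data)):
        test = val

t_merge = timeit.timeit(merge_approach, number=5)
t_sorted = timeit.timeit(sorted_approach, number=5)
print(f"merge: {t_merge:.3f}s  sorted: {t_sorted:.3f}s")
```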

CPython's `list.sort()` uses an adaptive merge sort, which identifies natural runs in the input, and then merges them "intelligently". It's very effective at exploiting many kinds of pre-existing order. For example, try sorting `range(N)*2` (in Python 2) for increasing values of `N`, and you'll find the time needed grows linearly in `N`.
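
In Python 3, the `range(N)*2` experiment can be reproduced as `list(range(N)) * 2`; a quick timing sketch of the linear-growth claim:

```python
import time

# Python 3 spelling of the range(N)*2 experiment: the input is two
# ascending runs, which Timsort detects and merges in one linear pass,
# so the time should grow roughly linearly in N.
for N in (100_000, 200_000, 400_000):
    data = list(range(N)) * 2
    start = time.perf_counter()
    data.sort()
    elapsed = time.perf_counter() - start
    print(f"N={N:>7,}: {elapsed:.4f}s")
```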

So the only real advantage of `heapq.merge()` in this application is lower peak memory use if you iterate over the results (instead of materializing an ordered list containing all the results).
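
To make that memory difference concrete, here is a small sketch (my own illustration using `tracemalloc`, not from the original answer) comparing peak allocation when materializing all results versus consuming them lazily:

```python
import heapq
import itertools
import tracemalloc

# Ten sorted lists of 10000 elements each.
streams = [list(range(n, n + 100000, 10)) for n in range(10)]

# sorted() materializes one big list of all the results up front...
tracemalloc.start()
total_sorted = sum(sorted(itertools.chain(*streams)))
peak_sorted = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

# ...while heapq.merge() yields one result at a time, keeping only a
# small heap with one "head" entry per input stream.
tracemalloc.start()
total_merge = sum(heapq.merge(*streams))
peak_merge = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

print(f"peak bytes -- sorted: {peak_sorted:,}  merge: {peak_merge:,}")
```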

In fact, `list.sort()` is taking more advantage of the structure in your specific data than the `heapq.merge()` approach. I have some insight into this, because I wrote Python's `list.sort()` ;-)

As discussed a bit in comments, `list.sort()` plays lots of engineering tricks that may cut the number of comparisons needed over what `heapq.merge()` needs. It depends on the data. Here's a quick account of what happens for the specific data in your question. First define a class that counts the number of comparisons performed (note that I'm using Python 3, so have to account for all possible comparisons):

```python
class V(object):
    def __init__(self, val):
        self.val = val

    def __lt__(a, b):
        global ncmp
        ncmp += 1
        return a.val < b.val

    def __eq__(a, b):
        global ncmp
        ncmp += 1
        return a.val == b.val

    def __le__(a, b):
        raise ValueError("unexpected comparison")

    __ne__ = __gt__ = __ge__ = __le__
```

`sort()` was deliberately written to use only `<` (`__lt__`). It's more of an accident in `heapq` (and, as I recall, even varies across Python versions), but it turns out `.merge()` only required `<` and `==`. So those are the only comparisons the class defines in a useful way.
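
A minimal illustration of the `sorted()` half of that claim (my own sketch, not from the original answer): a class that defines `__lt__` and no other comparison method still sorts fine.

```python
# This class defines __lt__ and nothing else comparison-wise, yet
# sorted() handles it, because the sort only ever asks "a < b?".
class OnlyLT:
    def __init__(self, val):
        self.val = val
    def __lt__(self, other):
        return self.val < other.val

items = [OnlyLT(3), OnlyLT(1), OnlyLT(2)]
print([x.val for x in sorted(items)])
```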

Then changing your data to use instances of that class:

```python
data = [[V(i) for i in range(n*8000, n*8000+10000, 100)]
        for n in range(10000)]
```

Then run both methods:

```python
ncmp = 0
for val in heapq.merge(*data):
    test = val
print(format(ncmp, ","))

ncmp = 0
for val in sorted(itertools.chain(*data)):
    test = val
print(format(ncmp, ","))
```

The output is kinda remarkable:

```
43,207,638
1,639,884
```

So `sorted()` required far fewer comparisons than `merge()`, for this specific data. And that's the primary reason it's much faster.
