
Python heapq vs sorted speed for pre-sorted lists

I have a reasonably large number n=10000 of sorted lists of length k=100 each. Since merging two sorted lists takes linear time, I would imagine it's cheaper to recursively merge the sorted lists (nk elements in total) with heapq.merge() in a tree of depth log(n), for O(nk log(n)) total work, than to sort the entire thing at once with sorted() in O(nk log(nk)) time.

However, the sorted() approach seems to be 17-44x faster on my machine. Is the implementation of sorted() that much faster than heapq.merge() that it outstrips the asymptotic time advantage of the classic merge?

import itertools
import heapq

data = [range(n*8000,n*8000+10000,100) for n in range(10000)]

# Approach 1: a single n-way merge of all the sorted iterables
for val in heapq.merge(*data):
    test = val

# Approach 2: flatten everything and sort it in one pass
for val in sorted(itertools.chain(*data)):
    test = val
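
(For reference, the pairwise tree of merges described above would look roughly like the sketch below. merge_tree is a hypothetical helper and is not part of the timed code; heapq.merge(*data) in Approach 1 instead performs a single n-way merge using one heap.)

def merge_tree(lists):
    # Recursively merge sorted iterables pairwise, giving a tree of depth ~log2(n).
    # Each level materializes its intermediate result as a list.
    # (heapq is already imported above.)
    if len(lists) == 1:
        return list(lists[0])
    mid = len(lists) // 2
    return list(heapq.merge(merge_tree(lists[:mid]), merge_tree(lists[mid:])))

# e.g. merged = merge_tree(data)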

Answer

CPython's list.sort() uses an adaptive merge sort, which identifies natural runs in the input, and then merges them "intelligently". It's very effective at exploiting many kinds of pre-existing order. For example, try sorting range(N)*2 (in Python 2) for increasing values of N, and you'll find the time needed grows linearly in N.
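
A quick way to try that in Python 3 (a sketch, not from the original answer; list(range(N)) * 2 replaces the Python 2 spelling, and sorted() uses the same algorithm as list.sort()):

import timeit

for N in (10**5, 2 * 10**5, 4 * 10**5, 8 * 10**5):
    # Two ascending runs back to back: exactly the kind of pre-existing order
    # the adaptive merge sort detects, so the time grows roughly linearly in N.
    t = timeit.timeit("sorted(data)",
                      setup="data = list(range(%d)) * 2" % N,
                      number=10)
    print(N, t)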

So the only real advantage of heapq.merge() in this application is lower peak memory use if you iterate over the results (instead of materializing an ordered list containing all the results).
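
For instance, if only a prefix of the merged order is needed, the lazy approach never builds the full million-element list (a small sketch, using the data list defined in the question):

import heapq
import itertools

# heapq.merge() yields values lazily; taking the 10 smallest keeps only one
# heap entry per input iterable alive, instead of materializing all n*k values
# the way sorted(itertools.chain(*data)) does.
smallest_10 = list(itertools.islice(heapq.merge(*data), 10))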

In fact, list.sort() is taking more advantage of the structure in your specific data than the heapq.merge() approach. I have some insight into this, because I wrote Python's list.sort() ;-)

(BTW, I see you already accepted an answer, and that's fine by me - it's a good answer. I just wanted to give a bit more info.)

ABOUT THAT "more advantage"

As discussed a bit in comments, list.sort() plays lots of engineering tricks that can cut the number of comparisons needed well below what heapq.merge() needs. It depends on the data. Here's a quick account of what happens for the specific data in your question. First define a class that counts the number of comparisons performed (note that I'm using Python 3, so have to account for all possible comparisons):

class V(object):
    def __init__(self, val):
        self.val = val

    def __lt__(a, b):
        global ncmp
        ncmp += 1
        return a.val < b.val

    def __eq__(a, b):
        global ncmp
        ncmp += 1
        return a.val == b.val

    def __le__(a, b):
        raise ValueError("unexpected comparison")

    __ne__ = __gt__ = __ge__ = __le__

sort() was deliberately written to use only < (__lt__). It's more of an accident in heapq (and, as I recall, even varies across Python versions), but it turns out .merge() only required < and ==. So those are the only comparisons the class defines in a useful way.

Then changing your data to use instances of that class:

data = [[V(i) for i in range(n*8000,n*8000+10000,100)]
        for n in range(10000)]

Then run both methods:

ncmp = 0
for val in heapq.merge(*data):
    test = val
print(format(ncmp, ","))

ncmp = 0
for val in sorted(itertools.chain(*data)):
    test = val
print(format(ncmp, ","))

The output is kinda remarkable:

43,207,638
1,639,884

So sorted() required far fewer comparisons than merge(), for this specific data. And that's the primary reason it's much faster.
