Katie - 10 months ago 54

Python Question

I have a reasonably large number n=10000 of sorted lists of length k=100 each. Since merging two sorted lists takes linear time, I would imagine its cheaper to recursively merge the sorted lists of length O(nk) with

`heapq.merge()`

`sorted()`

However, the

`sorted()`

`sorted()`

`heapq.merge()`

`import itertools`

import heapq

data = [range(n*8000,n*8000+10000,100) for n in range(10000)]

# Approach 1

for val in heapq.merge(*data):

test = val

# Approach 2

for val in sorted(itertools.chain(*data)):

test = val

Answer

CPython's `list.sort()`

uses an adaptive merge sort, which identifies natural runs in the input, and then merges them "intelligently". It's very effective at exploiting many kinds of pre-existing order. For example, try sorting `range(N)*2`

(in Python 2) for increasing values of `N`

, and you'll find the time needed grows linearly in `N`

.

So the only real advantage of `heapq.merge()`

in this application is lower peak memory use *if* you iterate over the results (instead of materializing an ordered list containing all the results).

In fact, `list.sort()`

is taking *more* advantage of the structure in your specific data than the `heapq.merge()`

approach. I have some insight into this, because I wrote Python's `list.sort()`

;-)

(BTW, I see you already accepted an answer, and that's fine by me - it's a good answer. I just wanted to give a bit more info.)

As discussed a bit in comments, `list.sort()`

plays lots of engineering tricks that *may* cut the number of comparisons needed over what `heapq.merge()`

needs. It depends on the data. Here's a quick account of what happens for the specific data in your question. First define a class that counts the number of comparisons performed (note that I'm using Python 3, so have to account for all possible comparisons):

```
class V(object):
def __init__(self, val):
self.val = val
def __lt__(a, b):
global ncmp
ncmp += 1
return a.val < b.val
def __eq__(a, b):
global ncmp
ncmp += 1
return a.val == b.val
def __le__(a, b):
raise ValueError("unexpected comparison")
__ne__ = __gt__ = __ge__ = __le__
```

`sort()`

was deliberately written to use only `<`

(`__lt__`

). It's more of an accident in `heapq`

(and, as I recall, even varies across Python versions), but it turns out `.merge()`

only required `<`

and `==`

. So those are the only comparisons the class defines in a useful way.

Then changing your data to use instances of that class:

```
data = [[V(i) for i in range(n*8000,n*8000+10000,100)]
for n in range(10000)]
```

Then run both methods:

```
ncmp = 0
for val in heapq.merge(*data):
test = val
print(format(ncmp, ","))
ncmp = 0
for val in sorted(itertools.chain(*data)):
test = val
print(format(ncmp, ","))
```

The output is kinda remarkable:

```
43,207,638
1,639,884
```

So `sorted()`

required *far* fewer comparisons than `merge()`

, for this specific data. And that's the primary reason it's much faster.