C++ Question

`std::list<>::sort()` - why the sudden switch to top-down strategy?

I remember that since the beginning of time the most popular approach to implementing `std::list<>::sort()` was the classic Merge Sort algorithm implemented in bottom-up fashion (see also What makes the gcc std::list sort implementation so fast?).

I remember seeing someone aptly refer to this strategy as the "onion chaining" approach.

At least that's the way it is in GCC's implementation of the C++ standard library (see, for example, here). And this is how it was in the old Dinkumware STL in MSVC's version of the standard library, as well as in all versions of MSVC all the way up to VS2013.

However, the standard library supplied with VS2015 suddenly no longer follows this sorting strategy. The library shipped with VS2015 uses a rather straightforward recursive implementation of top-down Merge Sort. This strikes me as strange, since the top-down approach requires access to the mid-point of the list in order to split it in half. Since `std::list<>` does not support random access, the only way to find that mid-point is to literally iterate through half of the list. Also, at the very beginning it is necessary to know the total number of elements in the list (which was not necessarily an O(1) operation before C++11).

Nevertheless, `std::list<>::sort()` in VS2015 does exactly that. Here's an excerpt from that implementation that locates the mid-point and performs the recursive calls:

    ...
    // walk to the midpoint: a linear scan over half the range
    iterator _Mid = _STD next(_First, _Size / 2);
    // recursively sort each half (_Sort returns the new first iterator)
    _First = _Sort(_First, _Mid, _Pred, _Size / 2);
    _Mid = _Sort(_Mid, _Last, _Pred, _Size - _Size / 2);
    ...


As you can see, they just nonchalantly use `std::next` to walk through the first half of the list and arrive at the `_Mid` iterator.
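
To see the whole strategy in miniature, here is a minimal standalone sketch of top-down merge sort written against the public std::list interface (my own illustration, not the VS2015 code, which works in place through iterators):

    #include <list>
    #include <iterator>   // std::next

    // Top-down merge sort over std::list<T>: walk to the midpoint,
    // split, recurse on each half, then merge the sorted halves.
    template<class T>
    void topdown_sort(std::list<T>& lst)
    {
        if (lst.size() < 2)   // size() is O(1) since C++11
            return;
        // the linear scan the question is about
        auto mid = std::next(lst.begin(), lst.size() / 2);
        std::list<T> back;
        back.splice(back.begin(), lst, mid, lst.end());  // detach back half
        topdown_sort(lst);    // sort front half in place
        topdown_sort(back);   // sort back half
        lst.merge(back);      // merge the two sorted halves
    }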

What could be the reason behind this switch, I wonder? All I see is a seemingly obvious inefficiency of the repetitive calls to `std::next` at each level of recursion. Naive logic says this is slower. If they are willing to pay this kind of price, they probably expect to get something in return. What are they getting, then? I don't immediately see this algorithm as having better cache behavior (compared to the original bottom-up approach). I don't immediately see it as behaving better on pre-sorted sequences.
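
For scale: at each level of the recursion, the std::next calls collectively walk about N/2 nodes, and there are about log2(N) levels, so the midpoint scans alone add on the order of (N/2)·log2(N) node visits on top of the O(N·log N) compares; for a 4-million-element list that works out to roughly 44 million extra traversal steps.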

Granted, since C++11 `std::list<>` is basically required to store its element count, which makes the above slightly more efficient, since we always know the element count in advance. But that still does not seem to be enough to justify the sequential scan at each level of recursion.

(Admittedly, I haven't tried to race the implementations against each other. Maybe there are some surprises there.)

Answer

I noticed this change back in July 2016 and emailed P.J. Plauger about it on August 1, 2016. A snippet of his reply:

Interestingly enough, our change log doesn't reflect this change. That probably means it was "suggested" by one of our larger customers and got by me on the code review. All I know now is that the change came in around the autumn of 2015. When I reviewed the code, the first thing that struck me was the line:

    iterator _Mid = _STD next(_First, _Size / 2);

which, of course, can take a very long time for a large list.

The code looks a bit more elegant than what I wrote in early 1995(!), but definitely has worse time complexity. That version was modeled after the approach by Stepanov, Lee, and Musser in the original STL. They are seldom found to be wrong in their choice of algorithms.

He later replied that he would revert the Dinkumware version (Dinkumware being, I assume, the company he works for) to the prior bottom-up version using a small array, but Microsoft may or may not pick up this change.

For a benchmark of this, I created a linked list with 4 million elements, each consisting of one 64-bit unsigned integer, assuming I would end up with a doubly linked list of nearly sequentially ordered nodes (even though they were dynamically allocated). I filled them with random numbers, then sorted the list. The nodes don't move during the sort, only the linkage is changed, so afterwards traversing the list accesses the nodes in random order. I then filled those randomly ordered nodes with another set of random numbers and sorted them again. I compared the 2015 top-down approach with the prior bottom-up approach, modified to match the other changes made for 2015 (sort() now calls sort() with a predicate compare function, rather than having two separate functions). These are the results:

sequential nodes: 2015 version 1.6 seconds, prior version 1.5 seconds
random nodes:     2015 version 4.0 seconds, prior version 2.8 seconds

For sequential nodes, the prior version is only a bit faster, but for random nodes, the prior version is 30% faster.
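
For reference, here is a minimal sketch of the kind of harness described above (my reconstruction, assuming std::mt19937_64 for the random fill and std::chrono for timing; not the actual benchmark code):

    #include <list>
    #include <random>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        // 4 million nodes, freshly allocated back to back, so the first
        // pass traverses them in nearly sequential address order
        std::list<std::uint64_t> lst(4000000);
        std::mt19937_64 rng(12345);

        for (int pass = 0; pass < 2; ++pass)
        {
            for (auto& v : lst)   // (re)fill with random keys
                v = rng();
            auto t0 = std::chrono::steady_clock::now();
            lst.sort();
            auto t1 = std::chrono::steady_clock::now();
            std::printf("pass %d: %.1f s\n", pass,
                std::chrono::duration<double>(t1 - t0).count());
        }
        // After pass 0 the nodes have not moved, but the links now visit
        // them in random address order, which is what pass 1 measures.
        return 0;
    }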

Below is replacement code for sort() using the prior bottom-up method with a small array (_Binlist[]). Make a backup of <list>, or copy it to a local directory and use that instead; then replace the sort() code with the following:

    void sort()
        {   // order sequence, using operator<
        sort(less<>());
        }

    template<class _Pr2>
        void sort(_Pr2 _Pred)
        {   // order sequence, using _Pred
        if (2 > this->_Mysize())
            return;
        const size_t _MAXBINS = 25;
        _Myt _Templist, _Binlist[_MAXBINS];
        while (!empty())
            {
            // _Templist = next element
            _Templist._Splice_same(_Templist.begin(), *this, begin(),
                ++begin(), 1);
            // merge with array of ever larger bins
            size_t _Bin;
            for (_Bin = 0; _Bin < _MAXBINS && !_Binlist[_Bin].empty();
                ++_Bin)
                _Templist.merge(_Binlist[_Bin], _Pred);
            // don't go past end of array
            if (_Bin == _MAXBINS)
                _Bin--;
            // update bin with merged list, empty _Templist
            _Binlist[_Bin].swap(_Templist);
            }
        // merge bins back into caller's list
        for (size_t _Bin = 0; _Bin < _MAXBINS; _Bin++)
            if (!_Binlist[_Bin].empty())
                this->merge(_Binlist[_Bin], _Pred);
        }

I made some minor changes. The original code kept track of the actual maximum bin in a variable named _Maxbin, but the overhead in the final merge is small enough that I removed the code associated with _Maxbin. During the array build, the original code's inner loop merged into a _Binlist[] element, followed by a swap into _Templist, which seemed pointless. I changed the inner loop to just merge into _Templist, only swapping once an empty _Binlist[] element is found.
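
For anyone who would rather not patch <list>, the same bottom-up bin-merging strategy can be sketched against the public std::list interface (my own rendering of the algorithm above, not Dinkumware's code):

    #include <list>
    #include <cstddef>

    // Bottom-up merge sort: peel off one element at a time and merge it
    // up through an array of bins, where bin k holds a sorted run of up
    // to 2^k elements; finally merge all the bins back together.
    template<class T, class Pr>
    void bottomup_sort(std::list<T>& lst, Pr pred)
    {
        const std::size_t MAXBINS = 25;
        std::list<T> templist, binlist[MAXBINS];
        while (!lst.empty())
        {
            // templist = next element
            templist.splice(templist.begin(), lst, lst.begin());
            // merge with ever larger bins until an empty one is found
            std::size_t bin;
            for (bin = 0; bin < MAXBINS && !binlist[bin].empty(); ++bin)
                templist.merge(binlist[bin], pred);
            if (bin == MAXBINS)   // don't run past the end of the array
                --bin;
            binlist[bin].swap(templist);   // park the merged run
        }
        // merge the bins back into the caller's list
        for (std::size_t bin = 0; bin < MAXBINS; ++bin)
            if (!binlist[bin].empty())
                lst.merge(binlist[bin], pred);
    }

Note that this portable version, like the replacement above, default-constructs 26 temporary lists; that detail matters for the allocator caveat discussed next.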

Update: T.C. below points out a potential issue in the case of a custom allocator without a default constructor. I don't know how common this is (i.e., why a program would not want to allocate from the heap as usual), although there is a Stack Overflow thread about making such an allocator "STL compatible":

Dealing with no default constructor in custom allocators

Since the 2015 top-down version of std::list::sort only involves iterators, there are no allocations of lists or nodes, so it's not an issue there, other than the performance hit. The bottom-up approach dates back to 1995 and, as shown above, allocates 26 temporary lists (_Templist plus the 25 elements of _Binlist[]), each of which in turn allocates a sentinel node; those lists are default-constructed, which is exactly what a custom allocator without a default constructor breaks.
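
To make the failure mode concrete, here is a hypothetical stateful allocator with no default constructor (PoolAlloc, its pool member, and the arena variable are all inventions for illustration):

    #include <list>
    #include <cstddef>

    // Hypothetical allocator: must be handed an arena at construction,
    // so it deliberately has no default constructor.
    template<class T>
    struct PoolAlloc
    {
        using value_type = T;
        void* pool;   // stand-in for an external arena
        explicit PoolAlloc(void* p) : pool(p) {}
        template<class U> PoolAlloc(const PoolAlloc<U>& o) : pool(o.pool) {}
        T* allocate(std::size_t n)
            { return static_cast<T*>(::operator new(n * sizeof(T))); }
        void deallocate(T* p, std::size_t) { ::operator delete(p); }
    };
    template<class T, class U>
    bool operator==(const PoolAlloc<T>& a, const PoolAlloc<U>& b)
        { return a.pool == b.pool; }
    template<class T, class U>
    bool operator!=(const PoolAlloc<T>& a, const PoolAlloc<U>& b)
        { return !(a == b); }

    int main()
    {
        char arena[64];
        PoolAlloc<int> alloc(arena);
        std::list<int, PoolAlloc<int>> ok(alloc);  // allocator supplied: fine
        ok.assign({3, 1, 2});
        ok.sort();   // fine with the iterator-only top-down version
        // A bottom-up sort() that declares `_Myt _Templist, _Binlist[25];`
        // cannot compile here: those default constructions require a
        // default-constructible allocator.
        return 0;
    }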

For T.C.: a snippet of Visual Studio's <list>:

        // TEMPLATE CLASS list
template<class _Ty,
    class _Alloc = allocator<_Ty> >
    class list
        : public _List_buy<_Ty, _Alloc>
    {   // bidirectional linked list
public:
    typedef list<_Ty, _Alloc> _Myt;