Deniz Deniz - 11 days ago 6
C++ Question

C++ How to merge sorted vectors into a sorted vector / pop the least element from all of them?

I have a collection of about a hundred or so sorted

vector<int>
's Although most vectors have a small number of integers in them, some of the vectors contain a large (>10K) of them (thus the vectors don't necessarily have the same size).

What I'd like to do essentially iterate through smallest to largest integer, that are contained in all these sorted vectors.

One way to do it would be to merge all these sorted vectors into a sorted vector & simply iterate. Thus,

Question 1: What is the fastest way to merge sorted vectors into a sorted vector?

I'm sure on the other hand there are faster / clever ways to accomplish this without merging & re-sorting the whole thing -- perhaps popping the smallest integer iteratively from this collection of sorted vectors; without merging them first.. so:

Question 2: What is the fasted / best way to pop the least element from a bunch of sorted
vector<int>
's?




Based on replies below, and the comments to the question I've implemented an approach where I make a priority queue of iterators for the sorted vectors. I'm not sure if this is performance-efficient, but it seems to be very memory-efficient. I consider the question still open, since I'm not sure we've established the fastest way yet.

// compare vector pointers by integers pointed
struct cmp_seeds {
bool operator () (const pair< vector<int>::iterator, vector<int>::iterator> p1, const pair< vector<int>::iterator, vector<int>::iterator> p2) const {
return *(p1.first) > *(p2.first);
}
};

int pq_heapsort_trial() {

/* Set up the Sorted Vectors */
int a1[] = { 2, 10, 100};
int a2[] = { 5, 15, 90, 200};
int a3[] = { 12 };

vector<int> v1 (a1, a1 + sizeof(a1) / sizeof(int));
vector<int> v2 (a2, a2 + sizeof(a2) / sizeof(int));
vector<int> v3 (a3, a3 + sizeof(a3) / sizeof(int));

vector< vector <int> * > sorted_vectors;
sorted_vectors.push_back(&v1);
sorted_vectors.push_back(&v2);
sorted_vectors.push_back(&v3);
/* the above simulates the "for" i have in my own code that gives me sorted vectors */

pair< vector<int>::iterator, vector<int>::iterator> c_lead;
cmp_seeds mycompare;

priority_queue< pair< vector<int>::iterator, vector<int>::iterator>, vector<pair< vector<int>::iterator, vector<int>::iterator> >, cmp_seeds> cluster_feeder(mycompare);


for (vector<vector <int> *>::iterator k = sorted_vectors.begin(); k != sorted_vectors.end(); ++k) {
cluster_feeder.push( make_pair( (*k)->begin(), (*k)->end() ));
}


while ( cluster_feeder.empty() != true) {
c_lead = cluster_feeder.top();
cluster_feeder.pop();
// sorted output
cout << *(c_lead.first) << endl;

c_lead.first++;
if (c_lead.first != c_lead.second) {
cluster_feeder.push(c_lead);
}
}

return 0;
}

Answer

One option is to use a std :: priority queue to maintain a heap of iterators, where the iterators bubble up the heap depending on the values they point at.

You could also consider using repeating applications of std :: inplace_merge. This would involve appending all the data together into a big vector and remembering the offsets at which each distinct sorted block begins and ends, and then passing those into inplace_merge. This would probably be faster then the heap solution, although I think fundamentally the complexity is equivalent.

Update: I've implemented the second algorithm I just described. Repeatedly doing a mergesort in place. This code is on ideone.

This works by first concatenating all the sorted lists together into one long list. If there were three source lists, this means there are four 'offsets', which are four points in the full list between which the elements are sorted. The algorithm will then pull off three of these at a time, merging the two corresponding adjacent sorted lists into one sorted list, and then remembering two of those three offsets to be used in the new_offsets.

This repeats in a loop, with pairs of adjacent sorted ranges merged together, until only one sorted range remains.

Ultimately, I think the best algorithm would involve merging the shortest pairs of adjacent ranges together first.

// http://stackoverflow.com/questions/9013485/c-how-to-merge-sorted-vectors-into-a-sorted-vector-pop-the-least-element-fro/9048857#9048857
#include <iostream>
#include <vector>
#include <algorithm>
#include <cassert>
using namespace std;

template<typename T, size_t N>
vector<T> array_to_vector( T(*array)[N] ) { // Yes, this works. By passing in the *address* of
                                            // the array, all the type information, including the
                                            // length of the array, is known at compiler. 
        vector<T> v( *array, &((*array)[N]));
        return v;
}   

void merge_sort_many_vectors() {

    /* Set up the Sorted Vectors */ 
    int a1[] = { 2, 10, 100};
    int a2[] = { 5, 15, 90, 200};
    int a3[] = { 12 };

    vector<int> v1  = array_to_vector(&a1);
    vector<int> v2  = array_to_vector(&a2);
    vector<int> v3  = array_to_vector(&a3);


    vector<int> full_vector;
    vector<size_t> offsets;
    offsets.push_back(0);

    full_vector.insert(full_vector.end(), v1.begin(), v1.end());
    offsets.push_back(full_vector.size());
    full_vector.insert(full_vector.end(), v2.begin(), v2.end());
    offsets.push_back(full_vector.size());
    full_vector.insert(full_vector.end(), v3.begin(), v3.end());
    offsets.push_back(full_vector.size());

    assert(full_vector.size() == v1.size() + v2.size() + v3.size());

    cout << "before:\t";
    for(vector<int>::const_iterator v = full_vector.begin(); v != full_vector.end(); ++v) {
            cout << ", " << *v;
    }       
    cout << endl;
    while(offsets.size()>2) {
            assert(offsets.back() == full_vector.size());
            assert(offsets.front() == 0);
            vector<size_t> new_offsets;
            size_t x = 0;
            while(x+2 < offsets.size()) {
                    // mergesort (offsets[x],offsets[x+1]) and (offsets[x+1],offsets[x+2])
                    inplace_merge(&full_vector.at(offsets.at(x))
                                 ,&full_vector.at(offsets.at(x+1))
                                 ,&(full_vector[offsets.at(x+2)]) // this *might* be at the end
                                 );
                    // now they are sorted, we just put offsets[x] and offsets[x+2] into the new offsets.
                    // offsets[x+1] is not relevant any more
                    new_offsets.push_back(offsets.at(x));
                    new_offsets.push_back(offsets.at(x+2));
                    x += 2;
            }
            // if the number of offsets was odd, there might be a dangling offset
            // which we must remember to include in the new_offsets
            if(x+2==offsets.size()) {
                    new_offsets.push_back(offsets.at(x+1));
            }
            // assert(new_offsets.front() == 0);
            assert(new_offsets.back() == full_vector.size());
            offsets.swap(new_offsets);

    }
    cout << "after: \t";
    for(vector<int>::const_iterator v = full_vector.begin(); v != full_vector.end(); ++v) {
            cout << ", " << *v;
    }
    cout << endl;
}

int main() {
        merge_sort_many_vectors();
}
Comments