RootAtShell RootAtShell - 27 days ago 8
C++ Question

Large Data Set for Processing, need to maintain original data set

Here's my problem definition: I have an array of seven million indices, each containing a label. So, for simplicity, here's an example array that I'm dealing with: [1 2 3 3 3 5 2 1 7].

I need to go through this array and every time I come across a label, input the location of the label into a "set" with all others of the same label. With the array being so large, I want to access only a specific label's location at any given point, so let's say, I want to access only the locations of 3 and process those locations and change them to 5's, but I want to do more than just one operation and not only that, I want to do it on all labels, separately. In a small array like in my example, it seems trivial just to stick with the array. However, with an array of seven million points, it is much more time expensive to complete the searching for said labels and then operate on them.

To clear up confusion, using my example, I want the example array to give me the following:


  • 1 mapped to a set containing 0 and 7

  • 2 mapped to a set containing 1 and 6

  • 3 mapped to a set containing 2, 3, and 4

  • 5 mapped to a set containing 5



Originally, I did my processing on the original array and simply operated on the array. This took roughly ~30 seconds to determine the number of corresponding indices for each label (so I was able to determine that the size of 1 was two, size of six was two, size of 3 was three, etc. However, it did not produce the locations of said labels using this method. Therefore, there was added time throughout the rest of my processing finding the locations of each label as well although it was sped up by adding the termination that once it found all the indices of the referenced label, to stop searching.

Next step, I used a
map<int,set<int>>
but this ultimately led to an increase in time to ~100 seconds but decreased time in processing later down the road, but not enough to justify the heavy increase in time.

I haven't implemented it yet, but as an additional step, I am planning on trying to initialize an array of sets, with the indices corresponding to the label and trying to do it this method.

I have also tried hash_maps as well to no avail. Unordered_sets and unordered_maps are not included in the STL in Visual Studio 2005 so I have not implemented the above with these two structures.

Key points:
I have pre-processed the array such that I know the maximum label, and that all labels are consecutive (there are no gaps between the minimum label and the maximum). However, they are randomly dispersed in the original array. This may prove useful in initialization of a set size data structure.
Order of the indices corresponding to the labels is not important. Order of the labels in their given data structure is also not important.

Edit:
For background, the array corresponds to a binary image, and I implemented binary sequential labeling to output an array of same size as the binary image of UINT16 with all binary blobs labeled. What I want to do now is to obtain a map of the points that make up each blob as efficiently as possible.

Answer Source

Why do you use such complicated data structures for that task? Just create a vector of vectors to store all the positions of each label and that's it. And you also can avoid annoying vector memory allocation by pre-processing how much space you need for each label. Something like that:

vector <int> count(N);
for(size_t i = 0; i < N; ++i)
    ++count[dataArray[i]];
vector < vector <int> > labels(N);
vector <int> last(N);
for(size_t i = 0; i < N; ++i)
    labels[i].resize(count[i]);
for(size_t i = 0; i < N; ++i) {
    labels[dataArray[i]][last[dataArray[i]]] = i;
    ++last[dataArray[i]];
}

It will work in O(N) time, what looks like 1 second for your seven million of integers.