Sh3ljohn - 2 months ago
C++ Question

Data structure for sparse insertion

I am asking this question mostly for confirmation, because I am not an expert in data structures, but I think the structure that suits my need is a hashmap.

Here is my problem (which I guess is typical?):

  • We are looking at pairwise interactions between a large number of objects (say N=90k), so think about the storage as a sparse matrix;

  • There is a process, say (P), which randomly starts from one object, and computes a model which may lead to another object: I cannot predict the pairs in advance, so I need to be able to "create" entries dynamically (arguably the performance is not very critical here);

  • The process (P) may "revisit" existing pairs and update the corresponding element in the matrix: this happens a lot, and therefore I need to be able to find and update as fast as possible.

  • Finally, the process (P) is repeated millions of times, but it only requires write access to the data structure; it does not need to know about the latest "state" of that storage. This feels intuitively like a detail that might be exploited to improve performance, but I don't think hash maps exploit it.

This last point is actually the main reason for my question here: is there a data structure which satisfies the first three points (I'm thinking hash map, correct?), and which would also exploit the last feature for improved performance (I'm thinking something like buffering operations and executing them in bulk asynchronously)?

EDIT: I am working with C++, and would prefer it if there was an existing library implementing that data structure. In addition, I am limited by the system requirements; I cannot use C++11 features.


I would use something like:

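#include <utility>
#include <boost/unordered_map.hpp>

class Data {
public:
    // Adds v to the entry for pair (i, j); a new entry starts at 0.0.
    void update(int i, int j, double v) {
        map[std::pair<int,int>(i, j)] += v;
    }
    void output();  // Prints data somewhere.

private:
    boost::unordered_map<std::pair<int,int>, double> map;
};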

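That will get you going. boost::hash already handles std::pair, but with a different hash map you may need to declare a suitable hash function. You might be able to speed things up by making the key type a 64-bit integer, and using ((uint64_t)i << 32) | (uint32_t)j to make the index (the unsigned casts keep a negative j from sign-extending into the upper half).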
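As a rough sketch of the 64-bit key idea, assuming both indices are non-negative and fit in 32 bits: the helper names make_key, key_row and key_col are made up for illustration, and <stdint.h> is used because <cstdint> is C++11 (boost/cstdint.hpp would be another option).

```cpp
#include <cassert>
#include <stdint.h>  // C99 header; <cstdint> is not available in C++03

// Pack two non-negative indices into one 64-bit key (hypothetical helper).
// i ends up in the upper 32 bits, j in the lower 32 bits.
inline uint64_t make_key(int i, int j) {
    assert(i >= 0 && j >= 0);
    return (static_cast<uint64_t>(i) << 32) | static_cast<uint64_t>(j);
}

// Recover the two indices from a key (hypothetical helpers).
inline int key_row(uint64_t key) { return static_cast<int>(key >> 32); }
inline int key_col(uint64_t key) { return static_cast<int>(key & 0xFFFFFFFFu); }
```

With N=90k both indices are far below 2^31, so the round trip through key_row and key_col is lossless. Whether the integer key is actually faster than the pair key is something to measure rather than assume.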

If nearly all the updates go to a small fraction of the pairs, you could have two maps (small and large), and update the small map directly. Every time the size of small passes a threshold, you would merge its contents into large and clear small. You would need to do some careful testing to see whether this helps. The only reason I think it might help is improved cache locality.
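The small/large scheme could look roughly like this. It is only a sketch: the class name BufferedData, its members and the threshold parameter are invented for illustration, and std::map stands in for the hash map so that it compiles without Boost.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdint.h>

// Two-level accumulator: updates go to small_, which is merged into large_
// whenever it grows past threshold_. All names here are hypothetical.
class BufferedData {
public:
    // Stand-in container; boost::unordered_map<uint64_t, double> offers the
    // same operations used below.
    typedef std::map<uint64_t, double> Map;

    explicit BufferedData(std::size_t threshold) : threshold_(threshold) {}

    void update(int i, int j, double v) {
        small_[make_key(i, j)] += v;
        if (small_.size() > threshold_) flush();
    }

    // Merge the buffered sums into the large map and empty the buffer.
    void flush() {
        for (Map::const_iterator it = small_.begin(); it != small_.end(); ++it)
            large_[it->first] += it->second;
        small_.clear();
    }

    // Total for (i, j); flushes first so buffered updates are included.
    double get(int i, int j) {
        flush();
        Map::const_iterator it = large_.find(make_key(i, j));
        return it == large_.end() ? 0.0 : it->second;
    }

private:
    // Assumes non-negative indices that fit in 32 bits.
    static uint64_t make_key(int i, int j) {
        assert(i >= 0 && j >= 0);
        return (static_cast<uint64_t>(i) << 32) | static_cast<uint64_t>(j);
    }

    std::size_t threshold_;
    Map small_;
    Map large_;
};
```

Because (P) never reads the totals, flush() only has to run when the buffer fills and once more before output. The threshold is a tuning knob: too small and every update pays for a merge, too large and the small map stops fitting in cache.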

Even if you end up using a different data structure, you can keep this class interface, and the rest of the code will be undisturbed. In particular, dropping sparsehash into the same structure will be very easy.
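One way to keep the interface fixed while trying different containers is to make the container a template parameter, so that only a typedef changes. This is a sketch, not the only way to do it: DataT and entries() are hypothetical names, and std::map is used purely so the example compiles without extra libraries.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

// The container type is a template parameter; callers only ever see update().
// Map must provide operator[] with a std::pair<int,int> key and a double value.
template <class Map>
class DataT {
public:
    void update(int i, int j, double v) {
        map_[std::pair<int,int>(i, j)] += v;
    }

    // Read-only view of the stored entries, e.g. for output code.
    const Map& entries() const { return map_; }

private:
    Map map_;
};

// Changing the backend means changing only this line.
typedef DataT<std::map<std::pair<int,int>, double> > Data;
```

Swapping in boost::unordered_map<std::pair<int,int>, double> should only need the typedef changed. With sparsehash you would probably also have to pass a hash functor for the pair key, since its default hash is not the Boost one.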