user7092994 user7092994 - 1 month ago 13
C++ Question

Inserting to unordered_map takes too much time

I wrote a program that reads numbers out of a file (about 500,000 of them), and inserts them to a data structure. the numbers are distinct.
I'm inserting the numbers to an

unordered_map
with another struct (using
std::make_pair(myNumber, emptyStruct))
.

And after the insertion of all the numbers, I'm using it to search only a couple of hundred times. I never delete the DS until the program finish running.

After profiling, I've noticed that the insert operation takes about 50% of the running time. (There is also some other code, that runs as many times as the insertion, but it doesn't take so much time).

I thought maybe the resizing takes time, so I used the reserve function with 500,000, but the results are still the same.

As far as I know, this DS should be O(1) insert and search (and the trade off is large memory), so I don't see why it takes so much time to insert. How can I improve my results?

Answer

Since you are specifically not using a value, and merely searching for existence, go with std::unordered_set. It does what you wanted when you made a dummy value to go with every key in the map.

First, I want to re-iterate what everyone said: inserting 500,000 items to use it a few hundred times is going to take up a sizable chunk of your time, and you can't really avoid that, unless you can turn it around -- build a set of the things you are searching for, then search that 500,000 times.

All that said, I was able to get some improvement on the insertion of 500,000 items in a test app, by taking into account the nature of hash tables:

Reviewing http://en.cppreference.com/w/cpp/container/unordered_map, I found these:

[Insert] Complexity: Average case: O(1), worst case O(size())

By default, unordered_map containers have a max_load_factor of 1.0.

When you reserve space for 500000 items, you get 500000 buckets. If you put 500000 pieces of data in 500000 buckets, you are going to get a lot of collisions. I reserved extra space, and it was faster.

If you really need speed, and are willing to get some errors, look into bloom filters.

Comments