gsamaras - 10 days ago 6
C++ Question

# How to store binary data when you only care about speed?

I have N points in D dimensions, where let's say N is 1 million and D 1 hundred. All my points have binary coordinates, i.e. {0, 1}^D, and I am only interested in speed.

Currently my implementation uses

`std::vector<int>`
. I am wondering if I could benefit in terms of faster execution by changing my . I am only doing insertions and searches (I don't change the bits).

All related questions I found mention
`std::vector<char>`
,
`std::vector<bool>`
and
`std::bitset`
, but all mention the space benefits one should get by using such structures.

What's the appropriate data structure, when speed is of main concern, for binary data in C++?

I intend to populate my data structure with the binary data and then do a lot of contiguous searches (I mean that I don't really care for the i-th coordinate of a point, if I am accessing a point I will access all of its coordinates continuously). I will compute the Hamming distance between each other.

Locality of reference will likely be the driving force. So it's fairly obvious that you represent the `D` coordinates of a single point as a contiguous bitvector. `std::bitset<D>` would be a logical choice.
However, the next important thing to realize is that you see locality benefits easily up to 4KB. This means that you should not pick a single point and compare it against all other N-1 points. Instead, group points in sets of 4KB each, and compare those groups. Both ways are `O(N*N)`, but the second will be much faster.
You may be able to beat `O(N*N)` by use of the triangle inequality - `Hamming(a,b)+Hamming(b,c) >= Hamming (a,c)`. I'm just wondering how. It probably depends on how you want your output. The naive output would be a N*N set of distances, and that's unavoidably `O(N*N)`.