C Question

Optimal way for storing compressed sets

As the title says, I am searching for the optimal way of storing sets in memory. I am only interested in sets of bytes (arrays of integers from 0 to 255 where order is not important). It is not required that encoding/decoding be fast. The only necessary thing is that sets should require as little memory as possible.

The first method I came up with is to allocate an array of 256 bits (32 bytes) for each set, where the bit at position n tells whether n is in the set or not. The problem with this approach is that it requires the same amount of memory even if the set is mostly empty (has only a few elements).
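
For concreteness, here is a minimal sketch of this fixed-size bitset representation (the names byteset, byteset_add, and byteset_contains are mine, not from the question; a zero-initialized struct is the empty set):

```c
#include <stdint.h>

/* Fixed-size representation: bit n of bits[] is 1 iff n is in
   the set. Always 32 bytes, however many elements the set has. */
typedef struct { uint8_t bits[32]; } byteset;

static void byteset_add(byteset *s, uint8_t n) {
    s->bits[n >> 3] |= (uint8_t)(1u << (n & 7));
}

static int byteset_contains(const byteset *s, uint8_t n) {
    return (s->bits[n >> 3] >> (n & 7)) & 1;
}
```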

The second approach I tried is to store sets as regular arrays. So, if a set contains n elements, it will require n + 1 bytes to be stored: the first byte holds the number of elements and the remaining bytes hold the elements. But, as we know, order in sets is not important, so something strongly tells me that there must be a way to improve this.
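
A sketch of that layout under the same assumptions (hypothetical names). Note a corner case the question glosses over: a single count byte tops out at 255, so the full 256-element set cannot be represented this way:

```c
#include <stddef.h>
#include <stdint.h>

/* Variable-size representation: buf[0] is the element count n,
   buf[1..n] are the elements, so a set costs n + 1 bytes.
   Caveat: one count byte cannot express n = 256 (the full set). */
static size_t set_encode(const uint8_t *elems, uint8_t n, uint8_t *buf) {
    buf[0] = n;
    for (unsigned i = 0; i < n; i++)
        buf[1 + i] = elems[i];
    return (size_t)n + 1;
}
```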

My third attempt was to enumerate all possible sets and then just store the index of a set (an integer representing its position in the list of all possible sets of bytes). But it turned out to be absolutely equivalent to the first approach: I still need 32 bytes to store any set, so it is not very useful.
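
One way to see the equivalence: the natural index of a set S in the enumeration of all sets is

index(S) = sum of 2^k over all k in S,

which ranges over 0 .. 2^256 - 1 and is exactly the 32-byte bitmask of the first method, read as a single 256-bit integer.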

The fourth attempt I made is based on my second approach. I noticed that if the set contains n elements it will, of course, require n + 1 bytes (if I use my second method). But if, for example, element k already appeared in the set (actually in the array, because in my second attempt I store sets as arrays), then it cannot appear again. Basically, if k appears again, it must mean something different (maybe k - 1). So, I did some optimizations and noticed that I can save some bytes if I encode each next element differently. For example, [3, 3, 5, 7] is interpreted as a set of 3 elements whose values are {3, 4, 5} (every next element is decreased by its index), and [3, 3, 5, 6] is interpreted as {3, 4, 2} (notice that 3 and 4 already exist, so 6 is decreased by 2 and becomes 4, but 4 exists and 3 exists, so it must be 2). But how can this approach actually save bytes? I experimented and realized that I can order the elements in the array so that, in some cases, the high bit is not needed to encode an element, so I save 1 bit per element, which is about n / 16 bytes saved (which is n / 2 * 1 / 8).
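
A sketch of one sorted-and-shifted reading of this trick (my interpretation, not necessarily the asker's exact scheme): after sorting, the i-th smallest element is at least i, so storing elems[i] - i keeps every encoded value in [0, 256 - n], and for n >= 129 that range fits in 7 bits, which frees the high bit the question mentions:

```c
#include <stdint.h>
#include <stdlib.h>

static int cmp_u8(const void *a, const void *b) {
    return (int)*(const uint8_t *)a - (int)*(const uint8_t *)b;
}

/* Sort the set, then store elems[i] - i. Since the elements are
   distinct and sorted ascending, elems[i] >= i, and every stored
   value fits in [0, 256 - n]. For n >= 129 that range fits in
   7 bits, so the high bit of each stored byte is guaranteed to
   be zero and can be reclaimed by a bit-level packer. */
static void encode_shifted(uint8_t *elems, int n, uint8_t *out) {
    qsort(elems, (size_t)n, 1, cmp_u8);
    for (int i = 0; i < n; i++)
        out[i] = (uint8_t)(elems[i] - i);
}
```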

The fifth approach I made is similar to my second approach, but it interprets the number of elements differently. If the number of elements is less than 128, it reads all the elements from the following array in memory as usual. But if the number of elements is greater than 128, it creates a full set and then removes the elements listed in the following array from it. On average, this saves a lot of bytes, but it is still far from optimal.
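
A sketch of that dual encoding with a one-byte header (names and header layout are my own illustration). It also exposes a pigeonhole wrinkle: there are 257 possible set sizes (0 through 256) but only 256 header values, so one size, n = 128, needs special handling that this sketch simply rejects:

```c
#include <stddef.h>
#include <stdint.h>

/* buf[0] < 128 : the next buf[0] bytes are the members.
   buf[0] >= 128: the next (buf[0] - 128) bytes are the bytes
   *absent* from the set (its complement).
   present[] is a 256-entry membership table.
   Cost: min(n, 256 - n) + 1 bytes. */
static size_t encode_min(const int present[256], uint8_t *buf) {
    int n = 0;
    for (int v = 0; v < 256; v++) n += present[v] ? 1 : 0;
    if (n == 128) return 0;           /* not representable here */

    size_t k = 1;
    if (n < 128) {                    /* store members directly */
        buf[0] = (uint8_t)n;
        for (int v = 0; v < 256; v++)
            if (present[v]) buf[k++] = (uint8_t)v;
    } else {                          /* store the complement */
        buf[0] = (uint8_t)(128 + (256 - n));
        for (int v = 0; v < 256; v++)
            if (!present[v]) buf[k++] = (uint8_t)v;
    }
    return k;
}
```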

My last attempt (the sixth) is to enumerate just some sets (for example, create a list of base sets containing: the full set, the set of only even numbers, the set of only odd numbers, the set of elements less than 128, the set of elements greater than 128, etc.) and then use sets from that list together with basic set operations (union, intersection, etc.) to reconstruct the original set. It requires a few bytes for each base set used from the list, a few bits for each union or intersection operation, and of course one byte for the length of the sequence. It depends heavily on the hardcoded list of base sets, and it seems hard to properly build that list and properly choose what belongs in it. Anyway, something tells me this is not a very clever approach.
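
A toy sketch of the dictionary idea, with made-up base sets and a made-up program format, just to make the decoding direction concrete (the hard part, choosing the bases, is exactly what the question leaves open):

```c
#include <stdint.h>
#include <string.h>

enum { OP_UNION, OP_INTERSECT };
typedef struct { uint8_t op, base; } step;

static uint8_t bases[2][32];          /* [0] full set, [1] even numbers */

static void init_bases(void) {
    memset(bases[0], 0xFF, 32);       /* all 256 values           */
    memset(bases[1], 0x55, 32);       /* bits 0, 2, 4, ... => evens */
}

/* Decode: start from the base set of steps[0] (its op is ignored)
   and fold the remaining steps in, one union or intersection at
   a time, over the 32-byte bitmasks. */
static void decode(const step *steps, int nsteps, uint8_t out[32]) {
    memcpy(out, bases[steps[0].base], 32);
    for (int i = 1; i < nsteps; i++)
        for (int b = 0; b < 32; b++)
            out[b] = (steps[i].op == OP_UNION)
                   ? (uint8_t)(out[b] | bases[steps[i].base][b])
                   : (uint8_t)(out[b] & bases[steps[i].base][b]);
}
```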

But what is actually the most optimal way? Something tells me that my fourth attempt is not so bad, but can we do better? The sets I operate with have a random number of elements, so on average 128 elements per set, and I am hoping for a way to spend about 128 bits (16 bytes) per set. The best I have done so far is my fourth approach, which is far away from that goal.

Just to mention again, speed is not important. Encoding/decoding may be extremely slow; the only important thing is that sets require as little memory as possible. When I said "in memory" I meant encoded in memory (compressed). Also, I am interested in as few bits as possible (not only bytes) because I want to store billions of sets compressed on my HDD, so it is important to calculate the average number of bits I need per set so I know how many resources what I want to achieve will require.

P.S. If you want some code (though I really don't see why you would), I can post the solutions I wrote in C for all of these approaches. Anyway, I am not asking for code or for technical details of how to implement this in a specific programming language; I am just asking for a method/algorithm for compressing sets.

Thank you in advance.

Answer

Your first method (and the third method, which is equivalent) is already optimal. It cannot be improved.

There are 2^256 possible sets of the numbers you're working with. By the pigeonhole principle, you need 2^256 distinct identifiers to tell them all apart, and you'll need 256 bits to represent that many identifiers. Any method of identifying the sets which used fewer than 256 bits would leave at least one pair (and probably many pairs) of sets sharing the same identifier.
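
To make the counting concrete: summing over all possible sizes,

C(256, 0) + C(256, 1) + ... + C(256, 256) = 2^256,

so an injective encoding needs at least log2(2^256) = 256 bits in the worst case. And if every set is equally likely, as the question's "random number of elements, on average 128" suggests (a uniformly random subset of 256 values has a Binomial(256, 1/2) size with mean 128), the entropy is exactly 256 bits, so 256 bits is also the unbeatable average, not just the worst case.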
