cat cat - 9 months ago 37
C Question

How to design a hashtable that doesn't use so much memory?

To learn C, I'm designing a simple object system, perhaps for a Forth-like language. One of the data structures I've designed is the hashtable:


typedef struct {
array_t* keys; // intelligent array object
assoc_t* vals; // array<pair>

ssize_t* idxs; // raw C array
size_t idxs_len; // and its length

} hash_t;

I've implemented it under my understanding of this description of Python 3.6's dictionaries:

a hashtable consists of:
a non-sparse array_t of keys
a non-sparse associative array of pairs of values and their keys' hashes
a sparse raw array indicating which hash slots map to which actual entries.

{ 1: 'a', 2: 'b', 3: 'c' }

is represented in this structure as:

hash->keys = array{ 1, 2, 3 }
hash->vals = assoc{
    pair{ 'a' 5 }
    pair{ 'b' 7 }
    pair{ 'c' 9 }
}
hash->idxs = array{ -1 -1 -1 -1 -1 0 -1 1 -1 2 -1 }
                                   ^     ^     ^
                                   5     7     9

where 5, 7, and 9 are the hashes of 1, 2, and 3.

Here -1 takes the place of the value used in the Python post to indicate a nonexistent entry.

The problem arises when my key (stringified) hashes to 873,244,444. Thus, the array of key indices (idxs) needs to be sizeof (ssize_t) * (873,244,444 + 1), or 8 * 873,244,445 = 6,985,955,560 bytes, or ~7 GB, which is more RAM than my laptop has, and also more RAM than I want one hashtable to take up.

Surely, each dictionary I create in Python does not take up millions or even hundreds of thousands of bytes of RAM, yet it appears to be implemented this way, in C. What have I missed?

Answer Source

Decide how many buckets you want the hash to have based on the number of items that it will have, and then reduce the hash range to that range.

So if you want a hash to hold around 100,000 items with around 10 items per bucket, you want around 10,000 buckets. So after you compute the hash, take the hash modulo 10,000 to decide which bucket to put the item in.

Generally, using a prime value for the bucket count tends to work best, so perhaps 9,973.