cat · 1 month ago

C Question

To learn C, I'm designing a simple object system, perhaps for a Forth-like language. One of the data structures I've designed is the hashtable, or `hash_t`:

```c
typedef struct {
    array_t* keys;     // intelligent array object
    assoc_t* vals;     // array<pair>
    ssize_t* idxs;     // raw C array
    size_t   idxs_len; // and its length
} hash_t;
```

I've implemented it according to my understanding of this description of Python 3.6's dictionaries:

> a hashtable consists of:
>
> - a non-sparse `array_t` of keys
> - a non-sparse associative array of pairs of values and their keys' hashes
> - a sparse raw array of which hash values map to which actual entries

`{ 1: 'a', 2: 'b', 3: 'c' }` is represented in this structure as:

```
hash->keys = array{ 1, 2, 3 }
hash->vals = assoc{
    pair{ 'a' 5 }
    pair{ 'b' 7 }
    pair{ 'c' 9 }
}
hash->idxs = array{ -1 -1 -1 -1 -1 0 -1 1 -1 2 -1 }
                                   ^    ^    ^
                                   5    7    9
```

where 5, 7, and 9 are the hashes of 1, 2, and 3.

`-1`

`None`

The problem arises when my key,

`1`

`0x340ca71c`

`873,244,444`

`hash->idxs`

`sizeof (ssize_t) * (873,244,444 + 1)`

`8 * 873,244,444 = 6,985,955,552`

Surely, each dictionary I create in Python does not take up millions or even hundreds of thousands of bytes of RAM, yet it appears to be implemented this way, in C. What have I missed?

Answer

Decide how many buckets you want the table to have based on the number of items it will hold, and then reduce each hash into that range.

So if you want a hash to hold around 100,000 items with around 10 items per bucket, you want around 10,000 buckets. So after you compute the hash, take the hash modulo 10,000 to decide which bucket to put the item in.

Generally, using a prime value for the bucket count tends to work best, so perhaps 9,973.