This system has been up and running for X seconds.
I finally went with the dictionary table idea. For the dictionary id I actually used a 64-bit MurmurHash (and named the column hash_message) so that I can precompute it on the Python side first; as it is a non-cryptographic hash it's pretty cheap to call, and for Python there's a pure C implementation of it.
I preloaded the dictionary table with the 500k most common (and duplicated) log messages out of the 507 million messages I got from a "typical day" of logs.
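The preload could be sketched along these lines; raw_logs is a hypothetical staging table holding the day's messages, with hash_message assumed precomputed client-side before loading:

```sql
-- Sketch: raw_logs is a hypothetical staging table; hash_message is
-- assumed to be computed on the Python side before loading.
INSERT INTO dictionnary (hash_message, message)
SELECT hash_message, min(message)
FROM raw_logs
GROUP BY hash_message
ORDER BY count(*) DESC
LIMIT 500000;
```

Grouping by the hash rather than the raw text keeps the aggregation cheap, since the grouped column is a fixed-width bigint.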
Then I loaded the data with the following:
INSERT INTO my_logs (service_id, creation_date, level, hash_message, message)
SELECT %(hash_robot)s,
       %(creation_date)s,
       %(level)s,
       %(hash_message)s,
       NULLIF(%(message)s, min(message))
FROM dictionnary
WHERE hash_message = %(hash_message)s;
By doing so, the insert deduplicates automatically: NULLIF returns NULL when the incoming message matches the dictionary entry, so the actual text only lands in the message column of my log table when it's not present in the dictionary.
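Reading a log line back then means falling back to the dictionary whenever the stored message is NULL; a query along these lines would reconstruct the full text:

```sql
-- Sketch: rebuild the full message text; when my_logs.message is NULL,
-- the text lives in the dictionnary row with the same hash.
SELECT l.creation_date,
       l.level,
       COALESCE(l.message, d.message) AS message
FROM my_logs l
LEFT JOIN dictionnary d ON d.hash_message = l.hash_message;
```

The LEFT JOIN matters: rows whose message was rare enough to be stored inline still come back even if their hash never made it into the dictionary.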
With this I got an average of only 3 bytes used by the message column of my log table!! Which means that most of the time the message column is NULL, so adding more entries to my dictionary would not be worth it (and that switching my service_id from a bigint to an int would actually be a better idea).
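A back-of-envelope check of that bigint-to-int switch, using the 507 million rows/day figure from above; note this is just the raw column savings, and actual on-disk gains depend on alignment:

```python
# Back-of-envelope: shrinking service_id from bigint (8 bytes) to
# int (4 bytes) across a typical day's volume of rows.
rows_per_day = 507_000_000
saved_bytes = rows_per_day * (8 - 4)
print(f"{saved_bytes / 1024**3:.1f} GiB/day")  # raw column savings
# Caveat: PostgreSQL pads columns for alignment (int is 4-byte
# aligned, bigint 8-byte), so the real saving depends on the
# neighbouring columns in the row layout.
```

Roughly 2 GB a day of raw column data, before accounting for padding, per-row overhead, or TOAST.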