
PostgreSQL/Python: compressing a text column with a lot of redundant rows


  • We're using Amazon RDS, so we're very limited on the number of PostgreSQL extensions we can use.

  • I've said PostgreSQL on RDS because 1) we must use AWS, and 2) we want the safest solution in terms of data integrity with the lowest time spent on maintenance (so if another service is better suited to these requirements, we're open to suggestions)

  • we have several terabytes of said data, so space efficiency is important (but we can easily shard based on the source)

We would like to store "logs" in a table with the following minimal set of fields (more may be added for optimization purposes):

  • source

  • time

  • level

  • message

The message column has the following characteristics:

* 99% of the time it is very short (< 64 chars), but the exceptions can be quite long (> 1024 chars)
* some well-identified messages can account for up to 10% of the total number of messages
* a lot of "nearly" duplicated messages, e.g.

    this system has been up and running for X seconds

* a long tail of "unique" messages
* taking the messages of a typical day and running them through gzip easily divided the size by 15

Right now I'm thinking of two possible solutions.

Compression algorithm accepting a dictionary of user-defined "tokens"

The idea would be to use some kind of compression algorithm that can take a user-provided "dictionary" built from our identified list of "repeated text", and to store the compressed result. As our "app" is the only one to write and read these rows, we would be able to decompress "on the fly".

  • Pro: would almost certainly give a pretty good compression ratio

  • Cons: I don't know where to start looking (LZ77? but I don't see how); one possible direction is sketched below
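To illustrate (this is only a direction we are considering, not something we have benchmarked): Python's zlib accepts a preset dictionary via the zdict argument of compressobj/decompressobj, which looks like the kind of thing we have in mind:

    import zlib

    # Illustrative preset dictionary: a concatenation of fragments we know repeat a lot
    # (the real one would be built from a sample day of logs).
    PRESET_DICT = (
        b"this system has been up and running for "
        b" seconds"
    )

    def compress_message(message):
        # One compressor per message so every row stays independently readable.
        compressor = zlib.compressobj(level=9, zdict=PRESET_DICT)
        return compressor.compress(message.encode("utf-8")) + compressor.flush()

    def decompress_message(blob):
        # The decompressor must be handed the exact same preset dictionary.
        decompressor = zlib.decompressobj(zdict=PRESET_DICT)
        return (decompressor.decompress(blob) + decompressor.flush()).decode("utf-8")

The per-message zlib framing adds a few bytes of overhead, which is not negligible for messages under 64 chars, so we would have to measure this against a real day of logs.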

Dictionary table for "exactly" matching predefined messages


The log table would keep:

  • source

  • time

  • level

  • dictionnary_id

  • message (nullable)

and the dictionary table would hold:

  • dictionnary_id

  • message


  • Pro: I can see exactly how to implement it (a rough sketch follows below), and it's easy to "update"

  • Cons: does not cover the "near" matches
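To make the idea concrete, the two tables would look something like this (the column types and the connection string are only placeholders for the sketch):

    import psycopg2  # assumed client library, only used here to run the DDL

    DDL = """
    CREATE TABLE dictionnary (
        dictionnary_id bigserial PRIMARY KEY,
        message        text NOT NULL
    );

    CREATE TABLE my_logs (
        source         text        NOT NULL,
        time           timestamptz NOT NULL,
        level          text        NOT NULL,
        dictionnary_id bigint      REFERENCES dictionnary,
        message        text        -- NULL when the message is in the dictionary
    );
    """

    with psycopg2.connect("dbname=logs") as conn, conn.cursor() as cur:  # placeholder DSN
        cur.execute(DDL)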

Is there already a "state of the art" solution for this kind of problem?


I finally went with the dictionary table idea.

For the dictionnary_id I actually used a 64-bit MurmurHash (and named the column hash_message so that I can precompute it on the Python side first; as it is non-cryptographic, it's pretty cheap to call, and for Python there's a pure C implementation of it in the mmh3 module).
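In Python that boils down to something like this (I keep the first 64-bit half of hash64's 128-bit output so it fits in a bigint column):

    import mmh3  # pure C implementation of MurmurHash3

    def hash_message(message):
        # hash64 returns the 128-bit MurmurHash3 split into two signed 64-bit ints;
        # the first half is enough and fits a PostgreSQL bigint.
        return mmh3.hash64(message)[0]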

I preloaded the dictionnary table with the 500k most common (and duplicated) messages out of the 507 million messages I got from a "typical day" of logs. Then I loaded the data with the following INSERT query:

    INSERT INTO my_logs (source, time, level, hash_message, message)
    VALUES (
        %(source)s, %(time)s, %(level)s, %(hash_message)s,
        NULLIF (
            %(message)s,
            (SELECT message FROM dictionnary WHERE hash_message = %(hash_message)s)
        )
    );

By doing so, the INSERT automatically stores NULL when the message already exists in the dictionnary table, and only stores the actual text in my log table when it's not present there.
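From Python the write path then looks roughly like this (psycopg2 and the extra parameter names are just what I use in this sketch):

    import mmh3
    import psycopg2  # any driver supporting %(name)s parameters would do

    INSERT_QUERY = """
        INSERT INTO my_logs (source, time, level, hash_message, message)
        VALUES (%(source)s, %(time)s, %(level)s, %(hash_message)s,
                NULLIF(%(message)s, (SELECT message FROM dictionnary
                                     WHERE hash_message = %(hash_message)s)));
    """

    def insert_log(conn, source, time, level, message):
        # Same 64-bit MurmurHash as above, precomputed client-side so the
        # NULLIF in the query can decide whether to store the raw text.
        with conn.cursor() as cur:
            cur.execute(INSERT_QUERY, {
                "source": source,
                "time": time,
                "level": level,
                "hash_message": mmh3.hash64(message)[0],
                "message": message,
            })
        conn.commit()  # or commit in batches when bulk loading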

With this I got an average of only 3 bytes used by the message column of my log table! Which means that most of the time the message column is NULL, and thus adding more entries to my dictionnary would not be worth it (and that switching my service_id from a bigint to an int would actually be a better idea).