File under: "Unexpected Efficiency Dept."
The first 90 million numbers take up about 761 MiB, as output by:
seq 90000000 | parallel --pipe --recend '' -k gzip -9 >bigfile.gz
seq 90000000 | gzip -9 > bigfile2.gz
ls -log bigfile*.gz
-rw-rw-r-- 1 200016306 Jul 3 17:27 bigfile.gz
-rw-rw-r-- 1 200381681 Jul 3 17:30 bigfile2.gz
The reason is that for this particular, rather unusual input, smaller deflate blocks are better than larger ones. By default gzip uses larger deflate blocks, since that works best for normal input data. The parallel command forces a few smaller deflate blocks by breaking up the input every 1 MB, compressing each chunk as a separate gzip stream (the concatenation of gzip streams is itself a valid gzip stream). That results in a small gain, though most of the blocks are still the default size.
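To see that effect without parallel, here is a minimal C sketch of my own (the chunkgz.c name and build line are made up for illustration): it compresses stdin as a series of independent gzip members, one per 1 MB chunk, which is essentially what concatenating the outputs of the per-chunk gzip invocations produces.

/* chunkgz.c -- compress stdin as concatenated gzip members, one per
   1 MB chunk, emulating seq ... | parallel --pipe ... gzip -9.
   Build: cc chunkgz.c -lz -o chunkgz */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK (1024 * 1024)     /* parallel --pipe's default --block size */

int main(void) {
    static unsigned char in[CHUNK], out[CHUNK + 1024];
    size_t got;
    while ((got = fread(in, 1, CHUNK, stdin)) > 0) {
        z_stream s;
        memset(&s, 0, sizeof(s));
        /* windowBits 15 + 16 selects a gzip wrapper; level 9 = gzip -9 */
        if (deflateInit2(&s, 9, Z_DEFLATED, 15 + 16, 8,
                         Z_DEFAULT_STRATEGY) != Z_OK)
            return 1;
        s.next_in = in;
        s.avail_in = (uInt)got;
        s.next_out = out;
        s.avail_out = sizeof(out);
        /* each chunk becomes a complete gzip member -- the history is
           reset and the last deflate block ends at the chunk boundary */
        if (deflate(&s, Z_FINISH) != Z_STREAM_END)
            return 1;
        fwrite(out, 1, sizeof(out) - s.avail_out, stdout);
        deflateEnd(&s);
    }
    return 0;
}

Running seq 90000000 | ./chunkgz > bigfile3.gz should come out close in size to bigfile.gz above.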
You can do much better by setting a smaller block size for every block, using zlib's memLevel parameter in deflateInit2(). Here I compress the same output in a single thread each time, using memLevel values from 9 down to 2, where a smaller memLevel gives a smaller deflate block size (note that zlib does a little better than your gzip at the default level).
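If you want to reproduce that sweep, a single-stream sketch along these lines works. The memlevel.c name and command line are my own; deflateInit2() and its memLevel argument are zlib's actual API.

/* memlevel.c -- gzip stdin with a chosen memLevel (1..9).
   Build: cc memlevel.c -lz -o memlevel
   Try:   for m in 9 8 7 6 5 4 3 2; do
              seq 90000000 | ./memlevel $m | wc -c
          done */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

#define BUF (1 << 18)

int main(int argc, char **argv) {
    int memLevel = argc > 1 ? atoi(argv[1]) : 8;   /* zlib's default is 8 */
    static unsigned char in[BUF], out[BUF];
    z_stream s;
    int flush, ret;
    memset(&s, 0, sizeof(s));
    /* same gzip wrapper and compression level 9; only memLevel varies */
    if (deflateInit2(&s, 9, Z_DEFLATED, 15 + 16, memLevel,
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return 1;
    do {
        s.avail_in = (uInt)fread(in, 1, BUF, stdin);
        if (ferror(stdin))
            return 1;
        s.next_in = in;
        flush = feof(stdin) ? Z_FINISH : Z_NO_FLUSH;
        do {                            /* standard zpipe-style drain loop */
            s.next_out = out;
            s.avail_out = BUF;
            ret = deflate(&s, flush);
            fwrite(out, 1, BUF - s.avail_out, stdout);
        } while (s.avail_out == 0);
    } while (flush != Z_FINISH);
    deflateEnd(&s);
    return ret == Z_STREAM_END ? 0 : 1;
}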
The optimum memLevel for this data turns out to be 4, for which the compressed data is 12 MB (9%) smaller than for the default memLevel of 8. For memLevel 8 the deflate block size is 16383 symbols, whereas for memLevel 4 it is 1023 symbols, since zlib ends a block after 2^(memLevel+6) - 1 buffered symbols. One symbol is either a literal byte or a match.
The improvement comes from the extremely regular nature of the input, which produces a regular sequence of match and literal commands. The smaller the block size, the fewer distinct commands appear in it, so fewer bits are needed to code each of them. This is still true at memLevel 3, but by then the overhead of the code description at the start of each deflate block cancels out the gain from having fewer distinct codes.