agc agc - 4 months ago 14
Linux Question

Why would gnu parallel chunking improve gzip's compression size?

File under: "Unexpected Efficiency Dept."

The first 90 million numbers take up about 761MB, as output by:

seq 90000000


According to
man parallel
, it can speed up
gzip
's archiving big files by chopping the input up, and using different CPUs to compress the chunks. So even though
gzip
is single-threaded this technique makes it multi-threaded:

seq 90000000 | parallel --pipe --recend '' -k gzip -9 >bigfile.gz


Took 46 seconds, on an Intel Core i3-2330M (4) @ 2.2GHz.

Pipe that to plain old
gzip
:

seq 90000000 | gzip -9 > bigfile2.gz


Took 80 seconds, on the same CPU. Now the surprise:

ls -log bigfile*.gz


Output:

-rw-rw-r-- 1 200016306 Jul 3 17:27 bigfile.gz
-rw-rw-r-- 1 200381681 Jul 3 17:30 bigfile2.gz


300K larger? That didn't look right. First I checked with
zdiff
if the files had the same contents -- yes, the same. I'd have supposed any compressor would do better with a continuous data stream than a chunked one. Why isn't
bigfile2.gz
smaller than
bigfile.gz
?

Answer

The reason is that for this particular, rather unusual input, smaller deflate blocks are better than larger ones. By default gzip uses larger deflate blocks, since that works best for normal input data. The parallel command is forcing a few smaller deflate blocks by breaking up the input every 1 MB, resulting in a small gain. Though most of the blocks are still the same size.

You can do much better by setting a smaller block size for every block by using zlib's memLevel parameter in deflateInit2(). Here I compress the same output in a single thread each time, using memLevel values from 9 to 2, where a smaller memLevel is a smaller deflate block size (note that zlib does a little better than your gzip at the default level):

  • 9 - 199688429
  • 8 - 198554111 (default)
  • 7 - 191582070
  • 6 - 184880482
  • 5 - 181295029
  • 4 - 180137425 (optimum for this input)
  • 3 - 181176610
  • 2 - 185759115

The optimum memLevel for this data turns out to be 4, for which the compressed data is 12 MB (9%) smaller than for the default memLevel of 8. For memLevel 8, the deflate block size is 16383 symbols, whereas for memLevel 4, the deflate block size is 1023 symbols. One symbol is either a literal byte or a match.

The improvement comes from the extremely regular nature of the input, resulting in a regular sequence of match and literal commands. The smaller the block size, the fewer such distinct commands that appear, which then takes fewer bits to code each of them. This is still true for memLevel 3, but by then the overhead of the code description at the beginning of each deflate block cancels the improvement from fewer distinct codes.

Comments