mossaab - 4 months ago
Linux Question

Merge sort gzipped files

I have 40 files of 2GB each, stored on an NFS architecture. Each file contains two columns: a numeric id and a text field. Each file is already sorted and gzipped.

How can I merge all of these files so that the resulting output is also sorted?

I know that

sort -m -k 1

should do the trick for uncompressed files, but I don't know how to do it directly with the compressed ones.

PS: I don't want the simple solution of uncompressing the files to disk, merging them, and compressing the result again, as I don't have enough disk space for that.


This is a use case for process substitution. Say you have two files to sort, sorta.gz and sortb.gz. You can give the output of gunzip -c FILE.gz to sort for both of these files using the <(...) shell operator:

sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz) >sorted

Process substitution substitutes a command with a file name that represents the output of that command, and is typically implemented with either a named pipe or a /dev/fd/... special file.
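As a small illustration of that behaviour, here is a sketch that builds two tiny gzipped files in a temporary directory and merges them through process substitution. The sample rows and the temporary directory are made up for the demonstration, and -n -k1,1 assumes the ids were sorted numerically; use whatever key options the files were originally sorted with. It requires a shell that supports <(...), such as bash.

```bash
#!/usr/bin/env bash
# Hypothetical sample data standing in for sorta.gz and sortb.gz.
tmp=$(mktemp -d)
printf '1\talpha\n3\tgamma\n' | gzip -c > "$tmp/sorta.gz"
printf '2\tbeta\n4\tdelta\n'  | gzip -c > "$tmp/sortb.gz"

# A process substitution expands to a path; the exact path depends on the system.
fd_path=$(echo <(true))
echo "substituted path: $fd_path"

# sort -m reads each decompressed stream through such a path, so nothing
# uncompressed is written to disk by this command.
# Assumption: ids are in numeric order, hence -n -k1,1.
merged=$(sort -m -n -k1,1 <(gunzip -c "$tmp/sorta.gz") <(gunzip -c "$tmp/sortb.gz"))
printf '%s\n' "$merged"

rm -rf "$tmp"
```

The first echo shows that sort only ever sees a file name, which is why it needs no knowledge of gzip.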

For 40 files, you will want to create the command with that many process substitutions dynamically, and use eval to execute it:

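cmd="sort -m -k1 "
for input in file1.gz file2.gz file3.gz ...; do
    # Append one process substitution per input file.
    # Note: a file name containing a single quote would break this quoting.
    cmd="$cmd <(gunzip -c '$input')"
done
eval "$cmd" >sorted       # or eval "$cmd" | gzip -c > sorted.gz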
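If you would rather avoid eval, one possible alternative (not the approach above, but the same idea done by hand) is to create the named pipes yourself with mkfifo. The sketch below uses three made-up sample files in a temporary directory in place of the real inputs; the part*.gz names and the -n -k1,1 key (which assumes numerically sorted ids) are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Sketch: merge gzipped, pre-sorted files through explicit named pipes.
work=$(mktemp -d)

# Hypothetical sample inputs standing in for the real files.
printf '1\talpha\n4\tdelta\n'  | gzip -c > "$work/part1.gz"
printf '2\tbeta\n5\tepsilon\n' | gzip -c > "$work/part2.gz"
printf '3\tgamma\n6\tzeta\n'   | gzip -c > "$work/part3.gz"

# Create one FIFO per input and start a background gunzip writing into it.
# Each writer blocks until sort opens the FIFO for reading.
set --
for input in "$work"/part*.gz; do
    fifo="${input%.gz}.fifo"
    mkfifo "$fifo"
    gunzip -c "$input" > "$fifo" &
    set -- "$@" "$fifo"
done

# sort -m treats the FIFOs like ordinary files; the result is compressed on the fly.
sort -m -n -k1,1 "$@" | gzip -c > "$work/merged.gz"
wait    # reap the background decompressors

merged=$(gunzip -c "$work/merged.gz")
printf '%s\n' "$merged"
rm -rf "$work"
```

One caveat for either approach when there are 40 inputs: GNU sort merges at most 16 inputs at once by default and uses temporary files for the rest, which may matter when disk space is tight. With GNU sort, --batch-size=40 raises that limit, subject to the open-file limit of the system.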