Umesh Kacha - 26 days ago
Bash Question

Finding largest file size recursively from a directory

Hi, I have a directory that contains thousands of .gz files, e.g. dir1 has 1.gz, 2.gz, 3.gz and so on. I want to find the largest uncompressed file size without actually uncompressing the files.

I tried the following command, but it is not working:

find . -type f -name '*.gz' | xargs zcat | xargs ls -1s


Please advise. I am new to Bash and Linux. Thanks in advance.

Answer

Interestingly, according to http://www.gzip.org/zlib/rfc-gzip.html:

ISIZE (Input SIZE)
   This contains the size of the original (uncompressed) input data modulo 2^32. 

So the format stores the original size (modulo 2^32, which "ought to be enough for anybody", but of course is not; see the warnings below). Now we just need a command to print it for us: gzip -l file(s). The uncompressed size is the 2nd column of its output.
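To make the ISIZE field concrete, here is a minimal sketch that reads it straight from the end of a file. It relies on the gzip layout from the RFC quoted above, where ISIZE is the last 4 bytes of the file, stored least significant byte first. The function name gz_isize is made up for this illustration, and for a file made of several concatenated gzip members it would only see the last member.

```shell
# Print the ISIZE field of a gzip file: its last 4 bytes, little-endian.
# gz_isize is a hypothetical helper name, not part of gzip.
gz_isize() {
    # od -An -tu1 prints each trailing byte as an unsigned decimal;
    # the unquoted $(...) splits them into $1..$4 (least significant first).
    set -- $(tail -c 4 "$1" | od -An -tu1)
    echo $(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
}
```

Calling gz_isize 1.gz should print the same number that gzip -l shows in its "uncompressed" column for files smaller than 4 GiB.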

Therefore you DO NOT NEED to uncompress the files at all, IF your original files were all smaller than 4 GiB:

find . -type f -name '*.gz' -print0 | xargs -0 gzip -l | awk '{ print $2, $4 ;}'  | grep -v '(totals)$' | sort -n | tail -1

This will be a great deal faster than the other solutions I see here ^^

BUT please be warned: for files of 2^32 bytes or more, the result is only the size modulo 2^32 (so, for example, a file of 2^32 + 1 bytes will be reported as having a size of 1 byte!). So if you have compressed files that were originally 4 GiB or larger, you need to uncompress them (on the fly, if you want) to get their real size!
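For that case, a rough sketch of the on-the-fly approach could look like the following. It decompresses every file to a pipe and counts the bytes, so nothing is written to disk, but it is much slower because all the data has to be inflated. The function name largest_gz_exact is invented for this example.

```shell
# Print "<bytes> <path>" for the .gz file with the largest real
# uncompressed size under the given directory (default: current dir).
# largest_gz_exact is a hypothetical helper name.
largest_gz_exact() {
    find "${1:-.}" -type f -name '*.gz' -exec sh -c '
        for f do
            # wc -c counts the decompressed bytes; $(( )) strips any padding
            printf "%s %s\n" "$(( $(gzip -dc "$f" | wc -c) ))" "$f"
        done' sh {} + | sort -n | tail -n 1
}
```

Run it as largest_gz_exact dir1. Since the size is counted rather than read from the trailer, it does not wrap at 4 GiB.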

Edit: I tried to see whether the ratio could be used instead of the "original size modulo 2^32": no...

$ dd if=/dev/zero of=1_gb bs=1048576  count=1024    # creating a 1 GiB file
$ dd if=/dev/zero of=5_gb bs=1048576  count=5120    # creating a 5 GiB file
$ gzip 1_gb 5_gb                                    # compressing both files
$ ls -al *gb*
-rw-r--r--    1 user  UsersGrp   1042074 Mar  4 10:30 1_gb.gz
-rw-r--r--    1 user  UsersGrp   5210215 Mar  4 10:28 5_gb.gz
$ gzip -l *gb*
compressed        uncompressed  ratio uncompressed_name
   1042074          1073741824  99.9% 1_gb
   5210215          1073741824  99.5% 5_gb   
   6252289          2147483648  99.7% (totals)

 (Notice the 2nd line: the uncompressed size shown is not 5 GiB but 1 GiB, because it is stored modulo 2^32 (= 4 GiB) :( )

=> the ratio is unusable too for files of 4 GiB or more. (5 GiB / 5210215 ≈ 1030, and 1 GiB / 1042074 ≈ 1030 too, so the two ratios should be the same. But it seems gzip computes the ratio from the "uncompressed" field, and not from the original size itself.)
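The arithmetic behind that listing can be checked with a few lines of shell. This is only a sketch: it assumes gzip derives the ratio as 1 - compressed/uncompressed, which is consistent with the numbers shown above but is not taken from gzip's documentation. The variable and function names are made up for the illustration.

```shell
compressed=5210215                     # size of 5_gb.gz from the listing
real=$(( 5 * 1024 * 1024 * 1024 ))     # true uncompressed size: 5 GiB
stored=$(( real % 4294967296 ))        # what ISIZE can hold: size mod 2^32

# ratio COMPRESSED UNCOMPRESSED: assumed formula 1 - compressed/uncompressed
ratio() {
    awk -v c="$1" -v u="$2" 'BEGIN { printf "%.1f%%\n", (1 - c / u) * 100 }'
}

echo "stored size:            $stored"                          # 1073741824
echo "ratio from stored size: $(ratio "$compressed" "$stored")" # 99.5%
echo "ratio from real size:   $(ratio "$compressed" "$real")"   # 99.9%
```

The 99.5% figure matches what gzip -l printed for 5_gb, which supports the guess that the ratio is built from the wrapped size.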