grw - 1 month ago
Bash Question

Finding missing files by checksum

I'm doing a large data migration between two file systems (let's call them F1 and F2) on a Linux system which will necessarily involve copying the data verbatim into a differently-structured hierarchy on F2 and changing the file names.

I'd like to write a script to generate a list of files which are in F1 but not in F2, i.e. the ones which weren't copied by the migration script into the new hierarchy, so that I can go back and migrate them manually. Unfortunately, for reasons not worth going into, the migration script can't be modified to list the files it doesn't migrate. My question differs from this previously answered one because I cannot rely on file names for the comparison.

I know the basic outline of the process would be:

  1. Generate a list of checksums for all files, recursing through F1

  2. Do the same for F2

  3. Compare the lists and generate the set difference of the checksums, ignoring the file names, to find files which are in F1 but not in F2.

I'm kind of stuck getting past that stage, so I'd appreciate any pointers on which tools to use. I think I need to use the 'comm' command to compare the list of file checksums, but since md5sum, sha512sum and the like put the file name next to the checksum, I can't see a way to get it to bring me a useful comparison. Maybe awk is the way to go?

I'm using Red Hat Enterprise Linux 5.x.



You can do something like this:

f1# find yourrootdir -type f -exec sha1sum {} + > initial_files
f1# ...copy initial_files to machine f2...
f1# ...start copy...
f2# find yournewrootdir -type f -exec sha1sum {} + > final_files
f2# sort initial_files > INITIAL
f2# sort final_files > FINAL
f2# for sha1 in $(comm -23 <(awk '{print $1}' INITIAL | uniq) <(awk '{print $1}' FINAL | uniq)); do grep -F -- "$sha1" INITIAL; done

This prints the lines of initial_files (checksum and original path) whose SHA-1 does not appear in final_files, i.e. the F1 files with no identical content on F2.

The last line runs comm on the checksum columns only, then greps INITIAL for each checksum that is missing from FINAL.
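Since you asked whether awk is the way to go: a possible alternative is to let awk do the comparison itself, which avoids the sort and comm steps. This is only a sketch. The function names checksum_tree and missing_by_checksum and the /mnt/f1 and /mnt/f2 paths are made up for illustration, and it assumes file names without newlines or backslashes (GNU sha1sum escapes those, which would need extra handling).

```shell
#!/bin/bash
# Sketch only: function names and paths are hypothetical placeholders.

# Print "checksum  path" lines for every regular file under directory $1.
checksum_tree() {
    find "$1" -type f -exec sha1sum {} +
}

# Print the lines of list $1 whose checksum (first field) is absent from list $2.
# awk reads list $2 first and remembers its checksums, then filters list $1.
missing_by_checksum() {
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
         !($1 in seen)' "$2" "$1"
}

# Example usage:
#   checksum_tree /mnt/f1 > initial_files
#   checksum_tree /mnt/f2 > final_files
#   missing_by_checksum initial_files final_files
```

The comparison looks only at the first field, so renamed or relocated files still match as long as their content is identical. Each checksum in the F2 list is held in memory, which should be fine for ordinary trees but is worth keeping in mind for very large ones.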