Alexander Alexander - 4 years ago 199
Bash Question

Removing Duplicate Files in Unix

I want to delete duplicate files and, at the same time, create a symbolic link in place of each removed duplicate, since I want to retain one copy. So far I can display the duplicate files; the problem is removing them and creating the links.

find "$@" -type f -print0 | xargs -0 -n1 md5sum | sort --key=1,32 | uniq -w 32 -d --all-repeated=separate


1463b527b1e7ed9ed8ef6aa953e9ee81 ./tope5final
1463b527b1e7ed9ed8ef6aa953e9ee81 ./Tests/tope5

2a6dfec6f96c20f2c2d47f6b07e4eb2f ./tope3final
2a6dfec6f96c20f2c2d47f6b07e4eb2f ./Tests/tope3

5baa4812f4a0838dbc283475feda542a ./tope1bfinal
5baa4812f4a0838dbc283475feda542a ./Tests/tope1b

69d7799197049b64f8675ed4500df76c ./tope3afinal
69d7799197049b64f8675ed4500df76c ./Tests/tope3a

945fe30c545fc0d7dc2d1cb279cf9c04 ./Tests/butter6
945fe30c545fc0d7dc2d1cb279cf9c04 ./Tests/tope6

98340fa2af27c79da7efb75ae7c01ac6 ./tope2cfinal
98340fa2af27c79da7efb75ae7c01ac6 ./Tests/tope2c

d15df73b8eaf1cd237ce96d58dc18041 ./tope1afinal
d15df73b8eaf1cd237ce96d58dc18041 ./Tests/tope1a

d5ce8f291a81c1e025d63885297d4b56 ./tope4final
d5ce8f291a81c1e025d63885297d4b56 ./Tests/tope4

ebde372904d6d2d3b73d2baf9ac16547 ./tope1cfinal
ebde372904d6d2d3b73d2baf9ac16547 ./Tests/tope1c

In this case, for example, I want to delete ./tope1cfinal and keep ./Tests/tope1c. After deleting it, I also want to create a symbolic link named ./tope1cfinal pointing to ./Tests/tope1c.
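For a single pair, the desired end state can be sketched by hand as follows. This is only a minimal sketch: it runs in a scratch directory created with mktemp so that no real file is touched, the file contents are made up, and the -r (relative) option of ln assumes GNU coreutils 8.16 or later.

```shell
#!/usr/bin/env bash
set -eu

# Work in a throwaway directory (hypothetical test data, not the real files).
scratch=$(mktemp -d)
cd "$scratch"
mkdir Tests
printf 'same content\n' > Tests/tope1c
cp Tests/tope1c tope1cfinal

# Remove the duplicate, then put a relative symlink in its place.
rm -- ./tope1cfinal
ln -rs -- ./Tests/tope1c ./tope1cfinal

# Show where the new link points.
readlink ./tope1cfinal
```

A relative link keeps working if the whole tree is moved elsewhere, which is presumably why one would prefer -r over an absolute target here.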

Answer Source

One possibility: build an associative array whose keys are MD5 sums and whose values are the first file found with that sum (the one that is kept). Whenever a file's MD5 sum is already in the array, the file is deleted and replaced by a symbolic link to the file stored under that key (after checking that the file to delete is not the stored file itself). The script takes the directories to search as arguments; with no arguments, it searches the current directory.


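#!/bin/bash
# Requires Bash 4+ (associative arrays, globstar) and GNU ln (for -r).

shopt -s globstar nullglob

# With no arguments, search the current directory.
(($#==0)) && set .

# Maps each md5sum to the first file seen with that sum (the one kept).
declare -A md5sum=() || exit 1
while (($#)); do
    # Skip empty arguments (shift first, otherwise the loop never ends).
    [[ $1 ]] || { shift; continue; }
    for file in "$1"/**/*; do
        [[ -f $file ]] || continue
        h=$(md5sum < "$file") || continue
        read -r h _ <<< "$h" # This line is optional: it removes the hyphen from the md5sum output
        if [[ ${md5sum[$h]} ]]; then
            # already seen this md5sum
            [[ "$file" -ef "${md5sum[$h]}" ]] && continue # prevent unwanted removal!
            rm -- "$file" || continue
            ln -rs -- "${md5sum[$h]}" "$file"
        else
            # first time seeing this md5sum: remember the file as the original
            md5sum[$h]=$file
        fi
    done
    shift
done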

(Untested, use at your own risk!)
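Since the script is untested, one way to gain some confidence is to run it on a scratch copy first and then inspect the symbolic links it leaves behind. The sketch below only shows that inspection step. It assumes GNU find (for -printf and -xtype), and the small tree it builds, including the deliberately broken link named dangling, is made-up test data.

```shell
#!/usr/bin/env bash
set -eu

# Build a throwaway tree with one healthy and one broken symlink (hypothetical data).
scratch=$(mktemp -d)
cd "$scratch"
mkdir Tests
printf 'payload\n' > Tests/tope1c
ln -s Tests/tope1c tope1cfinal      # healthy link
ln -s Tests/missing dangling        # deliberately broken link

# List every symlink together with its target.
links=$(find . -type l -printf '%p -> %l\n' | sort)
printf '%s\n' "$links"

# List symlinks whose target does not exist.
broken=$(find . -xtype l)
printf '%s\n' "$broken"
```

After a real run of the deduplication script, the second find would ideally print nothing; any path it does print is a link that lost its original.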
