Ben Coughlan - 6 months ago
Linux Question

Awk: reporting duplicate lines across multiple files at once

I have an issue with formatting the output of the script below.

I have duplicate lines across many files (SHORT_LIST.a, SHORT_LIST.b, SHORT_LIST.c), and there can be many more files.

The line "test1" exists in all three files, as does the line "sample".

The line "test" exists in two files but occurs more than once in one of them; I'd like each file name to be printed just once per duplicate line.

function check_duplicates {

    awk 'END {
        for (R in rec) {
            #split out the SHORT_LIST files
            n = split(rec[R], t, "/SHORT_LIST")
            #printf n dup[n]
            count = 0
            if ( n > 2 )
                dup[n] = dup[n] ? dup[n] RS sprintf( R, rec[R]) :
                    sprintf("\t%-20s %s ", R, rec[R]);
        }
        for (D in dup) {
            ((count++))
            printf "%s\n \n", d
            printf count " ). Duplicate record(s) found in the following files: " dup[D]
        }
    }
    {
        # build an array named rec (short for record), indexed by
        # the content of the current record ($0), concatenating
        # the filenames separated by / as values
        rec[$0] = rec[$0] ? rec[$0] "\n \t" FILENAME : FILENAME
    }' $SITEFILES

}

check_duplicates


Current output:

Duplicate records found in the following files:

1 ). Duplicate record(s) found in the following files: test1

SHORT_LIST.a
SHORT_LIST.b
SHORT_LIST.c
sample

2 ). Duplicate record(s) found in the following files: test

SHORT_LIST.c
SHORT_LIST.b
SHORT_LIST.b
SHORT_LIST.b

3 ). Duplicate record(s) found in the following files: /path/to/file

SHORT_LIST.a
SHORT_LIST.c
testa

Desired output:

Duplicate records found in the following files:

1 ). Duplicate record(s) found in the following files: test1

SHORT_LIST.a
SHORT_LIST.b
SHORT_LIST.c

2 ). Duplicate record(s) found in the following files: sample

SHORT_LIST.a
SHORT_LIST.b
SHORT_LIST.c

3 ). Duplicate record(s) found in the following files: test

SHORT_LIST.c
SHORT_LIST.b

4 ). Duplicate record(s) found in the following files: /path/to/file

SHORT_LIST.a
SHORT_LIST.c

5 ). Duplicate record(s) found in the following files: testa

SHORT_LIST.a
SHORT_LIST.c

Any suggestions would be greatly appreciated; I'm having trouble with this level of awk.

Answer
You can follow this template and adjust the output format as desired:

$ awk -f dups.awk fa fb fc

dups for : /path/to/file in files
fa fc
dups for : test in files
fa fb fc
dups for : sample in files
fa fb fc
no dups in
fc

$ cat dups.awk

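# Remember every input file; files with in-file repeats are removed in END.
FNR == 1 { files[FILENAME] }

{
    if ((FILENAME, $0) in a) {
        # Line already seen in this file: flag the file, do not count it again.
        dupsInFile[FILENAME]
    } else {
        a[FILENAME, $0]
        # Append this file name to the list kept for the line.
        dups[$0] = ($0 in dups) ? (dups[$0] FS FILENAME) : FILENAME
        count[$0]++
    }
}

END {
    # A line is a duplicate when it was seen in more than one file.
    for (k in dups)
        if (count[k] > 1) {
            print "dups for : " k " in files"
            print dups[k]
        }
    # Drop the files that repeat a line internally; list the rest.
    for (f in dupsInFile) delete files[f]
    print "no dups in"
    for (f in files) printf "%s", f FS
    printf "\n"
}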

where the sample input is

$ head f{a,b,c}
==> fa <==
test
test
test1
sample
/path/to/file

==> fb <==
test
test
sample

==> fc <==
test
sample
/path/to/file
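
To get closer to the numbered layout asked for in the question, the template can be adapted along these lines. This is only a sketch: the function name check_duplicates is taken from the question, while the array names seen, files, nfiles and order are made up for illustration, and it assumes records should be reported in the order they are first seen.

```shell
# Sketch: report each record that occurs in more than one file,
# listing every file name once per record.
check_duplicates() {
    awk '
    # Handle a (file, record) pair only the first time it is seen,
    # so repeats inside one file do not add the file name again.
    !seen[FILENAME, $0]++ {
        if (!($0 in files)) order[++n] = $0    # remember first-seen order
        files[$0] = ($0 in files) ? files[$0] "\n\t" FILENAME : FILENAME
        nfiles[$0]++
    }
    END {
        for (i = 1; i <= n; i++) {
            rec = order[i]
            if (nfiles[rec] > 1) {             # present in two or more files
                printf "%d ). Duplicate record(s) found in the following files: %s\n\n", ++count, rec
                printf "\t%s\n\n", files[rec]
            }
        }
    }' "$@"
}
```

With the sample files above, check_duplicates fa fb fc reports test, sample and /path/to/file as entries 1 to 3, each followed by its file names one per line, and skips test1 because it occurs in one file only. The script in the question keys dup by the number of files (dup[n]), which appears to be why records with the same file count ended up under one heading; keying every array by the record itself avoids that.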

P.S. Always provide sample input.
