Varun M - 2 months ago
Bash Question

Finding items that are common to all the input files

I have a series of files like this:

f1.txt f2.txt f3.txt
A B A
B G B
C H C
D I E
E L G
F M J


I want to find the entries that are common to all three files. In this case the expected output would be B, since that is the only letter that occurs in all three files.

If I had just two files, I could find the common entries using:

comm -1 -2 f1.txt f2.txt

But that doesn't work with multiple files. I thought about something like

sort -u f*.txt > index #to give me the total unique entries


while read i ; do *test if entry is present in all the files* ; done < index


I also thought of chaining comm iteratively:

comm -12 f1.txt f2.txt | comm -12 - f3.txt

but I have 100+ files, so that's not practical. Performance does matter.
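For what it's worth, the chained comm idea does not have to be typed out by hand; it can be folded over any number of files in a loop. The following is only a sketch: the function name common_by_comm and the temporary demo directory are made up for illustration, and it sorts every input itself because comm requires sorted input. It still starts one sort and one comm per file, so it may not be the fastest option for 100+ files.

```shell
# Hypothetical helper: intersect any number of files by folding comm -12.
# The body runs in a subshell ( ... ) so its variables stay private.
common_by_comm() (
    acc=$(mktemp) || exit 1
    next=$(mktemp) || exit 1
    sort -u "$1" > "$acc"            # start with the first file's unique lines
    shift
    for f in "$@"; do
        # keep only the lines present in both the running result and the next file
        sort -u "$f" | comm -12 "$acc" - > "$next"
        mv "$next" "$acc"
    done
    cat "$acc"
    rm -f "$acc" "$next"
)

# Recreate the three sample files from the question in a scratch directory.
demo=$(mktemp -d)
printf '%s\n' A B C D E F > "$demo/f1.txt"
printf '%s\n' B G H I L M > "$demo/f2.txt"
printf '%s\n' A B C E G J > "$demo/f3.txt"

common_by_comm "$demo"/f*.txt        # prints B
```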

EDIT

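I implemented the following:

sort -u f*.txt > index

while read -r i
do
    printf '%s ' "$i"
    # -x matches whole lines only, -F treats the entry as a literal string, not a regex
    grep -cxF -- "$i" f*.txt > temp
    # grep -c prints "file:count" for each file; add up the counts
    awk -F ":" '{a+=$2} END {print a}' temp
done < index | sort -rnk2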


This gives the output:

B 3
G 2
E 2
C 2
A 2
M 1
L 1
J 1
I 1
H 1
F 1
D 1


From here I can see that the number of files is 3 and the number of occurrences of B is 3. Hence it occurs in all the files. I'm still looking for a better solution though.
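The check described above (an entry is common when its count equals the number of files) could be automated instead of read off the table. This is a rough sketch under a few assumptions: the function name common_by_grep and the scratch directory are invented for the example, and it uses grep -l so that each file is counted at most once per entry. It still runs grep once per unique entry, so it is slow on large inputs.

```shell
# Hypothetical helper: print the entries found in every file given as an argument.
common_by_grep() (
    nfiles=$#                         # number of input files
    sort -u "$@" |
    while IFS= read -r entry; do
        # grep -l lists each matching file once; -x and -F match whole lines literally
        hits=$(grep -lxF -- "$entry" "$@" | wc -l)
        if [ "$hits" -eq "$nfiles" ]; then
            printf '%s\n' "$entry"
        fi
    done
)

# Recreate the three sample files in a scratch directory.
demo=$(mktemp -d)
printf '%s\n' A B C D E F > "$demo/f1.txt"
printf '%s\n' B G H I L M > "$demo/f2.txt"
printf '%s\n' A B C E G J > "$demo/f3.txt"

common_by_grep "$demo"/f*.txt         # prints B
```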

Answer
awk '{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' *.txt

The above assumes each value occurs no more than once in a given file, like in your example. If a value CAN occur multiple times in one file then:

awk '!seen[FILENAME,$0]++{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' *.txt

or with GNU awk for true multi-dimensional arrays and ARGIND:

awk '{cnt[$0][ARGIND]} END{for (i in cnt) if (length(cnt[i])==ARGIND) print i}' *.txt
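
To see how the first two variants differ, here is a small test run. The wrapper names common_simple and common_dedup and the scratch files are made up for the demonstration; the awk programs are the ones above, unchanged. The output is piped through sort because awk does not guarantee any order for "for (i in cnt)".

```shell
# Variant 1: assumes no value repeats within a single file.
common_simple() {
    awk '{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' "$@"
}

# Variant 2: counts each value at most once per file.
common_dedup() {
    awk '!seen[FILENAME,$0]++{cnt[$0]++} END{for (i in cnt) if (cnt[i]==(ARGC-1)) print i}' "$@"
}

demo=$(mktemp -d)

# The sample files from the question: both variants agree.
printf '%s\n' A B C D E F > "$demo/f1.txt"
printf '%s\n' B G H I L M > "$demo/f2.txt"
printf '%s\n' A B C E G J > "$demo/f3.txt"
common_simple "$demo"/f*.txt          # prints B
common_dedup "$demo"/f*.txt           # prints B

# Two files where A appears twice in the first one and never in the second.
printf '%s\n' A A B > "$demo/g1.txt"
printf '%s\n' B C   > "$demo/g2.txt"
common_simple "$demo"/g*.txt | sort   # prints A and B: the repeated A is miscounted
common_dedup "$demo"/g*.txt | sort    # prints only B
```

In the g*.txt case the count for A reaches 2 from a single file, which equals the number of files, so the first variant reports it even though g2.txt does not contain it.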