eleanora eleanora - 3 months ago 11
Perl Question

Filter a smaller file using another huge file

I have a huge CSV file with about 10^9 lines, where each line holds a pair of IDs such as:

IDa,IDb
IDb,IDa
IDc,IDd


Call this file1. I also have a much smaller CSV file, with about 10^6 lines, in the same format. Call this file2.

I simply want to find the lines in file2 that contain at least one ID that appears somewhere in file1.

Is there a fast way to do this? I don't mind whether it is in awk, Python or Perl.

Answer
$ cat > file2 # make a test file2
IDb,IDa
$ awk -F, 'NR==FNR{a[$1];a[$2];next} ($1 in a&&++a[$1]==1){print $1} ($2 in a&&++a[$2]==1){print $2}' file2 file1 > file3
$ cat file3 # the file2 IDs that occur in file1 are now in file3
IDa
IDb
$ awk -F, 'NR==FNR{a[$1];next} ($1 in a)||($2 in a){print $0}' file3 file2
IDb,IDa
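
For a slightly larger check, the two passes above can be sketched as a self-contained script. The sample rows, the temporary directory and the `result` file name are assumptions made for illustration and are not from the original post. Only the IDs from file2 (at most about 2×10^6 for the sizes in the question) are held in memory, while file1 is streamed once.

```shell
#!/bin/sh
# Hypothetical sample data (not from the original post), built in a temp dir.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# file1 stands in for the huge file, file2 for the much smaller one.
printf '%s\n' 'IDa,IDb' 'IDb,IDa' 'IDc,IDd' > file1
printf '%s\n' 'IDb,IDx' 'IDy,IDz' 'IDq,IDc' > file2

# Pass 1: remember every ID in file2, then stream file1 and print each
# remembered ID the first time it shows up there.
awk -F, 'NR==FNR{a[$1];a[$2];next}
         ($1 in a && ++a[$1]==1){print $1}
         ($2 in a && ++a[$2]==1){print $2}' file2 file1 > file3

# Pass 2: load the matched IDs from file3, then keep the file2 lines
# where either field is one of them.
awk -F, 'NR==FNR{a[$1];next} ($1 in a)||($2 in a)' file3 file2 > result

cat result
```

With this sample data the script prints `IDb,IDx` and `IDq,IDc`, the two file2 lines that share an ID with file1, and drops `IDy,IDz`.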