Jan Shamsani - 4 months ago
Bash Question

match multiple columns from gzip files

I would like to match multiple columns between file1.txt and file2.gz without unzipping file2.

file1.txt:
1 11710779 -
1 12919623 CC


file2.gz:

1 13380 . C G 7829.15 VQSRTrancheSNP99.60to99.80 AC=30;AC_AFR=14;AC_AMR=1;AC_Adj=15;AC_EAS=0;AC_FIN=0
1 13382 . C G 320.40 VQSRTrancheSNP99.60to99.80 AC=3;AC_AFR=0;AC_AMR=0;AC_Adj=1;AC_EAS=0;AC_FIN=0;AC
1


I want to match $1,$2,$3 in file1.txt with $1,$2,$4 in file2.gz and return all matching lines from file2.

I tried

awk -F '\t' 'NR==FNR{c[$1$2$4]++;next};c[$1$2$3] > 0' file2.gz file1.txt


and

awk -F '\t' 'NR==FNR{a[$1,$2,$3]++;next} (a[$1,$2,$4])' file1.txt file2.gz


Neither command worked. The entries in file1 do exist in file2: I can find some of them when I grep for them individually.
I'm not sure whether I need to unzip file2 before running the command. I can't unzip the file because it's too big.

Answer
zcat file2.txt.gz | awk -F '\t'  'NR==FNR{a[$1,$2,$3]++;next} a[$1,$2,$4]' file1.txt -

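The two file arguments to awk are file1.txt and -. The second argument, -, tells awk to read from stdin, where zcat supplies the decompressed contents of file2.txt.gz. The decompressed data only passes through the pipe, so file2 is never unzipped on disk. The original attempts failed because awk does not decompress gzip data itself: given the .gz file directly, it reads the compressed bytes.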
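For readers unfamiliar with the NR==FNR idiom, here is a commented sketch of the same approach wrapped in a shell function. The function name match_columns and the array name seen are hypothetical, chosen only for illustration. It assumes both files are tab-separated, uses gzip -dc (which writes the decompressed data to stdout, like zcat), and tests membership with "in" instead of a[...], a small variation that avoids creating empty array entries for keys that do not match.

```shell
# Hypothetical helper: print the lines of a gzipped, tab-separated file whose
# columns 1, 2 and 4 match columns 1, 2 and 3 of a plain-text key file.
match_columns() {
    keys=$1   # plain-text key file, e.g. file1.txt
    gz=$2     # gzip-compressed file, e.g. file2.txt.gz

    # Decompress to stdout and let awk read that stream as its second input (-)
    gzip -dc "$gz" | awk -F '\t' '
        # NR == FNR holds only while awk reads its first file argument (the key file)
        NR == FNR { seen[$1, $2, $3] = 1; next }
        # All later lines come from stdin; print a line if its key was stored
        ($1, $2, $4) in seen
    ' "$keys" -
}
```

It could be called as: match_columns file1.txt file2.txt.gz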

Example

Let's consider these two sample files:

$ cat file1.txt
1       11710779        -
1       12919623        CC
1       13382   C

And:

$ zcat file2.txt.gz
1       13380   .       C       G       7829.15 VQSRTrancheSNP99.60to99.80      AC=30;AC_AFR=14;AC_AMR=1;AC_Adj=15;AC_EAS=0;AC_FIN=0
1       13382   .       C       G       320.40  VQSRTrancheSNP99.60to99.80      AC=3;AC_AFR=0;AC_AMR=0;AC_Adj=1;AC_EAS=0;AC_FIN=0;AC

Now, let's run our command:

$ zcat file2.txt.gz | awk -F '\t'  'NR==FNR{a[$1,$2,$3]++;next} a[$1,$2,$4]' file1.txt -
1       13382   .       C       G       320.40  VQSRTrancheSNP99.60to99.80      AC=3;AC_AFR=0;AC_AMR=0;AC_Adj=1;AC_EAS=0;AC_FIN=0;AC
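To reproduce the example above from scratch, the following sketch builds the two sample files with explicit tab characters. This matters because -F '\t' splits on tabs only; a file separated by plain spaces would be read as a single column and nothing would match. The rows are shortened to six columns for brevity, and the scratch directory and the variable names workdir and result are assumptions made for this illustration.

```shell
# Use a scratch directory so no existing files are overwritten
workdir=$(mktemp -d)

# printf turns \t into a real tab character, which -F '\t' relies on
printf '1\t11710779\t-\n1\t12919623\tCC\n1\t13382\tC\n' > "$workdir/file1.txt"
printf '1\t13380\t.\tC\tG\t7829.15\n1\t13382\t.\tC\tG\t320.40\n' |
    gzip -c > "$workdir/file2.txt.gz"

# Same awk program as in the answer, with gzip -dc in place of zcat
result=$(gzip -dc "$workdir/file2.txt.gz" |
    awk -F '\t' 'NR==FNR{a[$1,$2,$3]++;next} a[$1,$2,$4]' "$workdir/file1.txt" -)

# Print the matching line(s) from the compressed file
printf '%s\n' "$result"
```

Only the row with position 13382 should be printed, since it is the only one whose columns 1, 2 and 4 appear together in file1.txt.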