KJ_ KJ_ - 5 months ago 14
Perl Question

Find matching key in two large sorted text files and compare values (VCF-files)

I am looking for an efficient solution to filter two datasets. Basically, I want to keep only lines that are not missing in one file for their key columns and do not have values "0/0" in both files.

The input data (for those interested, a genomic VCF file that I have simplified for this question) has the following characteristics:


  • Column 1 and 2 together are numerically sorted, unique identifiers

  • Column 3 starts with values 0/0, 0/1 or 1/1



The script would ideally do the following:


  1. Iterate over every line in sample1.dat and look up the same identifier in sample2.dat

  2. If an identifier from sample1.dat is not found in sample2.dat, do nothing

  3. If both lines contain "0/0", do nothing

  4. If one or both lines contain not "0/0", write both lines to their respective outputs
    .



INPUT



sample1.dat

1 1001 0/0:8:8:99:PASS
1 1002 0/0:8:8:99:PASS
1 1003 0/1:5,3:8:99:PASS,PASS
2 1234 0/0:8:8:99:PASS # not present in sample2
2 2345 1/1:8:8:99:PASS


sample2.dat

1 1001 0/0:8:8:99:PASS
1 1002 0/1:5,3:8:99:PASS,PASS
1 1003 0/0:8:8:99:PASS
2 2345 1/1:8:8:99:PASS
2 3456 0/1:8:8:99:PASS # not present in sample1


OUTPUT



sample1_out.dat

1 1002 0/0:8:8:99:PASS
1 1003 0/1:5,3:8:99:PASS,PASS
2 2345 1/1:8:8:99:PASS


sample2_out.dat

1 1002 0/1:5,3:8:99:PASS,PASS
1 1003 0/0:8:8:99:PASS
2 2345 1/1:8:8:99:PASS


In this case, 1-1001 is not printed because they both have values "0/0", and 2-1234 and 2-3456 are not printed because they are not present in both files.

Some notes:


  • The files are about 260GB, but I can easily split them into multiple files that are max 18GB (I split them in chromosomes, basically)

  • Memory available on my machine is about 128GB

  • Column 1 and 2 are together already sorted in numerical order



Any help is greatly appreciated!

Answer

awk to the rescue! Probably you need to split the files first, for each chunk you can do this

$ function f { awk -v OFS='\t' '{print $1"~"$2,$0}' $1; }; 
  join <(f file1) <(f file2) | 
  awk -v OFS='\t' '$4!~/0\/0/ || $7!~/0\/0/ 
                     {print $2,$3,$4 > "file1.out"; 
                      print $5,$6,$7 > "file2.out"}'    

Explanation Let join do the work of matching the corresponding records, but need to create a synthetic key by merging first two fields. The output carries all the info we need, apply your "0/0" logic in the corresponding fields and output sections of the result to the corresponding output files.

$ head file{1,2}.out                          

==> file1.out <==                                                                                                     
1       1002    0/0:8:8:99:PASS
1       1003    0/1:5,3:8:99:PASS,PASS
2       2345    1/1:8:8:99:PASS

==> file2.out <==
1       1002    0/1:5,3:8:99:PASS,PASS
1       1003    0/0:8:8:99:PASS
2       2345    1/1:8:8:99:PASS

You will probably better off parameterizing file names, both in and out.

Comments