everestial007 everestial007 - 1 year ago 60
Python Question

How can I do context depended substitution (updating of fields) in the vcf (variant call format) files?

I have a vcf file that looks like this:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 2ms01e 2ms02g 2ms03g 2ms04h

2 15882505 . T A 12134.90 PASS AC=2;AF=0.250;AN=8;BaseQRankSum=-0.021;ClippingRankSum=0.000;DP=695;ExcessHet=3.6798;FS=0.523;MLEAC=2;MLEAF=0.250;MQ=60.00;MQRankSum=0.000;QD=25.18;ReadPosRankSum=1.280;SOR=0.630 GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/1:59,89:148:99:3620,0,2177:1|0:.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.:1452:|:0.5 0/1:125,209:334:99:8549,0,4529:.:.:.:.:. 0/0:130,0:130:99:0,400,5809:.:.:.:.:. 0/0:82,0:82:99:0,250,3702:.:.:.:.:.

2 15882583 . G T 1221.33 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=-2.475;ClippingRankSum=0.000;DP=929;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.125;MQ=60.00;MQRankSum=0.000;QD=9.25;ReadPosRankSum=0.299;SOR=0.686 GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/0:178,0:178:99:0,539,7601:0/0:.:.:0/0:. 0/0:446,0:446:99:0,1343,16290:.:.:.:.:. 0/0:172,0:172:99:0,517,6205:.:.:.:.:. 0/1:75,57:132:99:1253,0,2863:.:.:.:.:.

The very first line is the header (which has other header information infront of it, its removed in here) and the columns are tab separated.

For convenience with understanding the data structure I am sharing a sub-sample of the file in this link (on dropbox which can be downloaded): https://www.dropbox.com/sh/coihujii38t5prd/AABDXv8ACGIYczeMtzKBo0eea?dl=0

Please download the file which is only about 300 Kb and can be open via text editors. This will help to understand the problem better.


I need to do a context dependent substitution (value updates).
- All the header information (tagged with # infront of the line) needs to stay the same.

  • The values for different lines corresponds to the very last header (i.e. CHROM POS ID ....)

  • First of all we need to look into the PG (phased genotype) field in the FORMAT column. Different tags for the fields are separated by ":". And there is a corresponding value for that particular field in the SAMPLE column (which is 2ms01e, for now). So, for the first line the PG value for sample (2ms01e) is 1|0.

  • Now, we will need to look into the GT field in the FORMAT column in the same line and update its corresponding value to the same value as PG. i.e. change 0/1 to 1|0. Its important to keep the order as is in PG (if its 1|0 or 0|1, it needs to be exact).

But, if the PG field has it values 0/1, 0/0, 1/0 or any other (i.e has a slash to it) the GT filed need not be changed (or updated).

Final output:
So, the GT value from the first line of data should change from:





You can see here that only the value for GT field has changed while all the other field values stay the same.

For the second line the GT value stays the same - i.e. 0/0 to 0/0, since the PG value for this line is 0/0 (same at the GT value, so no change).

Easy method:
I think its best if the value from PG field can be copy-pasted to the GT field values in the SAMPLE (2ms01e) column. The GT field value is 1st position and PG filed is 6th position, with different fields separated by ":". So, all we need to do is update the value in the first field with the values from the 6th field.

This easy method might work, since when PG has slash "/" GT will have slash too and order doesn't matter.
But I am not sure if it will work for every line. But, this would be an easy solution, and I can at least check and make sure if it worked.

Hard method:
If easy method does not work as expected I think considering every context becomes important.


Is PG field value with a pipe (|). If yes it needs to be changed.

If there is no PG field in the FORMAT column - then skip it.

The separator of the fields in FORMAT column is ":" .So, is in SAMPLE column. So, counting the distance between field is important. GT and PG fields are 1st and 6th position.

Any kind of solution is appreciated but if its python its better, so I can read and manipulate it if my context changes.
Also, explanation of the given solution would help a great deal.

Thanks in advance and sorry for being so picky. I have very moderate computer skills but still not with programming.

Cheers ! :))


Answer Source
$ cat > another_mess.awk  
$0!="" {
    n=split($10,a,":")               # split $10 at ":" to a array
    if(substr(a[6],2,1)=="|") {      # if "|" in PG
        a[1]=a[6]                    # copy PG to GT
        $10=""                       # empty $10
        for(i=1;i<=n;i++)            # rebuild $10
            $10=$10a[i] (i<n?":":"") # use ":" as delimiter
    print $10           # PRINT $10 TO TEST, CHANGE TO $0                       

$ awk -f another_mess.awk mess.in
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download