Svalf - 3 months ago
Perl Question

BASH - Summarising information present in 2 genotype data columns in one column (ped file)

I have a PLINK ped file with 36 columns (6+30) that looks like this:

FID IID PID MID SEX PHENO SNP_1a SNP_1b SNP_2a SNP_2b SNP_3a SNP_3b SNP_4a SNP_4b SNP_5a SNP_5b SNP_6a SNP_6b SNP_7a SNP_7b SNP_8a SNP_8b SNP_9a SNP_9b SNP_10a SNP_10b SNP_11a SNP_11b SNP_12a SNP_12b SNP_13a SNP_13b SNP_14a SNP_14b SNP_15a SNP_15b
A1 A1 0 0 1 1 0 0 0 0 2 2 1 2 1 2 1 2 2 1 2 1 1 1 1 2 0 0 0 0 0 0 2 1 2 2
A2 A2 0 0 1 1 1 1 1 1 0 0 1 2 2 2 2 2 1 1 0 0 2 1 2 2 0 0 0 0 0 0 1 1 0 0
A3 A3 0 0 1 2 1 1 1 1 0 0 2 2 2 2 2 2 1 1 0 0 1 1 2 2 0 0 0 0 0 0 1 1 0 0


I am interested in modifying the genotype columns (column 7 onwards) so that:


  • If allele a and/or allele b for a SNP (SNP_#a and/or SNP_#b) is "2": summarise the 2 columns by a single column containing a "2"

  • If both alleles (a and b) for a SNP are "1": summarise it with a "1" in the single column

  • Finally, if both alleles (a and b) for a SNP are "0": summarise it with a "NA"



The output for the example above would therefore contain 21 columns (6+15) and look like this:

FID IID PID MID SEX PHENO SNP_1 SNP_2 SNP_3 SNP_4 SNP_5 SNP_6 SNP_7 SNP_8 SNP_9 SNP_10 SNP_11 SNP_12 SNP_13 SNP_14 SNP_15
A1 A1 0 0 1 1 NA NA 2 2 2 2 2 2 1 2 NA NA NA 2 2
A2 A2 0 0 1 1 1 1 NA 2 2 2 1 NA 2 2 NA NA NA 1 NA
A3 A3 0 0 1 2 1 1 NA 2 2 2 1 NA 1 2 NA NA NA 1 NA


I hope someone can help me, thank you in advance!

Answer
$ cat > test.awk
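NR>1{                                                                   # skip the header line
    for(i=j=7; i<NF; i+=2)                                              # allele pairs: fields 7&8, 9&10, ...
        $(j++) = ($i$(i+1)~/2/) ? "2" : (($i$(i+1)=="11") ? "1" : "NA") # see below *)
    n = j-1                                                             # last summarised field (21 for this input)
    for (i=1; i<=n; i++)                                                # print only the first n fields
        printf "%-2s%s", $i,(i<n?OFS:ORS)                               # pad each field to width 2
}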
$ awk -f test.awk test.in
A1 A1 0  0  1  1  NA NA 2  2  2  2  2  2  1  2  NA NA NA 2  2
A2 A2 0  0  1  1  1  1  NA 2  2  2  1  NA 2  2  NA NA NA 1  NA
A3 A3 0  0  1  2  1  1  NA 2  2  2  1  NA 1  2  NA NA NA 1  NA

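If neither rule 1 (allele a OR allele b is 2) nor rule 2 (allele a AND allele b are 1) matches, the result is NA. That covers 0 0, but also half-missing pairs such as 1 0. The header line is skipped (NR>1), which is why it is absent from the output above.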
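
To see how the fallback behaves on mixed pairs, here is a minimal sketch that applies the same test to one allele pair per input line. The function name classify_pairs is made up for illustration and is not part of the script above.

```shell
# classify_pairs: hypothetical helper; reads "a b" allele pairs, one pair per line
classify_pairs() {
    awk '{
        g = $1 $2                                            # concatenate the two alleles
        r = (g ~ /2/) ? "2" : ((g == "11") ? "1" : "NA")     # same test as in test.awk
        print r
    }'
}

# five pairs: 1 2, 0 2, 1 1, 1 0, 0 0
printf '%s\n' "1 2" "0 2" "1 1" "1 0" "0 0" | classify_pairs
```

The first two pairs give 2, the third gives 1, and the last two (including the half-missing 1 0) give NA.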

*) Concatenate the a and b fields ($i$(i+1); i grows by 2 on every iteration), check the result for a 2 or for 11, and write the outcome over the already-processed columns: the result for fields 7 & 8 is stored in field 7, 9 & 10 in field 8, and so on (j grows by 1 on every iteration).
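
If you also want the header in the output, as in the expected result in the question, the same pair-to-single-column idea can be applied to the first line as well. The following is only a sketch under a few assumptions: the wrapper name summarise_ped is hypothetical, the first-allele header names are assumed to end in "a" (SNP_1a -> SNP_1), and fields are joined with a single OFS instead of being padded to width 2.

```shell
# summarise_ped: hypothetical wrapper; reads a ped-like table with a header on stdin
summarise_ped() {
    awk '
    NR == 1 {                             # header: SNP_1a SNP_1b -> SNP_1
        for (i = j = 7; i < NF; i += 2) {
            name = $i
            sub(/a$/, "", name)           # assumes first-allele names end in "a"
            $(j++) = name
        }
    }
    NR > 1 {                              # data rows: same rule as in test.awk
        for (i = j = 7; i < NF; i += 2) {
            g = $i $(i+1)
            $(j++) = (g ~ /2/) ? "2" : ((g == "11") ? "1" : "NA")
        }
    }
    {                                     # print fields 1..(j-1), drop the leftovers
        out = $1
        for (i = 2; i < j; i++) out = out OFS $i
        print out
    }'
}

# usage: summarise_ped < test.in > test.out
```

Because the number of output fields is taken from j rather than hardcoded, this should work for any even number of allele columns after the first six.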