Geroge Geroge - 10 months ago 44
Python Question

Parsing data in awk

I have a genetics data like this:

MUT1 G_->_A_(het) 44%_(96)___[45%_(49)_/_43%_(47)] rs1799967_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP,MutDB) c.4956G>A 1
MUT1 A_->_G_(homo) 99%_(297)___[99%_(151)_/_99%_(146)] rs206075_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP) c.4563A>G 1
MUT1 G_->_C_(homo) 100%_(259)___[100%_(132)_/_100%_(127)] COSM4147689_(COSMIC),_COSM4147690_(COSMIC),_rs206076_(Gene_file;_1000Genomes;_ClinVar;_ClinVarVCF;_dbSNP) c.6513G>C 2
MUT1 A_->_C_(het) 41%_(103)___[42%_(53)_/_40%_(50)] COSM3753646_(COSMIC),_COSM147663_(COSMIC),_rs144848_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP,MutDB) c.1114A>C 5

What I need to parse this data and extract only some kind of fields.

Require output is:

MUT1 het 44% rs1799967 c.4956G>A 1
MUT1 homo 99% rs206075c.4563A>G 1
MUT1 homo 100% rs206076 c.6513G>C 2
MUT1 het 41% rs144848 c.1114A>C 5

So output shoould be - all first column, from second column only het or hom, third column is only %, fifth column should be extracted only rs_number - this have always different position and last column.

note: I know, that info about homo/het is always in last filed of second column. And % is always on first field in third column.

My solution is:

awk -v OFS="\t" '{print $1,$5,$6,$9,$10,$11}' zkouska.csv | awk -v OFS="\t" 'NR>1{split($2,arr2,"_"); split($3,arr3,"_"); print $1,arr2[4],arr3[1],$4,$5,$6}'

But output is :

BRCA1 (het) 44% rs1799967_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP,MutDB) c.4956G>A 1
BRCA1 (homo) 99% rs206075_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP) c.4563A>G 1
BRCA1 (homo) 100% COSM4147689_(COSMIC),_COSM4147690_(COSMIC),_rs206076_(Gene_file;_1000Genomes;_ClinVar;_ClinVarVCF;_dbSNP) c.6513G>C 2
BRCA1 (het) 41% COSM3753646_(COSMIC),_COSM147663_(COSMIC),_rs144848_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP,MutDB) c.1114A>C 5
BRCA1 (homo) 100% COSM148277_(COSMIC),_COSM3755561_(COSMIC),_rs16942_(Gene_file;_1000Genomes;_ClinVarVCF;_dbSNP) c.3548A>G 5

Still have problem to extract rs from fifth column. Erase quotes in second field. Inpout and output should be TAB separate. Solution could no be only in awk.

Answer Source

Using a combination of gsub and match might be the way to go, here is a portable example:


  gsub(/^[^(]+\(|\)/, "", $2)
  gsub(/_.*/, "", $3)
  match($4, /rs[0-9]+/)
  print $1, $2, $3, substr($4, RSTART, RLENGTH), $5, $6

Run it like this:

awk -f parse.awk OFS='\t' < infile


MUT1    het     44%     rs1799967   c.4956G>A   1
MUT1    homo    99%     rs206075    c.4563A>G   1
MUT1    homo    100%    rs206076    c.6513G>C   2
MUT1    het     41%     rs144848    c.1114A>C   5