efrem efrem - 1 month ago 6
Linux Question

Extract multiple fields that contain specific words

I have a tab-delimited file that looks like this:

locus_tag="PSE_0001" codon_start=1 transl_table=11 product="Peptidase M23 M37 family protein" protein_id="AEV34513.1" db_xref="GI:359341139" translation="MVDSLASSSDQPARLNGRWLIGTILTGMTSMVLMGGALMAALDGQYTYKTAKAPASNAADLTPQRNTSGKGDRLTSATDGFSNRQIIEVNTVTRSEGRDHVKAKPYALVSASLESFKKQETAADIPPFDPITMYQGEQVAPLQVASDAIYGADIEGEVSISQRDFPLEAMSMVALPDHKEEAVQQQVKKAAMFMLDNSTDIAAIPSVEDINAGFAPLSEQSFENIEVRITEENVSFQPKSRKTTQANQIEERIVPILTQTDFIDILLDGEASETEAEGYIKAFTDNFGIDTIKAGQIFRLSLNTDQIEEDDGILVRVSIYEDQRHVGTIARNDEGEFVVAPEPTTQMAADAFNSQQQNSVGPRATYYDSIYQTGLDNEVPSSLIKELIRIYSYSVDFNASVKSGDEMSVFYGLDADQTTGASEILYTSITVNGRSHRFYRFRTPDDGVVDYYDENGQSAKQFLLRKPIAAGRFTSGFGMRRHPVLKTRRLHTGTDWAAPRGTAIFAAGDGVIQKAAWSGGYGKRVEIKHANGYVTTYNHMTRFATGIQKGQRIRQGTVIGYVGTTGLSTGNHLHYEVKVNGRFVNSLKIKVPQGRVLEAQVLENFKRERDRINALMETGRPSQRVASLRN" GenBank_acc="CP003147"; Source="Pseudovibrio sp. FO-BEG1"; feature_type="CDS"; strand="+";
locus_tag="PSE_0002" codon_start=1 transl_table=11 product="hypothetical protein" protein_id="AEV34514.1" db_xref="GI:359341140" translation="MENVLIYLVGFAGTGKLTIARALAEATSAKVVDNQWINNPIFGLLDHDRLTPYPEGVWRQIDKVREAVLETVATLGAPHASYIFTHEGFEDDASDRQIYEAIRETAQRRKARFLPVRLLCNEDEIAKRVVSPERALRLKSMDPERSRNAVRNSTVLKPNHENELTLDISDKQPADVVVLILEQVAHCKT" GenBank_acc="CP003147"; Source="Pseudovibrio sp. FO-BEG1"; feature_type="CDS"; strand="-";


I would like to extract only the fields that contain specific information:

e.g.

locus_tag
product


to obtain the following tab-delimited result:

locus_tag="PSE_0001" product="Peptidase M23 M37 family protein"
locus_tag="PSE_0002" product="hypothetical protein"


I tried this awk code:

awk '{for(i=1;i<=NF;i++)if ($i~/^locus_tag|^product|db_xref/) print $i}' Chrom.txt| head


But I obtained:

locus_tag="PSE_0001"
codon_start=1
transl_table=11
product="Peptidase
M23
M37
family
protein"
db_xref="GI:359341139"


Any suggestions on how I can fix my code?

Answer

In your code, you don't really do what you asked for:

awk '{for(i=1;i<=NF;i++)if $1~/^locus_tag|^product|db_xref/) print $i}' Chrom.txt

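You didn't ask for db_xref, for instance, and there is a missing parenthesis after the if. Also, if your file is tab-separated, you should add -F"\t" so that awk splits on tabs rather than on any whitespace; otherwise a value such as product="Peptidase M23 M37 family protein" is broken into several fields. Finally, the output is spread over several lines because print appends a newline after each call, so you want printf, which does not add "\n" automatically.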
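To make the print versus printf point concrete, here is a minimal sketch on a made-up two-field line (the input a<TAB>b and the "|" separator are illustrative only, not taken from your file). Giving printf an explicit "%s" format also keeps any literal % in the data from being interpreted as a format specifier.

```shell
# print appends the output record separator (a newline by default) after every call
with_print=$(printf 'a\tb\n' | awk -F'\t' '{for (i=1;i<=NF;i++) print $i}')

# printf emits only what the format string specifies, so the newline is written once per line
with_printf=$(printf 'a\tb\n' | awk -F'\t' '{for (i=1;i<=NF;i++) printf "%s%s", $i, (i<NF ? "|" : "\n")}')

printf '%s\n' "$with_print" "$with_printf"
```

The first command puts a and b on separate lines; the second keeps them on one line, joined by the separator.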

Here is how I would do it:

awk -F"\t" '{for (i=1;i<=NF;i++) {if($i~/^locus_tag=/) printf "%s\t", $i; if($i~/^product=/) printf "%s\n", $i}}' file

Since locus_tag appears first on each line, I print that field followed by a tab; when I find product, I print that field followed by a newline.
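The one-liner depends on locus_tag coming before product on every line. If that order is not guaranteed, one possible variant is to remember both values while scanning the line and print them once at the end. This is only a sketch: the function name extract_two is made up for illustration, and it assumes each field starts with its key followed by "=".

```shell
# Hypothetical helper: prints locus_tag and product for each line, in that order,
# regardless of where they appear in the input line.
extract_two() {
  awk -F'\t' -v OFS='\t' '{
    tag = prod = ""                      # reset for every input line
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^locus_tag=/) tag = $i
      else if ($i ~ /^product=/) prod = $i
    }
    print tag, prod                      # OFS puts a tab between the two values
  }' "$@"
}
```

It would be called as extract_two Chrom.txt, or fed from a pipe with no file argument.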

Edit:

If you have more than two fields to extract (three in this example), you can store them in an array and print them at the end. This assumes every line contains all of the fields:

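awk -F"\t" 'BEGIN{OFS="\t"}
# collect every matching field, in input order
{for (i=1;i<=NF;i++) if($i~/^(locus_tag|product|db_xref)=/) a[++n]=$i}
# print the collected fields three per line, tab-separated
END{for (i=1;i<=n;i+=3) print a[i],a[i+1],a[i+2]}' file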

locus_tag="PSE_0001" product="Peptidase M23 M37 family protein" db_xref="GI:359341139"
locus_tag="PSE_0002" product="hypothetical protein" db_xref="GI:359341140"
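The array approach holds every match in memory until END, and the columns shift if some line lacks one of the fields. A possible per-line alternative is sketched below; the function name extract_keys and its comma-separated key argument are assumptions made for illustration, and it treats everything before the first "=" in a field as the key.

```shell
# Hypothetical helper: first argument is a comma-separated list of keys,
# remaining arguments are input files (stdin if none).
extract_keys() {
  keys=$1; shift
  awk -F'\t' -v keys="$keys" '
    BEGIN { n = split(keys, want, ",") }   # want[1..n] holds the requested keys
    {
      split("", val)                       # clear values left over from the previous line
      for (i = 1; i <= NF; i++) {
        eq = index($i, "=")
        if (eq > 0) val[substr($i, 1, eq - 1)] = $i   # index each field by its key
      }
      line = ""
      for (k = 1; k <= n; k++)
        line = line (k > 1 ? "\t" : "") val[want[k]]  # a missing key leaves an empty column
      print line
    }' "$@"
}
```

For example, extract_keys locus_tag,product,db_xref Chrom.txt would print one tab-separated line per input line, leaving a column empty where a field is absent.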