Chris Chris - 6 months ago 10
Perl Question

Analyse text records based on the value of two fields



I am using Perl to test a couple of conditions in each line of an input file. The one-liner below works for most records but not all.

In the current output, lines 2,3, and 5 are correct, but lines 1 and 4 are not, presumably because the

STB
value has two comma-separated values in it instead of one. For instance
STB=0.5,0.645036;
instead of
STB=0.590597;
.

I can not seem to figure out how to apply the same logic to both conditions, that is if
STB >= 0.8
, then "STRAND BIAS" and the
reads
are the value of the
FDP
field.

The input file will have some lines in it with one
STB
value and also some with two.

Perl



perl -ple '/^[^#].*FDP=(\d+);.*STB=(\d+\.\d+);/ and $_.=($2 >= 0.8?" STRAND BIAS ":" GOOD ").$1." reads"' input > out


Input



chr1 93159358 . CT AC,C 51.3482 PASS AF=0,0.538462;AO=4,12;DP=39;FAO=0,21;FDP=39;FR=.;FRO=18;FSAF=0,11;FSAR=0,10;FSRF=15;FSRR=3;FWDB=0.0379899,0.0954749;FXX=0;HRUN=1,5;LEN=2,1;MLLD=22.441,10.1519;OALT=AC,-;OID=.,.;OMAPALT=AC,C;OPOS=93159358,93159359;OREF=CT,T;PB=0.5,0.5;PBP=1,1;QD=5.26648;RBI=0.0698716,0.219287;REFB=-0.0299799,-0.0774582;REVB=0.0586414,0.197411;RO=22;SAF=0,9;SAR=4,3;SRF=17;SRR=5;SSEN=0,0;SSEP=0,0;SSSB=-0.747246,-0.0336118;STB=0.5,0.645036;STBP=1,0.086;TYPE=mnp,del;VARB=0.059091,0.135819;ANN=EVI5 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/2:46:39:39:22:18:4,12:0,21:0,0.538462:4,3:0,9:17:5:0,10:0,11:15:3
chr1 93073228 . C A 142.937 PASS AF=0.4;AO=42;DP=105;FAO=42;FDP=105;FR=.;FRO=63;FSAF=25;FSAR=17;FSRF=28;FSRR=35;FWDB=-0.00213313;FXX=0.00943307;HRUN=2;LEN=1;MLLD=178.966;OALT=A;OID=.;OMAPALT=A;OPOS=93073228;OREF=C;PB=0.5;PBP=1;QD=5.44523;RBI=0.00753887;REFB=-0.0179184;REVB=-0.00723079;RO=63;SAF=25;SAR=17;SRF=28;SRR=35;SSEN=0;SSEP=0;SSSB=0.159972;STB=0.590597;STBP=0.144;TYPE=snp;VARB=0.0207923;ANN=EVI5 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/1:142:105:105:63:63:42:42:0.4:17:25:28:35:17:25:28:35
chr1 93089823 . T C 1038.33 PASS AF=1;AO=110;DP=111;FAO=111;FDP=111;FR=.;FRO=0;FSAF=76;FSAR=35;FSRF=0;FSRR=0;FWDB=0.0247073;FXX=0.00892777;HRUN=2;LEN=1;MLLD=59.5565;OALT=C;OID=.;OMAPALT=C;OPOS=93089823;OREF=T;PB=0.5;PBP=1;QD=37.4173;RBI=0.025038;REFB=-0.0649256;REVB=-0.0040564;RO=1;SAF=75;SAR=35;SRF=1;SRR=0;SSEN=0;SSEP=0;SSSB=-0.00628837;STB=0.5;STBP=1;TYPE=snp;VARB=0.000880627;ANN=EVI5 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 1/1:42:111:111:1:0:110:111:1:35:75:1:0:35:76:0:0
chr11 36596027 . AG AA,A 1031.71 PASS AF=0.121875,0.703125;AO=52,118;DP=333;FAO=39,225;FDP=320;FR=.;FRO=56;FSAF=2,136;FSAR=37,89;FSRF=14;FSRR=42;FWDB=0.0148693,0.00188064;FXX=0.0615818;HRUN=5,5;LEN=1,1;MLLD=11.6837,10.3394;OALT=A,-;OID=.,.;OMAPALT=AA,A;OPOS=36596028,36596028;OREF=G,G;PB=0.5,0.5;PBP=1,1;QD=12.8964;RBI=0.065829,0.11083;REFB=-0.0510698,-0.110624;REVB=0.0641277,0.110814;RO=85;SAF=2,84;SAR=50,34;SRF=17;SRR=68;SSEN=0,0;SSEP=0,0.328125;SSSB=-0.504054,0.394265;STB=0.789128,0.571642;STBP=0.007,0;TYPE=snp,del;VARB=0.0245642,0.134756;ANN=RAG1 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/2:42:333:320:85:56:52,118:39,225:0.121875,0.703125:50,34:2,84:17:68:37,89:2,136:14:42
chr11 95825383 . C T 143.023 PASS AF=0.47561;AO=28;DP=71;FAO=39;FDP=82;FR=.;FRO=43;FSAF=6;FSAR=33;FSRF=40;FSRR=3;FWDB=-0.0301041;FXX=0.0238067;HRUN=1;LEN=1;MLLD=189.321;OALT=T;OID=.;OMAPALT=T;OPOS=95825383;OREF=C;PB=0.5;PBP=1;QD=6.97675;RBI=0.153139;REFB=-0.0165525;REVB=0.150151;RO=43;SAF=6;SAR=22;SRF=40;SRR=3;SSEN=0;SSEP=0;SSSB=-0.666847;STB=0.875258;STBP=0;TYPE=snp;VARB=0.0275999;ANN=MAML2 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/1:143:71:82:43:43:28:39:0.47561:22:6:40:3:33:6:40:3


Current output (lines 2,3, and 5 are correct)



line 1 STB=0.5,0.645036
line 4 STB=0.789128,0.571642




chr1 93159358 . CT AC,C 51.3482 PASS AF=0,0.538462;AO=4,12;DP=39;FAO=0,21;FDP=39;FR=.;FRO=18;FSAF=0,11;FSAR=0,10;FSRF=15;FSRR=3;FWDB=0.0379899,0.0954749;FXX=0;HRUN=1,5;LEN=2,1;MLLD=22.441,10.1519;OALT=AC,-;OID=.,.;OMAPALT=AC,C;OPOS=93159358,93159359;OREF=CT,T;PB=0.5,0.5;PBP=1,1;QD=5.26648;RBI=0.0698716,0.219287;REFB=-0.0299799,-0.0774582;REVB=0.0586414,0.197411;RO=22;SAF=0,9;SAR=4,3;SRF=17;SRR=5;SSEN=0,0;SSEP=0,0;SSSB=-0.747246,-0.0336118;STB=0.5,0.645036;STBP=1,0.086;TYPE=mnp,del;VARB=0.059091,0.135819;ANN=EVI5 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/2:46:39:39:22:18:4,12:0,21:0,0.538462:4,3:0,9:17:5:0,10:0,11:15:3
chr1 93073228 . C A 142.937 PASS AF=0.4;AO=42;DP=105;FAO=42;FDP=105;FR=.;FRO=63;FSAF=25;FSAR=17;FSRF=28;FSRR=35;FWDB=-0.00213313;FXX=0.00943307;HRUN=2;LEN=1;MLLD=178.966;OALT=A;OID=.;OMAPALT=A;OPOS=93073228;OREF=C;PB=0.5;PBP=1;QD=5.44523;RBI=0.00753887;REFB=-0.0179184;REVB=-0.00723079;RO=63;SAF=25;SAR=17;SRF=28;SRR=35;SSEN=0;SSEP=0;SSSB=0.159972;STB=0.590597;STBP=0.144;TYPE=snp;VARB=0.0207923;ANN=EVI5 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/1:142:105:105:63:63:42:42:0.4:17:25:28:35:17:25:28:35 GOOD 105 reads
chr1 93089823 . T C 1038.33 PASS AF=1;AO=110;DP=111;FAO=111;FDP=111;FR=.;FRO=0;FSAF=76;FSAR=35;FSRF=0;FSRR=0;FWDB=0.0247073;FXX=0.00892777;HRUN=2;LEN=1;MLLD=59.5565;OALT=C;OID=.;OMAPALT=C;OPOS=93089823;OREF=T;PB=0.5;PBP=1;QD=37.4173;RBI=0.025038;REFB=-0.0649256;REVB=-0.0040564;RO=1;SAF=75;SAR=35;SRF=1;SRR=0;SSEN=0;SSEP=0;SSSB=-0.00628837;STB=0.5;STBP=1;TYPE=snp;VARB=0.000880627;ANN=EVI5 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 1/1:42:111:111:1:0:110:111:1:35:75:1:0:35:76:0:0 GOOD 111 reads
chr11 36596027 . AG AA,A 1031.71 PASS AF=0.121875,0.703125;AO=52,118;DP=333;FAO=39,225;FDP=320;FR=.;FRO=56;FSAF=2,136;FSAR=37,89;FSRF=14;FSRR=42;FWDB=0.0148693,0.00188064;FXX=0.0615818;HRUN=5,5;LEN=1,1;MLLD=11.6837,10.3394;OALT=A,-;OID=.,.;OMAPALT=AA,A;OPOS=36596028,36596028;OREF=G,G;PB=0.5,0.5;PBP=1,1;QD=12.8964;RBI=0.065829,0.11083;REFB=-0.0510698,-0.110624;REVB=0.0641277,0.110814;RO=85;SAF=2,84;SAR=50,34;SRF=17;SRR=68;SSEN=0,0;SSEP=0,0.328125;SSSB=-0.504054,0.394265;STB=0.789128,0.571642;STBP=0.007,0;TYPE=snp,del;VARB=0.0245642,0.134756;ANN=RAG1 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/2:42:333:320:85:56:52,118:39,225:0.121875,0.703125:50,34:2,84:17:68:37,89:2,136:14:42
chr11 95825383 . C T 143.023 PASS AF=0.47561;AO=28;DP=71;FAO=39;FDP=82;FR=.;FRO=43;FSAF=6;FSAR=33;FSRF=40;FSRR=3;FWDB=-0.0301041;FXX=0.0238067;HRUN=1;LEN=1;MLLD=189.321;OALT=T;OID=.;OMAPALT=T;OPOS=95825383;OREF=C;PB=0.5;PBP=1;QD=6.97675;RBI=0.153139;REFB=-0.0165525;REVB=0.150151;RO=43;SAF=6;SAR=22;SRF=40;SRR=3;SSEN=0;SSEP=0;SSSB=-0.666847;STB=0.875258;STBP=0;TYPE=snp;VARB=0.0275999;ANN=MAML2 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/1:143:71:82:43:43:28:39:0.47561:22:6:40:3:33:6:40:3 STRAND BIAS 82 reads


Desired output



chr1 93159358 . CT AC,C 51.3482 PASS AF=0,0.538462;AO=4,12;DP=39;FAO=0,21;FDP=39;FR=.;FRO=18;FSAF=0,11;FSAR=0,10;FSRF=15;FSRR=3;FWDB=0.0379899,0.0954749;FXX=0;HRUN=1,5;LEN=2,1;MLLD=22.441,10.1519;OALT=AC,-;OID=.,.;OMAPALT=AC,C;OPOS=93159358,93159359;OREF=CT,T;PB=0.5,0.5;PBP=1,1;QD=5.26648;RBI=0.0698716,0.219287;REFB=-0.0299799,-0.0774582;REVB=0.0586414,0.197411;RO=22;SAF=0,9;SAR=4,3;SRF=17;SRR=5;SSEN=0,0;SSEP=0,0;SSSB=-0.747246,-0.0336118;STB=0.5,0.645036;STBP=1,0.086;TYPE=mnp,del;VARB=0.059091,0.135819;ANN=EVI5 GOOD 39 Reads GOOD readsGT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/2:46:39:39:22:18:4,12:0,21:0,0.538462:4,3:0,9:17:5:0,10:0,11:15:3
chr1 93073228 . C A 142.937 PASS AF=0.4;AO=42;DP=105;FAO=42;FDP=105;FR=.;FRO=63;FSAF=25;FSAR=17;FSRF=28;FSRR=35;FWDB=-0.00213313;FXX=0.00943307;HRUN=2;LEN=1;MLLD=178.966;OALT=A;OID=.;OMAPALT=A;OPOS=93073228;OREF=C;PB=0.5;PBP=1;QD=5.44523;RBI=0.00753887;REFB=-0.0179184;REVB=-0.00723079;RO=63;SAF=25;SAR=17;SRF=28;SRR=35;SSEN=0;SSEP=0;SSSB=0.159972;STB=0.590597;STBP=0.144;TYPE=snp;VARB=0.0207923;ANN=EVI5 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/1:142:105:105:63:63:42:42:0.4:17:25:28:35:17:25:28:35 GOOD 105 reads
chr1 93089823 . T C 1038.33 PASS AF=1;AO=110;DP=111;FAO=111;FDP=111;FR=.;FRO=0;FSAF=76;FSAR=35;FSRF=0;FSRR=0;FWDB=0.0247073;FXX=0.00892777;HRUN=2;LEN=1;MLLD=59.5565;OALT=C;OID=.;OMAPALT=C;OPOS=93089823;OREF=T;PB=0.5;PBP=1;QD=37.4173;RBI=0.025038;REFB=-0.0649256;REVB=-0.0040564;RO=1;SAF=75;SAR=35;SRF=1;SRR=0;SSEN=0;SSEP=0;SSSB=-0.00628837;STB=0.5;STBP=1;TYPE=snp;VARB=0.000880627;ANN=EVI5 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 1/1:42:111:111:1:0:110:111:1:35:75:1:0:35:76:0:0 GOOD 111 reads
chr11 36596027 . AG AA,A 1031.71 PASS AF=0.121875,0.703125;AO=52,118;DP=333;FAO=39,225;FDP=320;FR=.;FRO=56;FSAF=2,136;FSAR=37,89;FSRF=14;FSRR=42;FWDB=0.0148693,0.00188064;FXX=0.0615818;HRUN=5,5;LEN=1,1;MLLD=11.6837,10.3394;OALT=A,-;OID=.,.;OMAPALT=AA,A;OPOS=36596028,36596028;OREF=G,G;PB=0.5,0.5;PBP=1,1;QD=12.8964;RBI=0.065829,0.11083;REFB=-0.0510698,-0.110624;REVB=0.0641277,0.110814;RO=85;SAF=2,84;SAR=50,34;SRF=17;SRR=68;SSEN=0,0;SSEP=0,0.328125;SSSB=-0.504054,0.394265;STB=0.789128,0.571642;STBP=0.007,0;TYPE=snp,del;VARB=0.0245642,0.134756;ANN=RAG1 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/2:42:333:320:85:56:52,118:39,225:0.121875,0.703125:50,34:2,84:17:68:37,89:2,136:14:42 GOOD 320 Reads GOOD
chr11 95825383 . C T 143.023 PASS AF=0.47561;AO=28;DP=71;FAO=39;FDP=82;FR=.;FRO=43;FSAF=6;FSAR=33;FSRF=40;FSRR=3;FWDB=-0.0301041;FXX=0.0238067;HRUN=1;LEN=1;MLLD=189.321;OALT=T;OID=.;OMAPALT=T;OPOS=95825383;OREF=C;PB=0.5;PBP=1;QD=6.97675;RBI=0.153139;REFB=-0.0165525;REVB=0.150151;RO=43;SAF=6;SAR=22;SRF=40;SRR=3;SSEN=0;SSEP=0;SSSB=-0.666847;STB=0.875258;STBP=0;TYPE=snp;VARB=0.0275999;ANN=MAML2 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/1:143:71:82:43:43:28:39:0.47561:22:6:40:3:33:6:40:3 STRAND BIAS 82 reads

Answer

It would really be nice to have some more information

On the face of it, as long as you're happy to ignore the second value for STB, you can simply remove the insistence of a semicolon ; after the value. Like this

perl -ple '/^[^#].*FDP=(\d+);.*STB=(\d+\.\d+)/ and $_.=($2 >= 0.8?" STRAND BIAS ":" GOOD ").$1." reads"'

But that's pretty awful Perl code and, as you have found, damned hard to debug

I would prefer something like this: a script which extracts all of the labelled values into a hash and uses them directly from there

I've guessed that you want all of the STB values to be under 0.8 for a GOOD result, so I've used the all function from List::Util to test that. I just split the STB value on commas and create a Boolean $all_ok status variable that indicates whether this is true. This works fine whether there's one value or one thousand

Then printf builds the output string from the components that we've calculated

I've emptied the $line variable so that we can see just what is appended to each line for debugging purposes. Just delete that statement or comment it out for the real run

use strict;
use warnings qw/ all FATAL /;

use List::Util 'all';

while ( <> ) {
    next unless /\S/;
    my @fields = split;

    chomp( my $line = $_ );

    my %values = map { split /=/ } split /;/, $fields[7];
    my $all_ok = all { $_ < 0.8 } split /,/, $values{STB};

    $line = '';  # for debugging

    printf "%s %s %s reads\n", $line, $all_ok ? 'GOOD' : 'STRAND BIAS', $values{FDP};
}

output

 GOOD 39 reads
 GOOD 105 reads
 GOOD 111 reads
 GOOD 320 reads
 STRAND BIAS 82 reads



Update

Now that you've explained that you want a list of status values in the output I can write a better solution. You no longer need List::Util::all, but instead need to create an array of boolean status values from the list of STB data

Don't forget to comment out $line = '' before you run the data for real

It looks like this

use strict;
use warnings qw/ all FATAL /;

use List::Util 'all';

while ( <> ) {
    next unless /\S/;
    my @fields = split;

    chomp( my $line = $_ );

    my %values = map { split /=/ } split /;/, $fields[7];

    my @stb_ok = map { $_ < 0.8 } split /,/, $values{STB};
    my @good   = map { $_ ? 'GOOD' : 'STRAND BIAS' } @stb_ok;

    $line = '';  # for debugging

    printf "%s %s %s reads\n", $line, "@good", $values{FDP};
}

output

 GOOD GOOD 39 reads
 GOOD 105 reads
 GOOD 111 reads
 GOOD GOOD 320 reads
 STRAND BIAS 82 reads