discipulus discipulus - 2 months ago 5
Bash Question

Delete the column/field if it contains more than specific number of non-numeric values

Similar to my earlier question Delete the row if it contains more than specific number of non numeric values, I have a data that has non-numeric fields:
I want to delete those columns (fields) that has more than specific number of non-numeric characters (

Off
and
No Data
) in this example. I want to count from 3rd field onwards, i.e, forget the first two rows.

Tag, Description,2015/01/01,2015/01/01 00:01:00,2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value, 2.10000, No Data, 2.7
1871XYZR/KB.RAT,Data from process value, Off , 2.87583, No Data
1962XYMK/KB.GAT,Data from Process Value, No Data, 5 , 3
1867XYST/KB.FAT,Data from process value, 1.05000, 5.87 , 7.80
1871XKZR/KB.VAT,Data from process value, No Data, Off , 2


In the data above (there are no spaces, I have kept it only to show the formatting), I want to delete all those columns that has more than two (3 or more) occurrence of non-numeric characters. Here, Column 3 has two
No Data
and one
Off
, so I want to delete the 3rd Column here. I tried this in a loop without success

awk -F ',' '{
for(n=3; n<=NF; n++)
{
a = $n
if(n!=NF)
{
fmt="%s,";
}
else
{
fmt="%s\n";
}
if(gsub(/No Data|Off/,"",a)<2)
{
printf(fmt,$n);
}
}
}' testfile.txt


I also tried

awk -F, '{for(i=3;i<=NF;i++)if(gsub(/No Data|Off/,"",$i)<3)f=f?f FS $i:$i;print f;f=""}' testfile.txt


but this only deletes the the words
No Data
and
Off
but keeps the fields (columns) intact. My desired output is:

Tag, Description,2015/01/01 00:01:00,2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value, Off ,2.7
1871XYZR/KB.RAT,Data from process value, 2.87583 ,No Data
1962XYMK/KB.GAT,Data from Process Value, 5 ,3
1867XYST/KB.FAT,Data from process value, 5.87 ,7.80
1871XKZR/KB.VAT,Data from process value, Off ,2

Answer

You can use this awk command:

awk -F, '{
   for (i=1; i<=NF; i++) {
      cell[NR,i]=$i
      if (i>2 && $i+0 != $i)
         nn[i]++
   }
   nf=NF
   nr=NR
} 
END {
   for (r=1; r<=nr; r++) {
      for (c=1; c<=nf; c++)
         if (nn[c]<4)
            printf "%s%s", (c>1?FS:""), cell[r,c]
      print ""
   }
}' file

Tag,Description,2015/01/01 00:01:00,2015/01/01 00:02:00
1827XYZR/KB.SAT,Data from Process Value,No Data,2.7
1871XYZR/KB.RAT,Data from process value,2.87583,No Data
1962XYMK/KB.GAT,Data from Process Value,5      ,3
1867XYST/KB.FAT,Data from process value,5.87   ,7.80
1871XKZR/KB.VAT,Data from process value,Off    ,2