Thomas Paine Thomas Paine - 3 months ago 5
Perl Question

AWK sum of multiple record's field and combine if equal to variable

I have a challenge with data alignment. In the below example, there are rows that belong at the end of the previous row, and other rows that need to be changed to null.

117@M@7@7@81 1/2@61/2@51/2@41/2@1PAWS@7.40 5w BITE SLOW
117@M@11@5@41/2@31/2@31@3PAWS@21/2@3.40 4w BITE SLOW
124@M@2@2@31/2@41/2@41/2@21/2@33/4@1.30* led 1/16p, BITE SLOW
117@M@3@3@21@21/2@2PAWS@1PAWS@41 3/4 ins, BITE SLOW
117@--@1@6@61/2@51/2@61/2@61/2@5TEETH@26.25 cut BITE SLOW
117@--@4@8@7PAWS@82@81/2@81/2@6TEETH@13.10 5w BITE SLOW
117@M@10@9@91@91/2@9PAWS@91@71 1/4@10.10 4w BITE FAST
124@M@6@4@51/2@71@71@71/2@8TEETH@15.90 3w BITE FAST
117@M@9@1@11/2@11/2@11/2@51@9TEETH@8.70 in BITE FAST
117@M@8@11@11@10PAWS@105@103@10 2 3/4
19.50 6w upper, no response
124@M@5@10@101@11@11@11@11@86.50 off slow, no impact


What I have done so far is create a variable that outputs the most used field count of all the records. I want to use that variable in awk to find the rows that are lacking data or are incorrectly formatted.

FLDCNT=$(sed '/^ *$/d' file | sed 's/^ *//g' |awk -F@ '{print NF}' | sort | uniq -c | sort |awk 'END{print $NF}');


Using something along the lines of:

awk -F@ -v x=$FLDCNT '{if(NF != x && [some code to check record and next record's combined field count = $FLDCNT] ) [add the next row to the end of the current rows fields] print }' file


I can find the rows/records that I need to work with, but I am unsure on how to preform the steps in the "[]" section of the above code.

In the end, the output should be:

117@M@7@7@81 1/2@61/2@51/2@41/2@1PAWS@7.40 5w BITE SLOW
117@M@11@5@41/2@31/2@31@3PAWS@21/2@3.40 4w BITE SLOW
124@M@2@2@31/2@41/2@41/2@21/2@33/4@1.30* led 1/16p, BITE SLOW
117@M@3@3@21@21/2@2PAWS@1PAWS@41 3/4 ins, BITE SLOW@NULL
117@--@1@6@61/2@51/2@61/2@61/2@5TEETH@26.25 cut BITE SLOW
117@--@4@8@7PAWS@82@81/2@81/2@6TEETH@13.10 5w BITE SLOW
117@M@10@9@91@91/2@9PAWS@91@71 1/4@10.10 4w BITE FAST
124@M@6@4@51/2@71@71@71/2@8TEETH@15.90 3w BITE FAST
117@M@9@1@11/2@11/2@11/2@51@9TEETH@8.70 in BITE FAST
117@M@8@11@11@10PAWS@105@103@10 2 3/4@19.50 6w upper, no response
124@M@5@10@101@11@11@11@11@86.50 off slow, no impact


I know from the example that there are some easier solutions, like using an if statement on a known field format. However, I am working with thousands of files, and the field and record counts are different for all of them.

In summary, I am trying to find the most common column count of all the rows, find the rows that don't match that common number, lets call those oddballs, and check if appending the next row to the oddballs will cause the oddballs to have the same column count as the common number, and if so append the rows together.

Answer

I noticed your question is tagged perl, too. So, here's a readable Perl solution:

#!/usr/bin/perl
use warnings;
use strict;

my %count;
my $max = 0;

open my $FH, '<', shift or die $!;
while (<$FH>) {
    my $c = split /@/;
    $count{$c}++;
    $max = $c if $c > $max;
}

warn "The max count ($max) different from the most common\n"
    if grep $_ > $count{$max}, values %count;

seek $FH, 0, 0;
my $leftover = 0;
while (<$FH>) {
    my $c = $leftover + split /@/;
    if ($leftover) {
        print '@';
        if ($c > $max) {
            $c -= $leftover;
            print "NULL\n";
        }
    }

    if ($c != $max) {
        $leftover += $c;
        chomp;
    } else {
        $leftover = 0;
    }
    print;
}