Kayan Kayan - 4 months ago 7
Bash Question

Average of multiple files without considering missing values using Shell

I have five different files. Part of each file looks as:

ifile1.txt ifile2.txt ifile3.txt ifile4.txt ifile5.txt
2 3 2 3 2
1 2 /no value 2 3
/no value 2 4 3 /no value
3 1 0 0 1
/no value /no value /no value /no value /no value


I need to compute average of these five files without considering missing values. i.e.

ofile.txt
2.4
2.0
3.0
1.0
99999

Here 2.4 = (2+3+2+3+2)/5
2.0 = (1+2+2+3)/4
3.0 = (2+4+3)/3
1.0 = (3+1+0+0+1)/5
99999 = all are missing


I was trying in the following way, but don't feel it is a proper way.

paste ifile1.txt ifile2.txt ifile3.txt ifile4.txt ifile5.txt > ofile.txt
tr '\n' ' ' < ofile.txt > ofile1.txt
awk '!/\//{sum += $1; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile2.txt
awk '!/\//{sum += $2; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile3.txt
awk '!/\//{sum += $3; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile4.txt
awk '!/\//{sum += $4; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile5.txt
awk '!/\//{sum += $5; count++} {print count ? (sum/count) : count;sum=count=0}' ofile1.txt > ofile6.txt
paste ofile2.txt ofile3.txt ofile4.txt ofile5.txt ofile6.txt > ofile7.txt
tr '\n' ' ' < ofile7.txt > ofile.txt

Answer

The following script.awk will deliver what you want:

BEGIN {
    gap = -1;
    maxidx = -1;
}
{
    if (NR != FNR + gap) {
        idx = 0;
        gap = NR - FNR;
    }
    if (idx > maxidx) {
        maxidx = idx;
        count[idx] = 0;
        sum[idx] = 0;
    }
    if ($0 != "/no value") {
        count[idx]++;
        sum[idx] += $0;
    }
    idx++;
}
END {
    for (idx = 0; idx <= maxidx; idx++) {
        if (count[idx] == 0) {
            sum[idx] = 99999;
            count[idx] = 1;
        }
        print sum[idx] / count[idx];
    }
}

You call it with:

awk -f script.awk ifile*.txt

and it allows for an arbitrary number of input files, each with an arbitrary number of lines. It works as follows:


BEGIN {
    gap = -1;
    maxidx = -1;
}

This begin section runs before any lines are processed and it sets the current gap and maximum index accordingly.

The gap is the difference between the overall line number NR and the file line number FNR, used to detect when you switch files, something that's very handy when processing multiple input files.

The maximum index is used to figure out the largest line count so as to output the correct number of records at the end.


{
    if (NR != FNR + gap) {
        idx = 0;
        gap = NR - FNR;
    }
    if (idx > maxidx) {
        maxidx = idx;
        count[idx] = 0;
        sum[idx] = 0;
    }
    if ($0 != "/no value") {
        count[idx]++;
        sum[idx] += $0;
    }
    idx++;
}

The above code is the meat of the solution, executed per line. The first if statement is used to detect whether you've just moved into a new file and it does this simply so it can aggregate all the associated lines from each file. By that I mean the first line in each input file is used to calculate the average for the first line of the output file.

The second if statement adjusts maxidx if the current line number is beyond any previous line number we've encountered. This is for the case where file one may have seven lines but file two has nine lines (not so in your case but it's worth handling anyway). A previously unencountered line number also means we initialise its sum and count to be zero.

The final if statement simply updates the sum and count if the line contains anything other than /no value.

And then, of course, you need to adjust the line number for the next time through.


END {
    for (idx = 0; idx <= maxidx; idx++) {
        if (count[idx] == 0) {
            sum[idx] = 99999;
            count[idx] = 1;
        }
        print sum[idx] / count[idx];
    }
}

In terms of outputting the data, it's a simple matter of going through the array and calculating the average from the sum and count. Notice that, if the count is zero (all corresponding entries were /no value), we adjust the sum and count so as to get 99999 instead. Then we just print the average.


So, running that code over your input files gives, as requested:

$ awk -f script.awk ifile*.txt
2.4
2
3
1
99999