Pleb - 6 days ago
Bash Question

AWK losing data on array operation

I'm writing a little shell script to list data from a CSV file.
I have the following code, which mostly does the job:

awk 'BEGIN{FS=OFS=";"} {
    k = $7 FS $8 FS $14;
    if($4=="coll"){
        if($1=="2014")
            a[k] += $3
        else if($1=="2015")
            b[k] += $3
        else if($1=="2016")
            c[k] += $3
    }
    else{
        if($1=="2014")
            d[k] += $3
        else if($1=="2015")
            e[k] += $3
        else if($1=="2016")
            f[k] += $3
    }
}
END {
    for (k in a) {
        print k FS a[k] FS d[k] FS b[k] FS e[k] FS c[k] FS f[k];
    }
}' "$file1" > "$file2"


$4 can take two values, and each value can appear several times for the same year, which is why I use arrays indexed by the key k. Field $1 is the year, but not every year has values, and sometimes there is a value for "coll" in $4 but not for the other one. $3 holds a numerical value, and I need its total per year and per $4 value, hence all these if and else if statements.

All my records are printed as long as there is something for 2014. If there's no value for that specific year, I simply lose the data, even if something exists for 2015 and/or 2016.

I don't see why. Can someone show me the light? Thanks!

P.S.: here's some sample data from the file:

2014;U;4;coll;sector;activity;REGION1;1A;;;;;;CBS STRAS;;;;;;;;;;;;;
2014;U;11;adv;sector;activity;REGION1;1A;;;;;;CBS STRAS;;;;;;;;;;;;;
2014;E;19;coll;sector;activity;REGION1;1A;;;;;;CBS STRAS;;;;;;;;;;;;;
2014;E;164;adv;sector;activity;REGION1;1A;;;;;;CBS STRAS;;;;;;;;;;;;;
2015;U;5;coll;sector;activity;REGION1;1A;;;;;;CBS STRAS;;;;;;;;;;;;;
2015;U;70;adv;sector;activity;REGION1;1A;;;;;;CBS STRAS;;;;;;;;;;;;;
2015;E;17;coll;sector;activity;REGION1;1A;;;;;;CBS STRAS;;;;;;;;;;;;;
2015;E;205;adv;sector;activity;REGION1;1A;;;;;;CBS STRAS;;;;;;;;;;;;;
2016;R;3;adv;sector;activity;REGION1;1A;;;;;;IND RET ORG HAG BIS;;;;;;;;;;;;;x

Answer

Your loop says

for (k in a) {

so you'll only use key values that exist in array a[], i.e. were populated by:

if($4=="coll"){
    if($1=="2014")
        a[k] += $3

Change

k = $7 FS $8 FS $14;
...
for (k in a) {

to:

k = $7 FS $8 FS $14;
keys[k]
...
for (k in keys) {

so you create and later loop on the indices of a new array keys[] that contains all of the indices for all of the arrays.
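
Put together, that first fix might look like the sketch below. The file names sample.csv and totals.csv are placeholders invented for this example. The three input records come from the question's sample, trimmed to 14 fields. The output is piped through sort only because the order in which for (k in keys) visits the indices is unspecified in awk.

```shell
# Hypothetical file names, used only for this sketch.
file1=sample.csv
file2=totals.csv

# Three records from the question's sample, trimmed to 14 fields.
cat > "$file1" <<'EOF'
2014;U;4;coll;sector;activity;REGION1;1A;;;;;;CBS STRAS
2015;U;70;adv;sector;activity;REGION1;1A;;;;;;CBS STRAS
2016;R;3;adv;sector;activity;REGION1;1A;;;;;;IND RET ORG HAG BIS
EOF

awk 'BEGIN{FS=OFS=";"} {
    k = $7 FS $8 FS $14
    keys[k]                 # record every key, whatever its year or $4 value
    if ($4=="coll") {
        if ($1=="2014") a[k] += $3
        else if ($1=="2015") b[k] += $3
        else if ($1=="2016") c[k] += $3
    } else {
        if ($1=="2014") d[k] += $3
        else if ($1=="2015") e[k] += $3
        else if ($1=="2016") f[k] += $3
    }
}
END {
    # Loop on keys[], not a[], so keys without 2014 "coll" data are kept.
    for (k in keys)
        print k, a[k], d[k], b[k], e[k], c[k], f[k]
}' "$file1" | sort > "$file2"
```

With that input, totals.csv should hold one line per key, including the key that has no 2014 data:

    REGION1;1A;CBS STRAS;4;;;70;;
    REGION1;1A;IND RET ORG HAG BIS;;;;;;3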

In reality, of course you should be doing something like this instead:

awk 'BEGIN{ FS=OFS=";" }
{
    k = $7 OFS $8 OFS $14
    keys[k]
    foo[$4]
    years[$1]
    a[k,$4,$1] += $3
}
END {
    for (k in keys) {
        printf "%s", k
        for (m in foo) {
            for (year in years) {
                printf "%s%s", OFS, a[k,m,year]
            }
        }
        print ""
    }
}'

or maybe even just a simple loop on a[] depending on what output format you need.
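
As a rough illustration of that last option, the sketch below loops directly on a[] and prints one line per (key, $4 value, year) combination. The "long" output layout and the file name long_totals.csv are assumptions made for this example, not something the question asked for. The composite index is split back into its parts on SUBSEP, and the result is piped through sort because for (idx in a) visits indices in an unspecified order.

```shell
# Three records from the question's sample, trimmed to 14 fields.
printf '%s\n' \
  '2014;U;4;coll;sector;activity;REGION1;1A;;;;;;CBS STRAS' \
  '2014;E;19;coll;sector;activity;REGION1;1A;;;;;;CBS STRAS' \
  '2016;R;3;adv;sector;activity;REGION1;1A;;;;;;IND RET ORG HAG BIS' |
awk 'BEGIN{ FS=OFS=";" }
{
    # One running total per (key, $4 value, year).
    a[$7 OFS $8 OFS $14, $4, $1] += $3
}
END {
    for (idx in a) {
        # awk joins the index parts with SUBSEP, so split on it again:
        # part[1] = key, part[2] = $4 value, part[3] = year
        split(idx, part, SUBSEP)
        print part[1], part[2], part[3], a[idx]
    }
}' | sort > long_totals.csv   # hypothetical output file name
```

Here the two 2014 "coll" records for CBS STRAS are summed into a single line (4 + 19 = 23), and the 2016-only key gets its own line without needing any separate keys[] array.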
