Daniel Daniel - 1 month ago 7
Linux Question

How to use AWK command for a variable with multiple entries?

How can I use awk to get the information from an ID with various observations/variables. As an example following, I have two IDs (a,b). Both have 6 observations at different ages with different measurements.

The file is sorted based on ID and ages.
The 'line numbers' are not the part of the actual data file but the headings are!

I'd like to use awk to identify and extract {print} the "measurement" difference between the earliest age and the latest age of every each unique ID. Looking at the following example, for ID (a), I'd like to obtain 50-11=39.

id age measurement
1 a 2 11
2 a 4 20
3 a 6 19
4 a 7 89
5 a 8 43
6 a 12 50
7 b 1 15
8 b 3 23
9 b 5 30
10 b 6 33
11 b 7 45
12 b 10 60


I would highly appreciate if you please explain in details that I could learn.

Answer

Ordered input data

Given that the line numbers are not part of the data but the headings are, the file looks more like:

id age  measurement
a   2   11
a   4   20
a   6   19
a   7   89
a   8   43
a   12  50
b   1   15
b   3   23
b   5   30
b   6   33
b   7   45
b   10  60

This script analyzes that file as desired:

awk 'NR==1    { next }
     $1 != id { if (id != "") print id, id_max - id_min; id = $1; id_min = $3; }
              { id_max = $3 }
     END      { if (id != "") print id, id_max - id_min; }' data

The first line skips the first line of the file.

The second line checks whether the ID has changed; if so, it checks whether there was an old ID, and if so, prints the data. It then stores the current ID and records the measurement for the minimum age.

The third line records the measurement for the current maximum age for the current ID.

The last line prints out the data for the last group in the file, if there was any data in the file.

Sample output:

a 39
b 45

Unordered input data

Even if the data is not sequenced by ID and age, the code can be adapted to work:

awk 'NR==1  {   next }
            {   if (group[$1] == 0)
                {
                    group[$1] = 1
                    order[++sequence] = $1
                    id_age_min[$1] = $2; id_val_min[$1] = $3
                    id_age_max[$1] = $2; id_val_max[$1] = $3
                }
                if ($2 < id_age_min[$1]) { id_age_min[$1] = $2; id_val_min[$1] = $3; }
                if ($2 > id_age_max[$1]) { id_age_max[$1] = $2; id_val_max[$1] = $3; }
            }
     END    {   for (i = 1; i <= sequence; i++)
                {
                    id = order[i];
                    print id, id_val_max[id] - id_val_min[id]
                }
            }' data

This skips the heading line, then tracks groups as they arrive and arranges to print the data in that order (group and order — and sequence). For each row, if the group has not been seen before, set up the data for the current row (the values are both the minimum and maximum). If the age in the current row ($2) is less than the age for the current minimum age (id_min_age[$1]), record the new age and the corresponding value. If the age in the current row is larger than the age for the current maximum age (id_max_age[$1]`), record the new age and the corresponding value.

At the end, for each ID in sequence, print out the ID and the difference between the maximum and minimum value for that ID.

Shuffled data:

id age  measurement
a   12  50
b   10  60
a   4   20
b   3   23
b   5   30
a   7   89
b   6   33
b   7   45
a   8   43
a   6   19
a   2   11
b   1   15

Sample output:

a 39
b 45

It so happens that an a row still appeared before a b row, so the output is the same as before.

Comments