Michael Michael - 5 months ago 8
Bash Question

Awk for selecting rows with columns that share a value

I have a table resembling the following one (but much longer):

A B C D E
1 YRI_1 YRI_2 10761 0
2 YRI_3 YRI_3 7825 0
3 YRI_1 YRI_4 9880 0
4 YRI_1 Medit_1 79707 0
5 YRI_2 Medit_2 73865 0
6 YRI_2 Medit_3 77165 0
7 YRI_3 Medit_4 76428 0
8 YRI_3 CHB_1 8273 0
9 YRI_2 CHB_2 10668 0
10 YRI_1 CHB_3 8391 0


I would like to obtain:

A B C D E
2 YRI_3 YRI_3 7825 0
4 YRI_1 Medit_1 79707 0
5 YRI_2 Medit_2 73865 0
9 YRI_2 CHB_2 10668 0


i.e. I would like to keep those rows whose numbers in columns B and C only match, for example YRI_1 / Medit_1 both have a "1" so it is a wanted row, but I would not like to keep for example YRI_1 / Medit_10, since this is "10" although it contains a "1".

I tried with awk:

for i in {1..4}
do
awk '$2=="*$i"||$3=="*$i" {print $1,$2,$3,$4,$5}' table > desired_table
done


where $i was supposed to be substituted in each iteration by the next number in the list 1..4, and also I pretended * to mean anything, because I am interested in the number (but I guess this is not the way to do it with awk).

Answer

You can use this awk command:

awk 'split($2, a, /_/) && split($3, b, /_/) && a[2] == b[2]' file

A     B        C          D           E
2     YRI_3    YRI_3      7825        0
4     YRI_1    Medit_1    79707       0
5     YRI_2    Medit_2    73865       0
9     YRI_2    CHB_2      10668       0
  • We use 2 split functions to split $2 and $3 and then compare 2nd fields of split array for equality.
  • split returns number of elements in the resulting array. By using awk 'split($2, a, /_/) && split($3, b, /_/) && we are making sure that split is returning non-zero values.