Kayan Kayan - 7 months ago 9
Bash Question

How to delete rows whose column 2 and column 3 values match those of a previous row using awk?

I have a file with 4 columns:

ifile.txt
3 5 2 2
1 4 2 1
4 5 7 2
5 5 7 1
0 0 1 1
3 5 7 3
5 4 2 2


I would like to delete the rows whose column 2 and 3 values are the same as those of some previous row. For instance, rows 2 and 7 have the same values in columns 2 and 3. Similarly, rows 3, 4, and 6 have the same values in columns 2 and 3. So I want to keep the 2nd row and delete the 7th row, and likewise keep the 3rd row and delete the 4th and 6th rows. My desired output is:

ofile.txt
3 5 2 2
1 4 2 1
4 5 7 2
0 0 1 1


I tried this command:

awk '{a[NR]=$2""$3} a[NR]!=a[NR-1]{print}' ifile.txt > ofile.txt


But it does not give my desired output.

Answer
$ awk '!(($2,$3) in a); {a[$2,$3]}' ifile.txt
3 5 2 2
1 4 2 1
4 5 7 2
0 0 1 1

How it works

awk reads the input file one line at a time. Each input line is divided into fields. In this case, the important fields are the second, denoted $2, and the third, denoted $3.
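
To make the field splitting concrete, here is a small sketch that feeds the question's sample rows to awk on standard input and prints only the line number (NR) with the second and third fields. It is an illustration only, not part of the answer's command.

```shell
# Feed the seven sample rows from the question to awk via stdin.
printf '%s\n' '3 5 2 2' '1 4 2 1' '4 5 7 2' '5 5 7 1' '0 0 1 1' '3 5 7 3' '5 4 2 2' |
# Print the line number, then fields 2 and 3 (the pair used to detect duplicates).
awk '{print NR ": " $2, $3}'
```

Lines 2 and 7 both show the pair 4 2, and lines 3, 4, and 6 all show 5 7, which are exactly the duplicates the question describes.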

  • !(($2,$3) in a)

    This condition is true if the pair $2,$3 is not a key in the associative array a. Since no action is specified, awk performs the default action when the condition is true, which is to print the line.

    In more detail, ($2,$3) in a is true when $2,$3 is already a key of a. We want the opposite: the condition should be true only when the key has not been seen yet. Consequently, we apply awk's negation operator, !, to it.

  • a[$2,$3]

    This adds $2,$3 as a key of a. In awk, merely referencing an array element creates it, so no assignment is needed.
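
The two steps above (test for the key, then add it) are often folded into a single expression. The sketch below shows that shorthand as an alternative to the command in the answer. The array name seen and the wrapper function dedup_cols23 are made-up names for illustration.

```shell
# Hypothetical helper: keep only the first row for each (column 2, column 3) pair.
dedup_cols23() {
  # seen[$2,$3]++ returns the old count (0 the first time) and then increments it,
  # so !seen[$2,$3]++ is true only on the first occurrence of the pair.
  # With no action given, awk prints the line when the pattern is true.
  awk '!seen[$2,$3]++' "$@"
}

# Run it on the question's sample rows, supplied on stdin.
printf '%s\n' '3 5 2 2' '1 4 2 1' '4 5 7 2' '5 5 7 1' '0 0 1 1' '3 5 7 3' '5 4 2 2' |
dedup_cols23
```

Called with a file name instead (dedup_cols23 ifile.txt > ofile.txt), it reads that file. For comparison, the command in the question checks each row only against the row immediately before it (a[NR-1]), so on this input it drops row 4 but still prints rows 6 and 7.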