AishwaryaKulkarni AishwaryaKulkarni - 1 year ago 42
Bash Question

Removing duplicate lines with different columns

I have a file which looks like follows:

ENSG00000197111:I12 0
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 0
ENSG00000197111:I5 1


I have some lines that are duplicated but I cannot remove by sort -u because the second column has different values for them (1 or 0). How do I remove such duplicates by keeping the lines with second column as 1 such that the file will be

ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1

Answer Source

you can use awk and or operator, if the order isn't mandatory

awk '{d[$1]=d[$1] || $2}END{for(k in d) print k, d[k]}' file

you get

ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0

Edit, only sort solution

You can use sort with a double pass, example

sort -k1,1 -k2,2r file | sort -u -k1,1

you get,

ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download