Paul - 4 months ago
Bash Question

Deduplicate two column file based on minimum value in column 2 AWK / BASH

I have a tab-delimited file with 2 fields that looks like this:

denovo0 90.2
denovo1 97.7
denovo1 97.7
denovo1 96.9
denovo10 93.8
denovo10 92.2
denovo10 91.5
denovo100 95.3
denovo100 95.3
denovo100 94.6


And I would like to keep, for each unique string in the first field, only the row with the lowest value in the second column, to get:

denovo0 90.2
denovo1 96.9
denovo10 91.5
denovo100 94.6


As can be seen in the example above, some rows in the file may be complete duplicates of other rows; I am not sure how that would influence solutions.

I have looked up similar solutions on StackOverflow, e.g. Uniq in awk; removing duplicate values in a column using awk, but was not able to adapt them.

I would be happy if someone could help.

I'd prefer using AWK, but BASH would also be an option. I am working on Mac OS X Yosemite.

Thank you and kind regards,

Paul

Answer

A script that simply keeps the first value seen for each key in column 1 looks like this:

awk '{if (!($1 in a)) a[$1] = $2} END { for (key in a) print key, a[key] }'

Output:

denovo0 90.2
denovo1 97.7
denovo10 93.8
denovo100 95.3

For the results you describe (the minimum value in column 2 for each key in column 1), you can use:

awk '{ if (!($1 in a)) a[$1] = $2; else if (a[$1] > $2) a[$1] = $2 }
     END { for (key in a) print key, a[key] }'

Output:

denovo0 90.2
denovo1 96.9
denovo10 91.5
denovo100 94.6
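If output sorted by key is acceptable, the same minimum-per-key result can also be had without tracking minima in awk at all: sort so that the smallest value for each key comes first, then keep only the first line per key. A sketch, assuming a hypothetical input file named file.tsv:

```shell
# Sort by key (column 1), then numerically ascending on column 2, so the
# minimum value for each key comes first; then keep only the first line
# seen per key.
sort -k1,1 -k2,2n file.tsv | awk '!seen[$1]++'
```

A side benefit of this variant is that it prints the original lines unmodified, so the tab delimiter is preserved in the output.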

You can also reproduce the first script's output by looking for the maximum value for each key; it so happens that the maximum value is also the first one seen for each key in the sample data.

awk '{ if (!($1 in a)) a[$1] = $2; else if (a[$1] < $2) a[$1] = $2 }
     END { for (key in a) print key, a[key] }'

Output:

denovo0 90.2
denovo1 97.7
denovo10 93.8
denovo100 95.3
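One caveat: `for (key in a)` visits array elements in an unspecified order in POSIX awk (including the BSD awk shipped with OS X), so the runs above only happened to come out in key order. If a deterministic order matters, pipe the result through sort. A sketch of the minimum-per-key version, again assuming a hypothetical input file named file.tsv:

```shell
# The condition below condenses the two branches of the script above:
# store $2 if the key is new, or if $2 is smaller than the stored value.
# Piping through sort makes the output order deterministic.
awk '{ if (!($1 in a) || $2 < a[$1]) a[$1] = $2 }
     END { for (key in a) print key, a[key] }' file.tsv | sort -k1,1
```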