biotech - 1 year ago
Perl Question

Collapse rows based on column 1

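I would like to convert a file into a format slightly different from the one I have: rows that share the same value in column 1 should be collapsed into a single row, with duplicate entries removed.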

# input file
Q97R95 GO:0004349, GO:0005737, GO:0006561
Q97R95 GO:0004349, GO:0006561
Q97R95 GO:0005737, GO:0006561
Q97R95 GO:0006561

# desired output (removed duplicates and rows collapsed)
Q97R95 GO:0004349,GO:0005737,GO:0006561

Answer

You can use the arrays of arrays (two-dimensional arrays) that GNU awk provides from version 4.0 onward:

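awk -F'[, ]+' '
    # r[id][term] is a gawk array of arrays; referencing an element creates it,
    # so repeated terms collapse into a single key
    {for(i=2;i<=NF;i++)r[$1][$i]}
    END{for(x in r){
            printf "%s ",x;b=0;
            # b becomes 1 after the first term, so later terms get a leading comma
            for(y in r[x]){printf "%s%s",(b?",":""),y;b=1}
            print ""}
    }' file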

It gives:

Q97R95 GO:0005737,GO:0006561,GO:0004349

The duplicate fields are removed, but the original order is not kept, because the order in which for (y in r[x]) visits the keys is unspecified in awk.
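
If the order of the terms matters, one possible approach is to append each term to a string the first time it is seen, instead of relying on the traversal order of an array. The following is only a sketch in portable awk (no gawk extensions), wrapped in a shell function; the name collapse_rows and the array names keys, vals and seen are made up for illustration, and it assumes the same space/comma separated layout as the sample input.

```shell
# Hypothetical helper: collapse rows by column 1, keeping first-seen order.
# Reads from the files given as arguments, or from stdin if there are none.
collapse_rows() {
    awk -F'[, ]+' '
    NF {
        # Remember each key the first time it appears, to keep row order.
        if (!($1 in vals)) { keys[++n] = $1; vals[$1] = "" }
        for (i = 2; i <= NF; i++) {
            # Skip empty fields and terms already recorded for this key.
            if ($i == "" || seen[$1, $i]++) continue
            vals[$1] = vals[$1] (vals[$1] == "" ? "" : ",") $i
        }
    }
    END {
        # Print keys in the order they were first encountered.
        for (k = 1; k <= n; k++) print keys[k], vals[keys[k]]
    }
    ' "$@"
}
```

Calling collapse_rows file on the sample input should give the terms in the order they first appear, at the cost of holding every distinct key/term pair in memory.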