Savage Henry - 1 year ago
Python Question

Remove duplicates in one column and set every other column to that column's max within each group?

I have a dataframe with the following format:

id x1 x2 x3
A 1 0 0
A 0 1 0
A 0 0 0
B 1 0 0
B 0 0 0
B 0 0 1


I would like to "compress" this based on the repeated values in id, keeping the 1's under the correct column, but in a single row for the id. That is, the dataframe should look like this at the end:

id x1 x2 x3
A 1 1 0
B 1 0 1


My actual data set is millions of rows with potentially thousands of columns. I could do this by grouping by rows and looping over columns, but would like to know if there is a more Pythonic/Pandified/memory efficient way of doing this.

I have looked at answers for questions that are concerned with keeping the max (or some other function) of a column, but that goes back to having to loop over all the columns. (Here is one example.)

Thanks!

Answer

A groupby on id followed by max() handles every column at once, so no explicit loop over the columns is needed:

In [8]: df.groupby('id').max()
Out[8]: 
    x1  x2  x3
id            
A    1   1   0
B    1   0   1
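
For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the example frame from the question and applies the same groupby. The as_index=False and sort=False arguments are my additions, not part of the call above: the first keeps id as a regular column, as in the desired output, and the second skips sorting the group keys, which may save some time on large data.

```python
import pandas as pd

# Rebuild the small example frame from the question.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B"],
    "x1": [1, 0, 0, 1, 0, 0],
    "x2": [0, 1, 0, 0, 0, 0],
    "x3": [0, 0, 0, 0, 0, 1],
})

# Take the column-wise max within each id group.
# as_index=False keeps id as an ordinary column instead of the index;
# sort=False leaves groups in order of first appearance.
result = df.groupby("id", as_index=False, sort=False).max()

print(result)
```

If the x columns only ever hold 0 and 1, casting them to a smaller dtype such as uint8 before grouping should reduce memory use, though I have not measured this on data of the size described in the question.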