Savage Henry - 28 days ago
Python Question

Remove duplicates in one column and, in multiple other columns, set the row value to the max of that column?

I have a dataframe with the following format:

id x1 x2 x3
A 1 0 0
A 0 1 0
A 0 0 0
B 1 0 0
B 0 0 0
B 0 0 1


I would like to "compress" this based on the repeated values in id, keeping the 1's under the correct column, but in a single row for each id. That is, the dataframe should look like this at the end:

id x1 x2 x3
A 1 1 0
B 1 0 1


My actual data set is millions of rows with potentially thousands of columns. I could do this by grouping by rows and looping over columns, but would like to know if there is a more Pythonic/Pandified/memory efficient way of doing this.

I have looked at answers for questions that are concerned with keeping the max (or some other function) of a column, but that goes back to having to loop over all the columns. (Here is one example.)

Thanks!

Answer

A simple, vectorized way to do this in pandas is to group by id and take the maximum of every other column at once, with no explicit loop over the columns:

In [8]: df.groupby('id').max()
Out[8]: 
    x1  x2  x3
id            
A    1   1   0
B    1   0   1
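
For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the example frame from the question and applies the same groupby. The int8 downcast and the as_index=False argument are my own additions, not part of the original answer: the first is one way to reduce memory if the columns really only hold 0/1, and the second keeps id as an ordinary column so the result matches the layout asked for in the question.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B"],
    "x1": [1, 0, 0, 1, 0, 0],
    "x2": [0, 1, 0, 0, 0, 0],
    "x3": [0, 0, 0, 0, 0, 1],
})

# Optional (assumption: the value columns only hold 0/1).
# Downcasting to int8 cuts memory use for wide frames.
value_cols = [c for c in df.columns if c != "id"]
df = df.astype({c: "int8" for c in value_cols})

# One row per id, each value column holding its group maximum.
# as_index=False keeps id as a regular column instead of the index.
compressed = df.groupby("id", as_index=False).max()
print(compressed)
```

Because max is applied to all non-key columns in one call, the same line should work unchanged however many x columns there are.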