Savage Henry - 8 months ago
Python Question

Remove duplicates in one column and, in multiple other columns, set the row value to the max of that column?

I have a dataframe with the following format:

id x1 x2 x3
A 1 0 0
A 0 1 0
A 0 0 0
B 1 0 0
B 0 0 0
B 0 0 1

I would like to "compress" this based on the repeated values in the id column, keeping the 1's under the correct column, but in a single row for each id. That is, the dataframe should look like this at the end:

id x1 x2 x3
A 1 1 0
B 1 0 1

My actual data set is millions of rows with potentially thousands of columns. I could do this by grouping by rows and looping over columns, but would like to know if there is a more Pythonic/Pandified/memory efficient way of doing this.

I have looked at answers for questions that are concerned with keeping the max (or some other function) of a column, but that goes back to having to loop over all the columns. (Here is one example.)



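A simple, efficient way to do this in pandas is to group by id and take the max. The max is applied to every remaining column at once, so no explicit loop over columns is needed: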

In [8]: df.groupby('id').max()
Out[8]:
    x1  x2  x3
id
A    1   1   0
B    1   0   1
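
For anyone who wants to try this end to end, here is a minimal, self-contained sketch. It rebuilds the sample frame from the question and assumes the flags are plain integer 0/1 columns. The as_index=False argument is my addition, not part of the one-liner above. It keeps id as an ordinary column so the result has the same layout as the desired output in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B"],
    "x1": [1, 0, 0, 1, 0, 0],
    "x2": [0, 1, 0, 0, 0, 0],
    "x3": [0, 0, 0, 0, 0, 1],
})

# Group on id and take the max of every remaining column in one call.
# as_index=False (an addition here) keeps id as a regular column
# instead of moving it into the index.
result = df.groupby("id", as_index=False).max()

print(result)
```

This should give one row per id, with a 1 in every column where any of that id's rows had a 1. Without as_index=False, id becomes the index of the result, and calling .reset_index() afterwards would give the same layout.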