Harinder Singh Harinder Singh - 2 months ago 14
Python Question

Remove duplicate rows from DataFrame but keeping one column as list- Python

I have a dataframe like this:

file:
| FIRST | LAST | ID |
---------------------------
0 "ABC" 12 35
1 "ABC" 14 35
2 "AB" 15 36


Now, what I want is:

file:
| FIRST | LAST | ID |
---------------------------
0 "ABC" [12,14] 35
2 "AB" 15 36


For this problem let's assume that if ID of two rows is equal then all the values except LAST is also equal.

Therefore, replace all the value except the values of last, which are added to a list.

I tried using solution given in this link:
Pandas DataFrame - Combining one column's values with same index into list

I used this:

file = file.groupby('ID')

file = file['Last'].unique()


This is the output I got:

ID
35 [12, 14]
36 [15]
Name: Last, dtype: object


Probably, I am missing something in the groupby().

Thanks in advance :)

UPDATE:

My original Dataframe has more than 100 columns.
if ID of two rows is equal then all the values except LAST is also equal.

Answer Source

Given that only the last two rows are different for a given ID, just take the first value when applying a groupby to them. For the column `LAST', use its value or convert it to a list of unique items if there are more than one.

grouping_cols = ['ID', ...]
agg_cols = {col: 'first' for col in df if col not in grouping_cols}
agg_cols['LAST'] = lambda x: x.unique().tolist() if len(x) > 1 else x.iat[0]
>>> df.groupby(grouping_cols, as_index=False).agg(agg_cols)
  ID      LAST FIRST
0  35  [12, 14]   ABC
1  36        15    AB