Styx - 26 days ago
Python Question

How to modify all rows with duplicate values in a Pandas DataFrame

I have a DataFrame df with ~250,000 rows and ~50 columns.
One column contains duplicate values, like these (simplified):

'A' 'B'
jean 626
michel 950
john 382
john 356 <-- duplicate value
boris 315
clara 886
george 619
edmund 365
edmund 523 <-- duplicate value
edmund 703 <-- duplicate value
tony 416
edgard 108
tom 374
fanny 784
lucy 660
paul 728
rebecca 919
rebecca 131 <-- duplicate value
roger 924


What I would like to do is to create a new column 'C', where:
- each time I find a unique value in 'A', the value in 'C' is the same as in 'B'
- each time I find a duplicate value in 'A', the corresponding value in 'C' is the same as the first value of the duplicates group.
A group could contain more than 20 duplicates.

The result should be:

'A' 'B' 'C'
jean 626 626
michel 950 950
john 382 382
john 356 382
max 315 315
clara 886 886
george 619 619
edmund 365 365
edmund 523 365
edmund 703 365
tony 416 416
edgard 108 108
tom 374 374
fanny 784 784
lucy 660 660
paul 728 728
rebecca 919 919
rebecca 131 919
roger 924 924
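
For anyone who wants to reproduce this, the simplified frame can be built like so (a minimal sketch covering the first rows of the table above):

import pandas as pd

# First few rows of the simplified data above
df = pd.DataFrame({
    "A": ["jean", "michel", "john", "john", "boris",
          "clara", "george", "edmund", "edmund", "edmund"],
    "B": [626, 950, 382, 356, 315, 886, 619, 365, 523, 703],
})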


I’ve tried the following code:

def myfunc(group):
    # assign the group's first 'B' value to every row's 'C'
    group['C'] = group['B'].iloc[0]
    return group

df = df.groupby('A').apply(myfunc)


This works, but it takes a very long time to execute (~600 seconds).
Any idea how to improve this? Is there a more efficient way to do it?

DSM
Answer

Your input doesn't quite match your output (e.g., boris vs. max), but if I understand what you're after, you can use .transform("first"):

In [27]: df["C"] = df.groupby("A")["B"].transform("first")

In [28]: df.head(10)
Out[28]: 
         A    B    C
0    jean0  626  626
1  michel0  950  950
2    john0  382  382
3    john0  356  382
4   boris0  315  315
5   clara0  886  886
6  george0  619  619
7  edmund0  365  365
8  edmund0  523  365
9  edmund0  703  365

In [29]: len(df)
Out[29]: 249983

which takes only a few seconds for me.
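
transform("first") broadcasts each group's first 'B' value back onto every row of that group in one vectorized pass, whereas groupby(...).apply(myfunc) calls a Python function and copies a sub-frame once per group, which is where the time goes.

If you want the same result without a groupby, an equivalent approach (a sketch, not benchmarked against the above) is to build a first-value lookup and map it back. drop_duplicates keeps the first occurrence per 'A' by default:

# 'A' -> first 'B' lookup, then map it onto every row
first_b = df.drop_duplicates("A").set_index("A")["B"]
df["C"] = df["A"].map(first_b)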