Sreejith Menon Sreejith Menon - 5 months ago 24
Python Question

Python - applying concat function for two rows in pandas data frame

I have a large data frame which is basically an image id and a set of biological features. I have to combine all the IDs(the GIDs) and the resulting features should be concatenation of all the rows with the same ID.

I do understand the solution will require use of

group_by
function with an apply to rows that I need to concat to. I am not sure of the function argument to write in this case.

Sample data.

df[['GID','AID','INDIVIDUAL_NAME','NID']].head(10)
GID AID INDIVIDUAL_NAME NID
0 546 16167 ____ -16167
1 546 16168 ____ -16168
2 546 16169 ____ -16169
3 546 16170 ____ -16170
4 5666 13822 IBEIS_PZ_1866 2139
5 5713 9269 NNP_GIRM_0149 253
6 8838 11554 IBEIS_PZ_0373 646
7 1062 9439 NNP_GIRM_0143 234
8 1062 9440 ____ -9440
9 7748 9253 ____ -9253


I need the resulting output as

GID AID INDIVIDUAL_NAME NID
0 546 16167,16168 ,16169,16170 ____, ____, ____, ____ -16167,-16168 ,-16169,-16170
1 5666 13822 IBEIS_PZ_1866 2139
2 5713 9269 NNP_GIRM_0149 253 frontleft
3 8838 11554 IBEIS_PZ_0373 646
4 1062 9439,9440 NNP_GIRM_0143, ____ 234,-9440
5 7748 9253 ____ -9253


Also, I am look for a good tutorial where they have explained apply function for pandas data frame.

Answer

You can first cast to string and then groupby by column GID and aggregate function join:

df['AID'] = df.AID.astype(str)
df['NID'] = df.NID.astype(str)

print (df.groupby('GID').agg(','.join).reset_index())
    GID                      AID      INDIVIDUAL_NAME  \
0   546  16167,16168,16169,16170  ____,____,____,____   
1  1062                9439,9440   NNP_GIRM_0143,____   
2  5666                    13822        IBEIS_PZ_1866   
3  5713                     9269        NNP_GIRM_0149   
4  7748                     9253                 ____   
5  8838                    11554        IBEIS_PZ_0373   

                           NID  
0  -16167,-16168,-16169,-16170  
1                    234,-9440  
2                         2139  
3                          253  
4                        -9253  
5                          646  

Groupby aggregation in docs.

EDIT:

Alternatively you can use astype with join:

print (df.groupby('GID').agg(lambda x: ','.join(x.astype(str))).reset_index())
    GID                      AID      INDIVIDUAL_NAME  \
0   546  16167,16168,16169,16170  ____,____,____,____   
1  1062                9439,9440   NNP_GIRM_0143,____   
2  5666                    13822        IBEIS_PZ_1866   
3  5713                     9269        NNP_GIRM_0149   
4  7748                     9253                 ____   
5  8838                    11554        IBEIS_PZ_0373   

                           NID  
0  -16167,-16168,-16169,-16170  
1                    234,-9440  
2                         2139  
3                          253  
4                        -9253  
5                          646  
Comments