Marcus Renno Marcus Renno - 2 months ago 55
Python Question

Pandas check if row exist in another dataframe and append index

I'm having one problem to iterate over my dataframe. The way I'm doing is taking a loooong time and I don't have that many rows (I have like 300k rows)

What am I trying to do?


  1. Check if one DF (A) contains the value of two columns of the other DF (B). You can think this as a multiple key field

  2. If True, get the index of DF.B and assign to one column of DF.A

  3. If False, two steps:

    a. append to DF.B the two columns not found

    b. assign the new ID to DF.A (I couldn't do this one)



This is my code, where:


  1. df is DF.A and df_id is DF.B:

  2. SampleID and ParentID are the two columns I am interested to check if they exist in both dataframes

  3. Real_ID is the column which I want to assign the id of DF.B (df_id)

    for index, row in df.iterrows():
    #check if columns exist in the other dataframe
    real_id = df_id[(df_id['SampleID'] == row['SampleID']) & (df_id['ParentID'] == row['ParentID'])]

    if real_id.empty:
    #row does not exist, append to df_id
    df_id = df_id.append(row[['SampleID','ParentID']])
    else:
    #row exists, assign id of df_id to df
    row['Real_ID'] = real_id.index



EXAMPLE:

DF.A (df)

Real_ID SampleID ParentID Something AnotherThing
0 20 21 a b
1 10 11 a b
2 40 51 a b


DF.B (df_id)

SampleID ParentID
0 10 11
1 20 21


Result:

Real_ID SampleID ParentID Something AnotherThing
0 1 10 11 a b
1 0 20 21 a b
2 2 40 51 a b


SampleID ParentID
0 20 21
1 10 11
2 40 51


Again, this solution is very slow. I'm sure there is a better way to do this and that's why I'm asking here. Unfortunately this was what I got after some hours...

Thanks

Answer

you can do it this way:

Data (pay attention at the index in the B DF):

In [276]: cols = ['SampleID', 'ParentID']

In [277]: A
Out[277]:
   Real_ID  SampleID  ParentID Something AnotherThing
0      NaN        10        11         a            b
1      NaN        20        21         a            b
2      NaN        40        51         a            b

In [278]: B
Out[278]:
   SampleID  ParentID
3        10        11
5        20        21

Solution:

In [279]: merged = pd.merge(A[cols], B, on=cols, how='outer', indicator=True)

In [280]: merged
Out[280]:
   SampleID  ParentID     _merge
0        10        11       both
1        20        21       both
2        40        51  left_only


In [281]: B = pd.concat([B, merged.ix[merged._merge=='left_only', cols]])

In [282]: B
Out[282]:
   SampleID  ParentID
3        10        11
5        20        21
2        40        51

In [285]: A['Real_ID'] = pd.merge(A[cols], B.reset_index(), on=cols)['index']

In [286]: A
Out[286]:
   Real_ID  SampleID  ParentID Something AnotherThing
0        3        10        11         a            b
1        5        20        21         a            b
2        2        40        51         a            b
Comments