Legit Stack Legit Stack - 3 months ago 11
Python Question

Pandas getting all rows listed in one dataframe but not the other UNORDERD

I cannot find an easy way to get all the rows of a data frame that are found in one dataframe but not a second dataframe if the data is unordered.

These two answers talk are solutions for ordered data:


Get rows that are present in one dataframe, but not the other

pandas get rows which are NOT in other dataframe


So just to make it clear I'm trying to get this:
data from one dataframe thats not found in the other dataframe

In one of those related question mentioned above I found a multiindexing solution that supposedly works with unordered data, but I was unable to implement it. I am hoping theres an easier way.

let me give you an example of the data I'm working with:

DF1
col_a col_b
1325 foo
1397 foo #<---matching value, but not matching index in DF2
1645 foo
... ...

DF2
col_1 col_2
1397 foo #<---matching value, but not matching index in DF1
1500 foo
1621 foo
... ...


Now if that were all the data in both dataframes the result for processing this specifically for DF1 would look like this:

DF1_UNIQUE
col_a col_b
1325 foo
1645 foo


(So I'm really only caring about
col_a
or for DF2
col_1
). Notice its missing the 1397 row. that's because it is found in DF2, so I don't want it returned to my new DF. But its not found in the same index and there in lies the problem I have. I already easily created a solution if all the matching indexes are lined up, but I don't know where to start on the indexes that aren't lined up. Can I use the merge function? Or is that the wrong tool for this job?

This code isn't entirely relevant but its the solution I came up with if all the indexes lined up correctly:

def getUniqueEntries(df1, df2):
"""takes two dataframes, returns a dataframe that is comprized of all the rows unique to the first dataframe."""
d1columns = df1.columns
d2columns = df2.columns
df3 = pd.merge(df1, df2, left_on=d1columns[0], right_on=d2columns[0])
print(df3)
return df1[(~df1[d1columns[0]].isin(df3[d1columns[0]]))]

def main(fileread1, fileread2, writeprefix):
df1 = pd.read_csv(fileread1)
df2 = pd.read_csv(fileread2)

df3 = getUniqueEntries(df1, df2)
df4 = getUniqueEntries(df2, df1)

print(df3)
print(df4)

df3.to_csv(writeprefix+fileread1, index=False)
df4.to_csv(writeprefix+fileread2, index=False)

if __name__ == '__main__':
main(sys.argv[1], sys.argv[2], sys.argv[3])

Answer

Yes, you can use merge with the indicator parameter:

I renamed the columns to avoid duplicated columns You can also pass left_on and right_on

merged = DF1.merge(DF2.rename(columns={'col_1': 'col_a', 'col_2': 'col_b'}), how='left', indicator=True)
merged
Out: 
   col_a col_b     _merge
0   1325   foo  left_only
1   1397   foo       both
2   1645   foo  left_only

Now, you can filter merged using the indicator column:

merged[merged['_merge']=='left_only']
Out: 
   col_a col_b     _merge
0   1325   foo  left_only
2   1645   foo  left_only