josh1234 josh1234 - 9 months ago 55
Python Question

pandas error when comparing two dataframe with sort

I have multiple old tables that I need to compare to new tables, and I want to use pandas to compare them.

The data looks something like this:

data_old = [
["John", 1,2,3,None],
["Mary", 1,4,None,"Apples"],
["Jean", None,4,-4,"Peaches"],
columns = ['name', "A", "B", "C", "D"]
data_new = [
["Jean", 5,4,-4,"Peaches"],
["John", 1,-2,3,None],
["Mary", 1,4,None,"Apples"],

Here we have two relatively small datasets, and the values can be text, numeric, or NULL. I wanted to create sub-sets dataframe with the changes and export report dataframes to a csv.

My issue is this, when I sort both dataframes by 'name', I get the following error:

ValueError: Can only compare identically-labeled DataFrame objects

Here is what I am doing:

df_old = pd.DataFrame(data=data_old, columns=columns)
df_old.sort(columns='name', inplace=True)
df_new = pd.DataFrame(data=data_new, columns=columns)
df_new.sort(columns='name', inplace=True)
ne = (df_old != df_new).any(1) #ERROR
# to other stuff.....

Answer Source

I think you need sort_values instead sort, because:

FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)

and then need same index values in both DataFrames what return reset_index with parameter drop=True:

df_old = pd.DataFrame(data=data_old, columns=columns)
df_old.sort_values('name', inplace=True)
df_new = pd.DataFrame(data=data_new, columns=columns)
df_new.sort_values('name', inplace=True)
print (df_old)
   name    A  B    C        D
2  Jean  NaN  4 -4.0  Peaches
0  John  1.0  2  3.0     None
1  Mary  1.0  4  NaN   Apples

print (df_new)
   name  A  B    C        D
0  Jean  5  4 -4.0  Peaches
1  John  1 -2  3.0     None
2  Mary  1  4  NaN   Apples

df_old.reset_index(drop=True, inplace=True)
df_new.reset_index(drop=True, inplace=True)

ne = (df_old != df_new).any(1)
print (ne)
0    True
1    True
2    True
dtype: bool