Michael Michael - 3 years ago 193
Python Question

Order of rows in DataFrame after aggregation

Suppose I've got a data frame df (created from a hard-coded array for tests):

+----+----+---+
|name|  c1|qty|
+----+----+---+
|   a|abc1|  1|
|   a|abc2|  0|
|   b|abc3|  3|
|   b|abc4|  2|
+----+----+---+


I am grouping and aggregating it to get df1:


import pyspark.sql.functions as sf

df1 = df.groupBy('name').agg(sf.min('qty'))
df1.show()
+----+--------+
|name|min(qty)|
+----+--------+
|   b|       2|
|   a|       0|
+----+--------+


What is the expected order of the rows in df1?

Suppose now that I am writing a unit test. I need to compare df1 with the expected data frame. Should I compare them ignoring the order of rows? What is the best way to do it?

Answer Source

The ordering of the rows in the DataFrame is not fixed, so a test should not depend on it. There is an easy way to compare the result with the expected DataFrame in test cases.

Do a DataFrame diff in both directions. For Scala:

   assert(df1.except(expectedDf).count == 0)

And

   assert(expectedDf.except(df1).count == 0)

For Python, replace except with subtract.
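
In Python, the two checks above might look like the sketch below. The helper name assert_same_rows is hypothetical (it is not part of PySpark); it relies only on DataFrame.subtract and DataFrame.count, and it assumes both frames have the same schema.

```python
def assert_same_rows(actual_df, expected_df):
    """Assert that two DataFrames hold the same set of rows, ignoring order.

    Hypothetical helper: assumes both arguments are PySpark DataFrames
    with the same schema.
    """
    # Rows present in the actual result but absent from the expected frame.
    unexpected = actual_df.subtract(expected_df).count()
    # Rows present in the expected frame but absent from the actual result.
    missing = expected_df.subtract(actual_df).count()
    assert unexpected == 0, "%d unexpected row(s)" % unexpected
    assert missing == 0, "%d missing row(s)" % missing
```

In a test this would be called as assert_same_rows(df1, expected_df), where expected_df is built from hard-coded rows in the same way as df.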

From the documentation:

subtract(other) Return a new DataFrame containing rows in this frame but not in another frame.

This is equivalent to EXCEPT in SQL.
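
Because EXCEPT here is a set operation, the two subtract checks do not notice rows that differ only in how many times they appear. For small test frames, one alternative (not part of the original answer) is to collect both frames and compare the rows as multisets. The function name same_rows_with_duplicates is hypothetical, and the sketch assumes every column value is hashable (no array or map columns).

```python
from collections import Counter


def same_rows_with_duplicates(actual_rows, expected_rows):
    """Return True if both collections contain the same rows, each the
    same number of times, regardless of order.

    Hypothetical helper: intended for the lists returned by
    DataFrame.collect(), but works on any iterables of tuples.
    """
    # PySpark Row objects are tuples, so tuple(row) keeps only the values.
    actual_counts = Counter(tuple(row) for row in actual_rows)
    expected_counts = Counter(tuple(row) for row in expected_rows)
    return actual_counts == expected_counts
```

In a test this might be used as assert same_rows_with_duplicates(df1.collect(), expected_df.collect()). Collecting pulls every row to the driver, so this only suits frames small enough to fit in memory.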
