Michael Michael - 3 years ago 193
Python Question

Order of rows in DataFrame after aggregation

Suppose I've got a data frame

(created from a hard-coded array for tests)

+----+----+---+
|name|  c1|qty|
+----+----+---+
|   a|abc1|  1|
|   a|abc2|  0|
|   b|abc3|  3|
|   b|abc4|  2|
+----+----+---+

I am grouping and aggregating it:

import pyspark.sql.functions as sf

df1 = df.groupBy('name').agg(sf.min('qty'))

to get

+----+--------+
|name|min(qty)|
+----+--------+
|   b|       2|
|   a|       0|
+----+--------+

What is the expected order of the rows in df1?

Suppose now that I am writing a unit test. I need to compare df1 with the expected data frame. Should I compare them ignoring the order of rows? What is the best way to do it?

Answer

The order of the rows in a DataFrame after an aggregation is not fixed, so a test should not depend on it. An easy way to compare the result with the expected DataFrame regardless of row order is a DataFrame diff in both directions.

For Scala:

   assert(df1.except(expectedDf).count == 0)
   assert(expectedDf.except(df1).count == 0)

For Python, replace except with subtract.
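A minimal sketch of the same two-way check in PySpark. The helper name assert_same_rows is hypothetical (it is not a Spark API), and it assumes both arguments are DataFrames with the same schema:

```python
def assert_same_rows(actual_df, expected_df):
    """Fail if either DataFrame contains a row that the other lacks.

    Row order is ignored. Both arguments are assumed to be PySpark
    DataFrames with identical schemas (hypothetical helper).
    """
    # Rows produced by the code under test that were not expected.
    unexpected = actual_df.subtract(expected_df).count()
    # Rows that were expected but are missing from the result.
    missing = expected_df.subtract(actual_df).count()
    assert unexpected == 0, f"{unexpected} unexpected row(s)"
    assert missing == 0, f"{missing} missing row(s)"
```

In a test this would be called as assert_same_rows(df1, expectedDf), where expectedDf is built from the hard-coded expected rows.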

From the documentation:

subtract(other): Return a new DataFrame containing rows in this frame but not in another frame. This is equivalent to EXCEPT in SQL.
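
One caveat follows from the EXCEPT semantics: subtract works on distinct rows, so the two-way check cannot tell apart two DataFrames that differ only in how many times a row is repeated. On Spark 2.4 or later, exceptAll keeps duplicates and can be used in place of subtract. If the result is small enough to collect to the driver, another option is to compare the collected rows as multisets. The sketch below is one way to do that; the helper name same_rows is hypothetical, and it takes plain sequences of rows so that it can be tried without a Spark session.

```python
from collections import Counter

def same_rows(actual_rows, expected_rows):
    """Return True if both sequences hold the same rows, ignoring order.

    Unlike a set difference, a Counter also compares how many times
    each row occurs. Rows must be hashable: tuples, or Spark Row
    objects whose values are hashable.
    """
    return Counter(map(tuple, actual_rows)) == Counter(map(tuple, expected_rows))

# Same rows in a different order compare equal.
print(same_rows([("b", 2), ("a", 0)], [("a", 0), ("b", 2)]))  # True
# A repeated row is detected, which a set difference would miss.
print(same_rows([("a", 0), ("a", 0)], [("a", 0)]))  # False
```

With real DataFrames the call would look like same_rows(df1.collect(), expectedDf.collect()).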
