
PySpark: does a new DataFrame retain the old DataFrame's partitioning?

I have a partitioned DataFrame, say df1. From df1 I will create df2 and df3:

from pyspark.sql.functions import concat, sum

df1 = df1.withColumn("key", concat("col1", "col2", "col3"))
df1 = df1.repartition(400, "key")

df2 = df1.groupBy("col1", "col2").agg(sum("colx"))
df3 = df1.join(df2, ["col1", "col2"])


I want to know: will df3 retain the same partitioning as df1, or do I need to repartition df3 again?

Answer

The partitioning of df3 will be completely different from that of df1: the groupBy aggregation and the join both shuffle the data by ("col1", "col2"), so the 400 partitions hashed on "key" are not carried over. And df2 will (most likely) end up with spark.sql.shuffle.partitions (default: 200) partitions, not 400.
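
A quick way to see this is to check the partition counts and the physical plan yourself. Below is a minimal, self-contained sketch, assuming a local SparkSession and a made-up df1 with the column names from the question; the two config settings only switch off broadcast joins and adaptive query execution so the default shuffle behaviour (200 partitions) is visible.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # force a shuffle join for the demo
spark.conf.set("spark.sql.adaptive.enabled", "false")       # keep the classic 200 shuffle partitions

# Hypothetical sample data standing in for the question's df1
df1 = spark.createDataFrame(
    [("a", "b", "c", 1), ("a", "b", "d", 2), ("x", "y", "z", 3)],
    ["col1", "col2", "col3", "colx"],
)

df1 = df1.withColumn("key", F.concat("col1", "col2", "col3"))
df1 = df1.repartition(400, "key")
print(df1.rdd.getNumPartitions())   # 400 -- hash-partitioned by "key"

df2 = df1.groupBy("col1", "col2").agg(F.sum("colx"))
print(df2.rdd.getNumPartitions())   # 200 -- the aggregation shuffles by (col1, col2)

df3 = df1.join(df2, ["col1", "col2"])
df3.explain()                       # the plan shows an Exchange on (col1, col2);
                                    # df1's "key"-based layout is not reused
print(df3.rdd.getNumPartitions())   # 200 again, not df1's 400

# If a downstream step really needs df1's layout, repartition explicitly:
df3 = df3.repartition(400, "key")

In short: yes, if you want df3 partitioned by "key" again, you have to repartition it yourself.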