Python Question

How to concatenate/append multiple Spark dataframes column wise in Pyspark?

How do I do the equivalent of pandas pd.concat([df1, df2], axis='columns') with PySpark dataframes?
I googled and couldn't find a good solution.

DF1
var1
3
4
5

DF2
var2 var3
23 31
44 45
52 53

Expected output dataframe
var1 var2 var3
3 23 31
4 44 45
5 52 53


Edited to include expected output

Answer Source

Below is an example of what you want to do, but in Scala; I hope you can convert it to PySpark.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ParquetAppendMode")
  .getOrCreate()
import spark.implicits._

val df1 = spark.sparkContext.parallelize(Seq(
  (1, "abc"),
  (2, "def"),
  (3, "hij")
)).toDF("id", "name")

val df2 = spark.sparkContext.parallelize(Seq(
  (19, "x"),
  (29, "y"),
  (39, "z")
)).toDF("age", "address")

// Combined schema: the columns of df1 followed by the columns of df2
val schema = StructType(df1.schema.fields ++ df2.schema.fields)

// Zip the two underlying RDDs row by row and merge each pair of rows into one.
// This requires both dataframes to have the same number of rows in each partition.
val df1df2 = df1.rdd.zip(df2.rdd).map {
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)
}

spark.createDataFrame(df1df2, schema).show()
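
The same row-zipping approach in PySpark might look roughly like this (a minimal sketch, assuming both dataframes have the same number of rows and the same partitioning):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.master("local").appName("ConcatColumns").getOrCreate()

# Example data matching the question (column names from the question)
df1 = spark.createDataFrame([(3,), (4,), (5,)], ["var1"])
df2 = spark.createDataFrame([(23, 31), (44, 45), (52, 53)], ["var2", "var3"])

# Combined schema: the columns of df1 followed by the columns of df2
schema = StructType(df1.schema.fields + df2.schema.fields)

# Zip the underlying RDDs row by row and merge each pair of rows into one tuple.
# zip() assumes both RDDs have the same number of elements in each partition.
combined_rdd = df1.rdd.zip(df2.rdd).map(lambda rows: tuple(rows[0]) + tuple(rows[1]))

spark.createDataFrame(combined_rdd, schema).show()

Note that rdd.zip only works when the two dataframes line up partition for partition; if they don't, one common workaround is to add an index to each dataframe (for example with monotonically_increasing_id or zipWithIndex) and join on that index instead.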

Hope this helps!
