UrVal UrVal - 1 year ago 114
Scala Question

Overwrite Spark dataframe schema

Based on this article it seems that Spark cannot edit and RDD or column. A new one has to be created with the new type and the old one deleted. The for loop and .withColumn method suggested below seem to be the easiest way to get the job done.

Is there a simple way (for both human and machine) to convert multiple columns to a different data type?

I tried to define the schema manually, then load the data from a parquet file using this schema and save it to another file but I get "Job aborted."..."Task failed while writing rows" every time and on every DF. Somewhat easy for me, laborious for Spark ... and it does not work.

Another option is using:

df = df.withColumn("new_col", df("old_col").cast(type)).drop("old_col").withColumnRenamed("new_col", "old_col")

A bit more work for me as there are close to 100 columns and, if Spark has to duplicate each column in memory, then that doesn't sound optimal either. Is there an easier way?

Answer Source

Depending on how complicated the casting rules are, you can accomplish what you are asking a with this loop:

scala> var df = Seq((1,2),(3,4)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> df.show
|  a|  b|
|  1|  2|
|  3|  4|

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> > df.columns.foreach{c => df = df.withColumn(c, df(c).cast(DoubleType))}

scala> df.show
|  a|  b|

This should be as efficient as any other column operation.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download