UrVal - 7 months ago
Scala Question

Overwrite Spark dataframe schema

Based on this article it seems that Spark cannot edit an RDD or column in place. A new one has to be created with the new type and the old one dropped. The for loop and .withColumn method suggested below seem to be the easiest way to get the job done.

Is there a simple way (for both human and machine) to convert multiple columns to a different data type?

I tried to define the schema manually, load the data from a parquet file using this schema, and save it to another file, but I get "Job aborted." ... "Task failed while writing rows" every time, on every DataFrame. Somewhat easy for me, laborious for Spark, and it does not work.
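For reference, the manual-schema attempt probably looked roughly like this (a sketch; the field names, paths, and `spark` session are hypothetical, not taken from the question). Forcing a read schema whose types disagree with what the parquet file actually stores is one likely source of the error seen above:

```scala
import org.apache.spark.sql.types.{StructType, StructField, DoubleType}

// Hypothetical schema mirroring the approach described above.
val schema = StructType(Seq(
  StructField("a", DoubleType, nullable = true),
  StructField("b", DoubleType, nullable = true)
))

// Parquet files carry their own types; overriding them at read time with
// incompatible ones can surface later as "Task failed while writing rows"
// when the rows are materialized during the save.
val df = spark.read.schema(schema).parquet("/path/to/input.parquet")
df.write.parquet("/path/to/output.parquet")
```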

Another option is using:

df = df.withColumn("new_col", df("old_col").cast(type)).drop("old_col").withColumnRenamed("new_col", "old_col")

A bit more work for me, as there are close to 100 columns, and if Spark has to duplicate each column in memory, that doesn't sound optimal either. Is there an easier way?
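One way to avoid repeating the cast-drop-rename dance per column is a single select that casts every column while keeping its name. This is only a sketch (the toy DataFrame is an assumption, and it presumes every column should become a double):

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Toy stand-in for the real ~100-column DataFrame.
val df = Seq((1, 2), (3, 4)).toDF("a", "b")

// One projection that casts each column and keeps its original name,
// so no drop/rename step is needed.
val casted = df.select(df.columns.map(c => col(c).cast(DoubleType).as(c)): _*)

casted.printSchema()
```

Because this is one projection, Catalyst plans it as a single pass over the data rather than a chain of per-column transformations.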


Depending on how complicated the casting rules are, you can accomplish what you are asking with this loop:

scala> var df = Seq((1,2),(3,4)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> df.show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> df.columns.foreach{c => df = df.withColumn(c, df(c).cast(DoubleType))}

scala> df.show
+---+---+
|  a|  b|
+---+---+
|1.0|2.0|
|3.0|4.0|
+---+---+

This should be as efficient as any other column operation.
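If only some of the columns need converting, or different columns need different types, the same loop can be driven by a map from column name to target type. A sketch, assuming hypothetical per-column targets (columns not listed keep their original type):

```scala
import org.apache.spark.sql.types.{DataType, DoubleType, StringType}

// Hypothetical casting rules; adjust to the real columns and types.
val targets: Map[String, DataType] = Map("a" -> DoubleType, "b" -> StringType)

var df = Seq((1, 2), (3, 4)).toDF("a", "b")

// Reassign df once per listed column, as in the loop above.
targets.foreach { case (c, t) => df = df.withColumn(c, df(c).cast(t)) }
```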