Jack Doce - 10 days ago
Scala Question

(Scala) Convert String to Date in Apache Spark

I would like to read a .csv file with Spark and assign an appropriate type to each column.

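import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{DateType, DoubleType, StringType, StructField, StructType}

val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("Name")

val sc = new SparkContext(conf)

val sqlContext = new SQLContext(sc)

// Desired column types; "date" is the column that fails to parse
val customSchema = StructType(Array(
  StructField("date", DateType, true),
  StructField("time", StringType, true),
  StructField("am", DoubleType, true),
  StructField("hum", DoubleType, true),
  StructField("temp", DoubleType, true)
))

// Semicolon-delimited file with a header row
val df = sqlContext.read
  .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(customSchema)
  .load("data.csv")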


A line of the .csv I am reading looks like this:

+----------+--------+-----+-----+-----+
| date| time| am| hum| temp|
+----------+--------+-----+-----+-----+
|04.10.2016|12:51:20|1.121|0.149|0.462|
+----------+--------+-----+-----+-----+


Spark reads the .csv and assigns the types correctly if I declare the date column as String. If I keep the customSchema from the code shown above, Spark throws an exception because of the date format (DateType expects YYYY-MM-DD, while my dates are DD.MM.YYYY).


Is there a way to reformat the date strings to YYYY-MM-DD and apply the schema afterwards? Or can I change the format that Spark expects for DateType by passing parameters?

Thanks in advance

Answer

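Use the dateFormat option of the CSV reader to tell Spark how the dates are written, and keep DateType in customSchema. The pattern letters are case-sensitive: dd is the day of the month, MM is the month, and yyyy is the year (lowercase mm would mean minutes):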

val df = sqlContext.read
  .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
  .option("header","true")
  .option("delimiter",";")
  .option("dateFormat", "dd.MM.yyyy")
  .schema(customSchema)
  .load("data.csv")
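
For the other route raised in the question, reformatting the strings and typing the column afterwards, a minimal sketch might look like the following. It assumes the file was loaded with the date column declared as StringType instead of DateType. The object DateColumnFix and the method withParsedDate are hypothetical names chosen for illustration, not part of Spark; unix_timestamp, to_date and col come from org.apache.spark.sql.functions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

// Hypothetical helper, not part of Spark.
object DateColumnFix {
  // Assumes df has a string column "date" holding values such as "04.10.2016".
  // unix_timestamp parses the string with the given pattern (seconds since epoch),
  // the cast turns that into a timestamp, and to_date keeps only the date part.
  def withParsedDate(df: DataFrame): DataFrame =
    df.withColumn(
      "date",
      to_date(unix_timestamp(col("date"), "dd.MM.yyyy").cast("timestamp"))
    )
}
```

Calling DateColumnFix.withParsedDate(df) should give a DataFrame whose date column has DateType. Depending on the Spark version and settings, values that do not match the pattern typically come back as null rather than failing the read, so it is worth checking for nulls afterwards. When the reader option is available, dateFormat is the shorter solution.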