Scala Question

Read .csv data in European format with Spark

I am currently making my first attempts with Apache Spark.
I would like to read a .csv file with an SQLContext object, but Spark does not provide the correct results, as the file is a European one (comma as decimal separator and semicolon as value separator).
Is there a way to tell Spark to follow a different .csv syntax?

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("Foo")

val sc = new SparkContext(conf)

val sqlContext = new SQLContext(sc)

val df = sqlContext.read
  .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv")

df.show()


A row in the corresponding .csv file looks like this:

04.10.2016;12:51:00;1,1;0,41;0,416


Spark interprets the entire row as a single column. df.show() prints:

+--------------------------------+
|Col1;Col2,Col3;Col4;Col5        |
+--------------------------------+
|            04.10.2016;12:51:...|
+--------------------------------+


In previous attempts to get this working, df.show() even printed more of the row content where it now says '...', but eventually cut the row off at the comma in the third column.

Answer

You can just read the file as text and split by ";", or set a custom delimiter on the CSV reader, as in .option("delimiter", ";").
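
Here is a minimal sketch of the delimiter approach, assuming Spark 2.x (where the built-in csv source is available; on Spark 1.x the spark-csv package accepts the same delimiter option). The column names Col3 through Col5 are assumed from the header shown in the question's output, and the comma decimals are converted manually, since the CSV source has no option for a European decimal separator:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, regexp_replace}

val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("Foo")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Read with ';' as the value separator; the short name "csv" resolves to
// the same built-in source as the fully qualified class name above.
val raw = sqlContext.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .load("data.csv")

// Values like "1,1" arrive as strings, because inferSchema cannot parse
// comma decimals. Swap the comma for a dot, then cast to Double.
// Col3..Col5 are assumed from the header shown in the question.
val df = Seq("Col3", "Col4", "Col5").foldLeft(raw) { (d, c) =>
  d.withColumn(c, regexp_replace(col(c), ",", ".").cast("double"))
}

df.show()

The regexp_replace/cast step is needed even once the delimiter is correct: fixing the separator splits the columns properly, but the decimal-comma values would otherwise remain strings.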