blueberryfields - 1 month ago
Scala Question

Can I auto-load CSV headers from a separate file for a Scala Spark window on Zeppelin?

I have a data source which is stored as a large number of gzipped CSV files. The header information for this source is in a separate file.

I'd like to load this data into Spark for manipulation. Is there an easy way to get Spark to figure out the schema and load the headers? There are literally hundreds of columns, and they might change between runs, so I'd strongly prefer not to do this by hand.

Answer

This can easily be done in Spark. If your header file is header.csv and it contains only the header row, first load that file with the header option set to true:

val headerCSV = spark.read.format("csv").option("header", "true").load("/home/shivansh/Desktop/header.csv")

Then get the column names as an Array:

val columns = headerCSV.columns

Then read the data file, which has no header row, and apply those column names to it:

spark.read.format("CSV").load("/home/shivansh/Desktop/fileWithoutHeader.csv").toDF(columns:_*)

This will result in a DataFrame with the data from the second file and the column names taken from the header file.
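
For the gzipped, multi-file case mentioned in the question, the same three steps can be combined in a single Zeppelin paragraph. Below is a minimal sketch, assuming hypothetical paths and that the header file contains exactly one line; Spark decompresses .csv.gz files transparently, and the spark session is already available in a Zeppelin notebook.

// 1. Read the single-line header file with the header option enabled,
//    so Spark uses that line as the column names.
val headerDF = spark.read
  .format("csv")
  .option("header", "true")
  .load("/path/to/header.csv")          // hypothetical path

// 2. Pull the column names out as an Array[String].
val columns = headerDF.columns

// 3. Read the headerless, gzipped data files and rename the default
//    _c0, _c1, ... columns using the names from the header file.
//    toDF requires the header to have the same number of columns as the data.
val dataDF = spark.read
  .format("csv")
  .option("header", "false")
  .load("/path/to/data/*.csv.gz")       // hypothetical glob over the gzipped parts
  .toDF(columns: _*)

dataDF.printSchema()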