Jim Hendricks - 1 month ago
Scala Question

How to create a DataFrame programmatically that isn't StringType

I'm building a rather large schema, so I'm using the programmatic schema creation example from the documentation.

val schemaString = "field1,...,field126"
val schema = StructType(schemaString.split(",").map(fieldName => StructField(fieldName.trim, StringType, true)))


This works fine, but I need all fields to be DoubleType for my ML function. I changed StringType to DoubleType and now I get an error.

val schemaString = "field1,...,field126"
val schema = StructType(schemaString.split(",").map(fieldName => StructField(fieldName.trim, DoubleType, true)))


Error:

Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
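The schema only labels the types Spark expects to find in each Row; it does not convert the values. If the underlying rows still hold Strings, a field declared as DoubleType is unboxed as a Double at runtime and the cast fails with exactly this exception. A minimal sketch (using a hypothetical CSV line) of converting the values before building rows:

```scala
// The schema describes expected types; it does not convert values.
// Convert the String fields to Double first, then build the Row.
val line = "1.5, 2.0, 3.25" // hypothetical CSV line
val values: Array[Double] = line.split(",").map(_.trim.toDouble)
// In Spark, Row.fromSeq(values) would then match a DoubleType schema.
```

With the values converted up front, the DoubleType schema and the row contents agree, and the ClassCastException goes away.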


I know I can shift to creating the schema manually, but with 126 fields the code gets bulky.

val schema = new StructType()
.add("ColumnA", IntegerType)
.add("ColumnB", StringType)

val df = sqlContext.read
.schema(schema)
.format("com.databricks.spark.csv")
.option("delimiter", ",")
.load("/path/to/file.csv")
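Even the manual route can stay compact: the 126 `.add(...)` calls can be generated instead of written by hand. A sketch, assuming the columns follow a simple numbered naming scheme (the names here are hypothetical stand-ins for the real ones):

```scala
// Generate the field names instead of writing 126 .add(...) calls.
// "field1".."field126" are hypothetical stand-ins for the real names.
val fieldNames: Seq[String] = (1 to 126).map(i => s"field$i")
// In Spark these map onto the schema in one line:
// val schema = StructType(fieldNames.map(n => StructField(n, DoubleType, nullable = true)))
```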

Answer

I think there is no need to pass your own schema; it will be inferred automatically. If your CSV file contains the column names, those will be picked up too when you set header to true.

This should work (not tested):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data/sample.csv")

It will give you a DataFrame, and if the file has a header row, just set header to true.
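One caveat: inferSchema may not produce DoubleType for every column (integer-looking columns infer as integers, for example). If the ML step strictly needs doubles, one option is to cast all columns after loading. A sketch, assuming Spark is on the classpath and `df` is the DataFrame loaded as above:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Cast every column of the loaded DataFrame to Double for the ML step.
val doubled = df.columns.foldLeft(df) { (d, c) =>
  d.withColumn(c, col(c).cast(DoubleType))
}
```

This avoids spelling out 126 fields anywhere: the column names come from the file's header, and the cast is applied uniformly.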
