
How to read a parquet file with lots of columns to a Dataset without a custom case class?

I want to use datasets instead of dataframes.

I'm reading a parquet file and want to infer the types directly:

val df: Dataset[Row] = spark.read.parquet(path)

I don't want a plain DataFrame (i.e. Dataset[Row]) but a typed Dataset.

I know I can do something like:

val df = spark.read.parquet(path).as[MyCaseClass]
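
Spelled out, that means an extra import and a case class that mirrors the schema (the two fields here are hypothetical stand-ins for my real columns):

import spark.implicits._ // provides the implicit Encoders that .as[T] needs

// A case class whose field names and types match the parquet schema.
case class MyCaseClass(id: Long, name: String)

val ds: Dataset[MyCaseClass] = spark.read.parquet(path).as[MyCaseClass]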

But my data has many columns, so it would be great if I could avoid writing a case class!


Why do you want to work with a Dataset? I assume it's not just for the schema (the resulting DataFrame gives you that for free anyway) but because you want that schema to be type-safe.

To create a Dataset you need an Encoder, and to get an Encoder you need a type that represents your data and hence its schema.
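
To see what an Encoder carries, here is a small sketch (MyCaseClass and its two fields are hypothetical): it binds a JVM type to a SQL schema.

import org.apache.spark.sql.{Encoder, Encoders}

case class MyCaseClass(id: Long, name: String)

// The Encoder knows both the JVM type and the corresponding SQL schema.
val enc: Encoder[MyCaseClass] = Encoders.product[MyCaseClass]
enc.schema.printTreeString()
// root
//  |-- id: long (nullable = false)
//  |-- name: string (nullable = true)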

So either narrow your selection down to a reasonable number of columns and use as[MyCaseClass], or accept what DataFrame offers.
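
If writing the case class is the only obstacle, one middle ground is a typed select into a tuple, because Spark ships Encoders for tuples of up to 22 fields. A sketch, assuming spark is your SparkSession and the file has (hypothetical) columns id and name:

import spark.implicits._

val ds: Dataset[(Long, String)] =
  spark.read.parquet(path)
    .select($"id".as[Long], $"name".as[String])

You still have to name the columns you care about, but you don't have to declare a dedicated class for them.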