Kai Kai - 5 months ago 129
Java Question

writing Spark Dataframe to JSON loses format for MLLIB Sparse Vector

I am writing a (Java) Spark Dataframe to json. One of the columns is an mllib sparse vector. Later I read the json file into a second Dataframe, but the sparse vector column is now a WrappedArray and is not read as a sparse vector in the second data frame. My question: is there anything I can do on the writing side OR the reading side in order to get a sparse vector column NOT a wrappedArray column?

Writing:

initialDF.coalesce(1).write().json("initial_dataframe");


Reading:

DataFrame secondDF = hiveContext.read().json("initial_dataframe");

Answer

The answer is simple. Provide schema for the DataFrameReader

import org.apache.spark.mllib.linalg.VectorUDT

val path: String = ???
val df = Seq((1L, Vectors.parse("(5, [1.0, 3.0], [2.0, 3.0])"))).toDF
df.write.json(path)

spark.read.json(path).printSchema
// root
//  |-- _1: long (nullable = true)
//  |-- _2: struct (nullable = true)
//  |    |-- indices: array (nullable = true)
//  |    |    |-- element: long (containsNull = true)
//  |    |-- size: long (nullable = true)
//  |    |-- type: long (nullable = true)
//  |    |-- values: array (nullable = true)
//  |    |    |-- element: double (containsNull = true)

When correct schema is provided

import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val schema = StructType(Seq(
  StructField("_1", LongType, true),
  StructField("_2", new VectorUDT, true)))

spark.read.schema(schema).json(path).printSchema
root
 |-- _1: long (nullable = true)
 |-- _2: vector (nullable = true)

spark.read.schema(schema).json(path).show(1)
// +---+-------------------+
// | _1|                 _2|
// +---+-------------------+
// |  1|(5,[1,3],[2.0,3.0])|
// +---+-------------------+

In general if you work with sources which don't provide mechanisms for schema discovery providing schema explicitly is a good idea.

If JSON is not a hard requirement Parquet will both preserve vector types and provide schema discovery mechanisms.

Comments