Pi Pi Pi Pi - 6 months ago 49
Scala Question

How to get Vector in DataFrame

I computed feature vectors with Spark ML's TF-IDF implementation. Now I want to extract the vectors stored in the "idfFeatures" column.


My code is:

// df is a placeholder name for the DataFrame that holds the TF-IDF output
val vectors = df.select("idfFeatures").map {
  case Row(vector: Vector) => vector
}

The compiler reports this error:

Error:(38, 24) type Vector takes type parameters
case Row(vector: Vector) =>

If I change Vector to String, the code compiles but fails at runtime with a different error:

scala.MatchError: [(262144,[622,4200,7303,8501......,2.1972245773362196,1.2809338454620642])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at scala.TFIDFTest2$$anonfun$1.apply(TFIDFTest2.scala:37)

How can I get the Vector?


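Scala puts scala.collection.immutable.Vector in scope by default, and that type takes a type parameter, which is why the compiler rejects the pattern. Matching on String fails at runtime because the column holds Spark vectors, not strings. Import Spark's own Vector type instead:

Spark 1.x: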

import org.apache.spark.mllib.linalg.Vector

Spark 2.0:

import org.apache.spark.ml.linalg.Vector

Full example (Spark 2.0):

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedData = hashingTF.transform(wordsData)

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

rescaledData.select("features").rdd.map { case Row(v: Vector) => v }.first
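
If you would rather not pattern match on Row, reading the field with getAs should work as well. The following is only a sketch, assuming Spark 2.x and a column that really contains org.apache.spark.ml.linalg.Vector values. The object name VectorExtraction, the helper extractVectors, and the toy DataFrame are made up for illustration and are not part of the question's code.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}

object VectorExtraction {
  // Hypothetical helper: collects every value of a vector-typed column to the driver.
  // Assumes `colName` holds org.apache.spark.ml.linalg.Vector values (Spark 2.x).
  def extractVectors(df: DataFrame, colName: String): Array[Vector] =
    df.select(colName).rdd.map(_.getAs[Vector](0)).collect()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("vector-extraction-sketch")
      .getOrCreate()

    // Toy stand-in for the TF-IDF output: one dense and one sparse vector.
    val df = spark.createDataFrame(Seq(
      (0, Vectors.dense(1.0, 0.0, 3.0)),
      (1, Vectors.sparse(3, Array(1), Array(2.5)))
    )).toDF("label", "features")

    val vectors = extractVectors(df, "features")

    // toArray expands sparse vectors into plain Array[Double] values.
    vectors.foreach(v => println(v.toArray.mkString(", ")))

    spark.stop()
  }
}
```

Note that collect() pulls every vector to the driver, so on a large DataFrame you would probably keep working with the RDD instead. With Spark 1.x the import would be org.apache.spark.mllib.linalg.Vector, and the SparkSession setup would need to be replaced by a SQLContext.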