solarenqu solarenqu - 1 year ago 132
Scala Question

Spark NaiveBayesTextClassification

I'm trying to create a text classifier app with Spark (1.6.2), but I don't know what I'm doing wrong. This is my code:

import org.apache.spark.ml.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.mllib
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

/**
 * Created by kebodev on 2016.11.29..
 */
object PredTest {

def main(args: Array[String]): Unit = {

val conf = new SparkConf()
.set("spark.executor.memory", "2gb")

val sc = new SparkContext(conf)

val sqlContext = new SQLContext(sc)

val trainData = sqlContext.read.json("src/main/resources/tst.json")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(trainData)
val hashingTF = new HashingTF()
val featurizedData = hashingTF.transform(wordsData)

val model = NaiveBayes.train(featurizedData)

}
}

The NaiveBayes object doesn't have a train method. What should I import?

If I try to use it this way:

val naBa = new NaiveBayes()

I get this exception:

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column label must be of type DoubleType but was actually StringType.
at scala.Predef$.require(Predef.scala:224)
at PredTest$.main(PredTest.scala:37)
at PredTest.main(PredTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at com.intellij.rt.execution.application.AppMain.main(

This is what my JSON file looks like:

{"text":"any text","label":"6.0"}

I'm really new to this topic. Can anyone help me create a model and then predict a new value?

Thank you!

Answer Source

Labels and Feature Vectors only contain Doubles. Your label column contains a String.

See your stacktrace:

Column label must be of type DoubleType but was actually StringType.
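
Since the label in the sample JSON is a number wrapped in quotes, one way to get past this particular check is to cast the column after loading. This is only a minimal sketch: the object name LabelCast, the helper castLabel, and the in-memory row standing in for the JSON file are assumptions made for illustration. Note also that Spark ML classifiers generally treat the label as a class index (0.0, 1.0, ...), so a plain cast is mainly appropriate when the labels already have that form.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

object LabelCast {
  // Hypothetical helper: replaces the string "label" column with a
  // DoubleType column of the same name.
  def castLabel(df: DataFrame): DataFrame =
    df.withColumn("label", col("label").cast(DoubleType))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LabelCast").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Stand-in for sqlContext.read.json(...): the label arrives as a string.
    val raw = sqlContext.createDataFrame(Seq(("any text", "6.0"))).toDF("text", "label")

    // Print the schema after the cast to confirm the new column type.
    castLabel(raw).printSchema()

    sc.stop()
  }
}
```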

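You can use StringIndexer to convert the string label into a numeric (Double) index; CountVectorizer is the corresponding tool for turning tokenized text into numeric feature vectors. See the Spark ML documentation for further details.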
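
To connect this to the original question (creating a model and predicting a new value), here is a rough sketch using the spark.ml API. The training rows, the column names labelIndex and features, and the object name NaiveBayesSketch are all made up for illustration; in the real app the DataFrame would come from the JSON file instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, StringIndexer, StringIndexerModel, Tokenizer}
import org.apache.spark.sql.{DataFrame, SQLContext}

object NaiveBayesSketch {
  // Fits a label indexer and a text pipeline (tokenize -> hash -> Naive Bayes).
  def train(trainData: DataFrame): (StringIndexerModel, PipelineModel) = {
    // Map each distinct string label to a Double index in a new column.
    val indexerModel = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("labelIndex")
      .fit(trainData)
    val indexed = indexerModel.transform(trainData)

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val naiveBayes = new NaiveBayes().setLabelCol("labelIndex").setFeaturesCol("features")

    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](tokenizer, hashingTF, naiveBayes))
    (indexerModel, pipeline.fit(indexed))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NaiveBayesSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Made-up rows standing in for the JSON file; labels are strings, as in the question.
    val trainData = sqlContext.createDataFrame(Seq(
      ("spark is fast", "6.0"),
      ("spark runs on clusters", "6.0"),
      ("cats sleep all day", "3.0"),
      ("cats chase mice", "3.0")
    )).toDF("text", "label")

    val (indexerModel, model) = train(trainData)

    // New, unlabeled text only needs the "text" column.
    val newData = sqlContext.createDataFrame(Seq(Tuple1("spark on a cluster"))).toDF("text")
    model.transform(newData).select("text", "prediction").show()

    // The prediction is an index into this array of original label strings.
    println(indexerModel.labels.mkString(", "))

    sc.stop()
  }
}
```

The label indexer is fitted outside the pipeline here so that the fitted pipeline can be applied to rows that have no label column at all.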
