Johnny000 - 7 months ago
Java Question

Spark MLLib TFIDF implementation for LogisticRegression

I am trying to use the new TF-IDF algorithm that Spark 1.1.0 offers. I'm writing my MLlib job in Java, but I can't figure out how to get the TF-IDF implementation working. For some reason, IDFModel.transform only accepts a JavaRDD as input, not a single Vector. How can I use the given classes to build TF-IDF vectors for my LabeledPoints?

Note: The document lines are in the format [Label; Text]




Here is my code so far:

// 1.) Load the documents
JavaRDD<String> data = sc.textFile("/home/johnny/data.data.new");

// 2.) Hash all documents
// tf is final so the anonymous class below can capture it (required before Java 8)
final HashingTF tf = new HashingTF();
JavaRDD<Tuple2<Double, Vector>> tupleData = data.map(new Function<String, Tuple2<Double, Vector>>() {
    @Override
    public Tuple2<Double, Vector> call(String v1) throws Exception {
        // Each line has the form "label;text"
        String[] parts = v1.split(";");
        List<String> myList = Arrays.asList(parts[1].split(" "));
        return new Tuple2<Double, Vector>(Double.parseDouble(parts[0]), tf.transform(myList));
    }
});

tupleData.cache();

// 3.) Create a flat RDD with all vectors
JavaRDD<Vector> hashedData = tupleData.map(new Function<Tuple2<Double, Vector>, Vector>() {
    @Override
    public Vector call(Tuple2<Double, Vector> v1) throws Exception {
        return v1._2;
    }
});

// 4.) Create an IDFModel from our flat vector RDD
IDFModel idfModel = new IDF().fit(hashedData);

// 5.) Create LabeledPoint RDD with TF-IDF
???


Solution from Sean Owen:

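import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.feature.IDF;
import org.apache.spark.mllib.feature.IDFModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;

// 1.) Load the documents (sc is an existing JavaSparkContext)
JavaRDD<String> data = sc.textFile("/home/johnny/data.data.new");

// 2.) Hash all documents; each line has the form "label;text"
HashingTF tf = new HashingTF();
JavaRDD<LabeledPoint> tupleData = data.map(v1 -> {
    String[] datas = v1.split(";");
    List<String> myList = Arrays.asList(datas[1].split(" "));
    return new LabeledPoint(Double.parseDouble(datas[0]), tf.transform(myList));
});

// 3.) Create a flat RDD with all vectors
JavaRDD<Vector> hashedData = tupleData.map(label -> label.features());

// 4.) Create an IDFModel from our flat vector RDD
IDFModel idfModel = new IDF().fit(hashedData);

// 5.) Create the TF-IDF RDD
JavaRDD<Vector> idf = idfModel.transform(hashedData);

// 6.) Create the LabeledPoint RDD by zipping each TF-IDF vector with its original label
JavaRDD<LabeledPoint> idfTransformed = idf.zip(tupleData).map(t -> {
    return new LabeledPoint(t._2().label(), t._1());
});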

Answer

IDFModel.transform() accepts a JavaRDD or RDD of Vector, as you see. It does not make sense to compute a model over a single Vector, so that's not what you're looking for, right?
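To illustrate why the model is fit on a collection, here is a minimal plain-Java sketch, not MLlib code. It assumes the smoothed formula given in the MLlib documentation, log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. The class name, method names, and sample corpus are made up for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class IdfSketch {

    // Counts, for each term, how many documents contain it at least once.
    public static Map<String, Integer> documentFrequencies(List<List<String>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : corpus) {
            // A set, so that repeated terms in one document count only once.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Smoothed IDF (assumed formula): log((m + 1) / (df + 1)).
    public static double idf(int numDocs, int docFreq) {
        return Math.log((numDocs + 1.0) / (docFreq + 1.0));
    }

    public static void main(String[] args) {
        // Made-up sample corpus of tokenized documents.
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("spark", "is", "fast"),
            Arrays.asList("spark", "mllib"),
            Arrays.asList("logistic", "regression"));
        Map<String, Integer> df = documentFrequencies(corpus);
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            System.out.println(e.getKey() + " -> " + idf(corpus.size(), e.getValue()));
        }
    }
}
```

With a single document, m = 1 and df = 1 for every term it contains, so under this formula every weight is log(1) = 0. The weights only carry information once they are computed over many documents.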

I assume you're working in Java, so you mean you want to apply this to a JavaRDD<LabeledPoint>. A LabeledPoint contains a Vector and a label. IDF is not a classifier or regressor, so it needs no label. You can map over the LabeledPoints to extract just their Vectors.

But you already have a JavaRDD<Vector> above. TF-IDF is merely a way of mapping words to real-valued features based on word frequencies in the corpus. It also does not output a label. Maybe you mean you want to develop a classifier from TF-IDF-derived feature vectors, and some other labels you already have?
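If that is the goal, the idea can be sketched in plain Java without Spark: the labels are set aside, the features go through TF and then IDF, and the two are paired up again by position. This is only an illustration under assumptions. The hashing step imitates the hashing trick that HashingTF is based on, the 16-bucket feature space is a hypothetical value chosen for readability, the IDF weights assume the smoothed formula log((m + 1) / (df + 1)) from the MLlib documentation, and the sample lines are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LabeledTfIdfSketch {

    // Hypothetical small feature space, chosen only to keep the output readable.
    static final int NUM_FEATURES = 16;

    // Hashing trick: each term increments the bucket selected by its hash code.
    public static double[] termFrequencies(List<String> terms) {
        double[] tf = new double[NUM_FEATURES];
        for (String term : terms) {
            tf[Math.floorMod(term.hashCode(), NUM_FEATURES)] += 1.0;
        }
        return tf;
    }

    // Corpus-level step: one smoothed IDF weight per bucket, log((m + 1) / (df + 1)).
    public static double[] fitIdf(List<double[]> tfVectors) {
        double[] idf = new double[NUM_FEATURES];
        int m = tfVectors.size();
        for (int i = 0; i < NUM_FEATURES; i++) {
            int df = 0;
            for (double[] tf : tfVectors) {
                if (tf[i] > 0) {
                    df++;
                }
            }
            idf[i] = Math.log((m + 1.0) / (df + 1.0));
        }
        return idf;
    }

    // Per-document step: scale each term frequency by the IDF weight of its bucket.
    public static double[] transform(double[] tf, double[] idf) {
        double[] out = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            out[i] = tf[i] * idf[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up lines in the "label;text" format from the question.
        List<String> lines = Arrays.asList("1;spark is fast", "0;hadoop is slow", "1;spark mllib");
        List<Double> labels = new ArrayList<>();
        List<double[]> tfs = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split(";");
            labels.add(Double.parseDouble(parts[0]));
            tfs.add(termFrequencies(Arrays.asList(parts[1].split(" "))));
        }
        double[] idf = fitIdf(tfs);
        // Labels and TF-IDF vectors are paired by position, as zip does for RDDs.
        for (int i = 0; i < lines.size(); i++) {
            System.out.println(labels.get(i) + " " + Arrays.toString(transform(tfs.get(i), idf)));
        }
    }
}
```

The label never enters the TF or IDF computation. It is carried alongside and reattached at the end, which is the role a LabeledPoint would play before the data is handed to a classifier.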

Maybe that clears things up; otherwise, you'd have to clarify considerably what you are trying to achieve with TF-IDF.
