Mnemosyne - 2 months ago
Scala Question

Why does Spark MLlib HashingTF output only 1D Vectors?

So I have this big dataframe with the format:

dataframe:

org.apache.spark.sql.DataFrame = [id: string, data: string]


The data column is a very large set of words/identifiers. It also contains unwanted symbols like ["{ etc., which I need to clean up.

My solution for this clean up is:

val dataframe2 = sqlContext.createDataFrame(dataframe.map(x=> Row(x.getString(0), x.getAs[String](1).replaceAll("[^a-zA-Z,_:]",""))), dataframe.schema)


I need to apply ML to this data, so it goes through the pipeline like this:


  1. Tokenizing, which gives



org.apache.spark.sql.DataFrame = [id: string, data: string, tokenized_data: array<string>]


with output (without the data column):

[id1,WrappedArray(ab,abc,nuj,bzu...)]



  2. StopWords Removal



org.apache.spark.sql.DataFrame = [id: string, data: string, tokenized_data: array<string>, newData: array<string>]


with output (without data and tokenized_data):

[id1,WrappedArray(ab,abc,nuj,bzu...)]



  3. HashingTF



org.apache.spark.sql.DataFrame = [id: string, data: string, tokenized_data: array<string>, newData: array<string>, hashedData: vector]


and the vectors look like this:

[id1,(262144,[236355],[1.0])]
[id2,(262144,[152325],[1.0])]
[id3,(262144,[27653],[1.0])]
[id4,(262144,[199400],[1.0])]
[id5,(262144,[82931],[1.0])]


Each of the arrays produced by the previous stages can contain anywhere from zero up to dozens of features. And yet virtually all of my vectors end up with a single non-zero entry. I want to do some clustering with this data, but the one-dimensionality is a big problem. Why is this happening and how can I fix it?

I figured out that the error happens precisely when I clean up the data. If I skip the clean-up, HashingTF behaves normally. What am I doing wrong in the clean-up, and how can I perform a similar clean-up without breaking the format?

Answer

The character class [^a-zA-Z,_:] also matches whitespace, so replaceAll strips every space from data. The result is a single continuous string, which the Tokenizer turns into a single token, and HashingTF therefore produces a Vector with only one non-zero entry. Either exclude whitespace from the substitution (add \s to the negated class) or use RegexTokenizer as a replacement, which tokenizes and filters in one step.
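A minimal sketch of the problem and the fix on a plain string (the sample input here is invented for illustration):

```scala
val raw = "[\"{ab, abc}\" nuj bzu]"

// Original clean-up: the negated class also matches spaces,
// so all words are fused into one continuous token.
val fused = raw.replaceAll("[^a-zA-Z,_:]", "")
// fused == "ab,abcnujbzu"

// Fixed clean-up: add \s to the negated class so whitespace
// survives and Tokenizer can still split on it.
val cleaned = raw.replaceAll("[^a-zA-Z,_:\\s]", "")
// cleaned == "ab, abc nuj bzu"
```

With the whitespace-preserving regex, Tokenizer again splits the cleaned string into several tokens, so HashingTF produces vectors with multiple entries. Alternatively, a RegexTokenizer (e.g. with a pattern such as "[^a-zA-Z,_:]+" as the split pattern) could replace both the manual clean-up and the Tokenizer in one stage.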