Mnemosyne - 2 years ago - 122
Scala Question

Why does Spark MLlib HashingTF output only 1D Vectors?

So I have this big dataframe with the format:


org.apache.spark.sql.DataFrame = [id: string, data: string]

Data is a very large set of words/identifiers. It also contains unnecessary symbols such as " [ { etc. which I need to clean up.

My solution for this clean-up is:

val dataframe2 = sqlContext.createDataFrame(dataframe.map(x => Row(x.getString(0), x.getAs[String](1).replaceAll("[^a-zA-Z,_:]", ""))), dataframe.schema)
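The effect of that replaceAll can be checked on a plain string outside Spark; a quick sketch with a made-up sample value for one data cell:

```scala
// made-up sample of one "data" cell, with the junk symbols described above
val raw = """["{id_a: alpha, id_b: beta}"]"""

// keep only letters, commas, underscores and colons
val cleaned = raw.replaceAll("[^a-zA-Z,_:]", "")

println(cleaned)  // → id_a:alpha,id_b:beta
```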

I need to apply ML to this data, so it goes through a pipeline like this:

  1. Tokenizing, which gives

org.apache.spark.sql.DataFrame = [id: string, data: string, tokenized_data: array<string>]


  2. StopWords removal, which gives

org.apache.spark.sql.DataFrame = [id: string, data: string, tokenized_data: array<string>, newData: array<string>]


  3. HashingTF, which gives

org.apache.spark.sql.DataFrame = [id: string, data: string, tokenized_data: array<string>, newData: array<string>, hashedData: vector]


Each of the arrays produced by the previous stages can contain anywhere from zero to dozens of elements, and yet virtually all of my vectors end up one-dimensional. I want to do some clustering with this data, but the one-dimensionality is a big problem. Why is this happening and how can I fix it?
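For reference, the three stages above can be wired into a single spark.ml Pipeline; a sketch assuming the column names from the schemas shown and the cleaned dataframe2 from earlier:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, StopWordsRemover, Tokenizer}

// Stage 1: split the cleaned "data" column into an array of tokens
val tokenizer = new Tokenizer()
  .setInputCol("data")
  .setOutputCol("tokenized_data")

// Stage 2: drop stop words from the token array
val remover = new StopWordsRemover()
  .setInputCol("tokenized_data")
  .setOutputCol("newData")

// Stage 3: hash the remaining tokens into a term-frequency vector
val hashingTF = new HashingTF()
  .setInputCol("newData")
  .setOutputCol("hashedData")

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, hashingTF))
val hashed = pipeline.fit(dataframe2).transform(dataframe2)
```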

I figured out that the error happens precisely when I clean up the data. If I don't do the clean up, HashingTF performs normally. What am I doing wrong in the clean up and how can I perform a similar clean up without messing with the format?

Answer Source

The character class [^a-zA-Z,_:] also matches whitespace, so replaceAll strips every space from the string. The result is one long continuous string, which the Tokenizer turns into a single token, and hence a Vector with a single entry. Either exclude whitespace from the character class, or use a RegexTokenizer as a replacement.
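A minimal sketch of the first fix on a plain made-up string: add the space character to the allowed set so word boundaries survive the clean-up:

```scala
// made-up sample cell with junk symbols around real tokens
val raw = """["{alpha, beta_gamma: delta}"]"""

// note the trailing space inside the character class:
// spaces are now kept, so the Tokenizer can still split on them
val cleaned = raw.replaceAll("[^a-zA-Z,_: ]", "")

println(cleaned)  // → alpha, beta_gamma: delta
```

Alternatively, let a RegexTokenizer do the splitting and clean-up in one step, e.g. with a gaps pattern like [^a-zA-Z,_:]+ so any run of disallowed characters acts as a token separator.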
