Dimitris Dimitris - 2 months ago
Scala Question

Spark ML VectorAssembler returns strange output

I am experiencing very strange behaviour from VectorAssembler, and I was wondering if anyone else has seen this.

My scenario is pretty straightforward. I parse data from a file where I have some standard fields, and I also calculate some extra columns. My parsing function returns this:

val joinedCounts = countPerChannel ++ countPerSource // two arrays of Doubles joined together
(label, orderNo, pageNo, Vectors.dense(joinedCounts))

My main function uses the parsing function like this:

val parsedData = rawData.filter(row => row != header).map(parseLine)
val data = sqlContext.createDataFrame(parsedData).toDF("label", "orderNo", "pageNo","joinedCounts")

I then use a VectorAssembler like this:

val assembler = new VectorAssembler()
  .setInputCols(Array("orderNo", "pageNo", "joinedCounts"))
  .setOutputCol("features") // an output column must be set before transform

val assemblerData = assembler.transform(data)

So when I print a row of my data before it goes into the VectorAssembler, the vector looks like an ordinary dense vector:

[17.0,15.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0]

After the transform function of VectorAssembler I print the same row of data and get this:

(18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])
What on earth is going on? What has the VectorAssembler done? I've double-checked all the calculations and even followed the simple Spark examples, and I cannot see what is wrong with my code. Can you?


There is nothing strange about the output. Your vector seems to have lots of zero elements, so Spark used a sparse representation of your Vector.

To explain further:

It seems your vector is composed of 18 elements (its dimension).

The indices [0,1,6,9,14,17] of the vector contain the non-zero elements, which are, in order, [17.0,15.0,3.0,1.0,4.0,2.0].
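To see how those pieces fit together, here is a minimal plain-Scala sketch (no Spark dependency; toDense here is a hypothetical helper, not Spark's API) that expands the (size, indices, values) triple back into the dense form:

```scala
object SparseDemo {
  // Expand a sparse (size, indices, values) triple into a dense array:
  // every slot defaults to 0.0 and only the listed indices are filled in.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  def main(args: Array[String]): Unit = {
    // The vector from the question: 18 slots, six of them non-zero.
    val dense = toDense(18,
      Array(0, 1, 6, 9, 14, 17),
      Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))
    println(dense.mkString("[", ",", "]"))
  }
}
```

Running this prints the 18-element dense vector, with zeros in every slot that the sparse form leaves out.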

A sparse vector representation is a way to save memory, which also makes many computations easier and faster. More on sparse representation here.

Now of course you can convert that sparse representation to a dense representation, but it comes at a cost.
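In Spark that conversion is toDense on a vector from org.apache.spark.ml.linalg. The cost is memory: a rough back-of-the-envelope sketch in plain Scala (counting primitive payload only and ignoring JVM object headers, which is an assumption on my part):

```scala
object SparseCost {
  // A dense vector stores every slot as an 8-byte Double;
  // a sparse one stores each non-zero as a 4-byte Int index plus an 8-byte Double value.
  def denseBytes(size: Int): Int = size * 8
  def sparseBytes(nonZeros: Int): Int = nonZeros * (4 + 8)

  def main(args: Array[String]): Unit = {
    // The question's vector: 18 slots, 6 non-zero entries.
    println(s"dense:  ${denseBytes(18)} bytes of payload")
    println(s"sparse: ${sparseBytes(6)} bytes of payload")
  }
}
```

For this vector the sparse form needs roughly half the payload (72 vs 144 bytes), and the gap widens as vectors get larger and sparser.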