Dimitris Dimitris - 24 days ago 14
Scala Question

Spark ML VectorAssembler returns strange output

I am experiencing very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this.

My scenario is pretty straightforward. I parse data from a CSV file where I have some standard Int and Double fields, and I also calculate some extra columns. My parsing function returns this:

val joinedCounts = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))


My main function uses the parsing function like this:

val parsedData = rawData.filter(row => row != header).map(parseLine)
val data = sqlContext.createDataFrame(parsedData).toDF("label", "orderNo", "pageNo","joinedCounts")


I then use a VectorAssembler like this:

val assembler = new VectorAssembler()
  .setInputCols(Array("orderNo", "pageNo", "joinedCounts"))
  .setOutputCol("features")

val assemblerData = assembler.transform(data)


So when I print a row of my data before it goes into the VectorAssembler it looks like this:

[3.2,17.0,15.0,[0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0]]


After the transform function of VectorAssembler I print the same row of data and get this:

[3.2,(18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])]


What on earth is going on? What has the VectorAssembler done? I've double-checked all the calculations and even followed the simple Spark examples, and I cannot see what is wrong with my code. Can you?

Answer

There is nothing strange about the output. Your vector has many zero elements, so Spark used a sparse representation for it.

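To explain further:

Your vector is composed of 18 elements (its dimension): orderNo, pageNo, and the 16 values of joinedCounts.

The indices [0,1,6,9,14,17] are the positions in the vector that hold non-zero elements, and their values are, in order, [17.0,15.0,3.0,1.0,4.0,2.0]. Every position not listed is 0.0.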
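
To make the reading concrete, here is a minimal sketch in plain Scala (no Spark dependency) that expands the printed (size, indices, values) triple back into a full array. The object and method names (SparseReading, expand) are made up for illustration and are not part of Spark.

```scala
object SparseReading {
  // Expand a (size, indices, values) triple into a plain dense array.
  // Positions that are not listed in `indices` are implicitly 0.0.
  def expand(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must pair up")
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  def main(args: Array[String]): Unit = {
    // The "features" value printed in the question.
    val features = expand(18, Array(0, 1, 6, 9, 14, 17), Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))
    // The first two entries are orderNo and pageNo, the remaining 16 are joinedCounts.
    println(features.mkString("[", ",", "]"))
  }
}
```

The expanded array starts with 17.0 and 15.0 and continues with exactly the joinedCounts values from your input row, so no data was lost or changed.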

A sparse vector representation saves memory by storing only the non-zero entries, which can also make computations faster. More on sparse representation here.

Now of course you can convert that sparse representation to a dense representation, but it comes at a cost: every zero is then stored explicitly.
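
If you do want the dense form, one possible way is sketched below. It assumes Spark 2.x or later, where the vector types live in org.apache.spark.ml.linalg (on Spark 1.x the equivalent package would be org.apache.spark.mllib.linalg), and it assumes the output column is named "features" as in the question. The names DenseFeatures, sample and withDenseFeatures are hypothetical helpers, not Spark API.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

object DenseFeatures {
  // The sparse vector from the question, rebuilt by hand for illustration.
  val sample: Vector =
    Vectors.sparse(18, Array(0, 1, 6, 9, 14, 17), Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0))

  // Replace the "features" column with its dense equivalent.
  // Assumption: `df` has a vector column called "features".
  def withDenseFeatures(df: DataFrame): DataFrame = {
    val toDense = udf((v: Vector) => v.toDense)
    df.withColumn("features", toDense(col("features")))
  }

  def main(args: Array[String]): Unit = {
    // Converting a single vector needs no DataFrame at all.
    println(sample.toDense)
  }
}
```

With the DataFrame from the question this would be called as DenseFeatures.withDenseFeatures(assemblerData). Spark ML estimators accept either representation, so the conversion is usually only worth it for display or for code that needs a plain array.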
