KyBe KyBe - 17 days ago 10
Scala Question

Hamming Distance on binary Vector in Scala

I would like to have a fast implementation of Hamming distance on binary vectors.
I tested it on

Array[Byte]
rather than
Array[Int]
thinking that it would be faster but it's not the case.
If someone can explain me this behaviour and/or advise me on a better implementation.

def hammingDistanceI(v1:Array[Int], v2:Array[Int]) = {
v1.zip(v2).count{case(a,b) => a!=b}
}
def hammingDistanceB(v1:Array[Byte], v2:Array[Byte]) = {
v1.zip(v2).count{case(a,b) => a!=b}
}

def speedMeasureByte(v:Array[Byte], nbIte:Int) = {
val t0 = System.nanoTime
for(i<-0 to nbIte-1) hammingDistanceB(v,v)
val t1 = System.nanoTime
(t1-t0)/1000000
}

def speedMeasureInt(v:Array[Int], nbIte:Int) = {
val t0 = System.nanoTime
for(i<-0 to nbIte-1) hammingDistanceI(v,v)
val t1 = System.nanoTime
(t1-t0)/1000000
}

val v1Int = Array.fill(100)(Random.nextInt(2))
val v1Byte = v1Int.map(_.toByte)

val (tInt, tByte) = (speedMeasureInt(v1Int,1000000),
speedMeasureByte(v1Byte,1000000))

// tInt = 1636 ms
// tByte = 3307 ms

Answer

I am not sure why byte implementation is slower than the other, but suspect it has to do with the way != is implemented - cpu registers are better equipped to deal with four-byte sequences nowadays than with single bytes.

The above is just my guess though, don't bet your house on it.

As for a faster implementation, if your use case is such, where single nanoseconds matter, you'll have to abandon the elegance of scala collections and stick with the old good loops:

 def hd(a: Array[Int], b: Array[Int]) { 
   var c = 0
   var i = 0
   while(i < a.length) { c += a(i)^b(i); i+=1 }
   c
 }

This should be several hundred times faster on average than your implementation.