Greg Peckory Greg Peckory - 17 days ago 4
Java Question

Double Hashing efficiency with word dictionary

Simply put I have a dictionary of words and I'm adding them to a hash table.

I am using Double Hashing (not the conventional method) and the following is yielding the best result.

public static int getHashKey(String word) {

int index = 0;

for(int i = 0; i<word.length(); i++){

index += Math.pow(4, i)*((int)word.charAt(i));
index = index % size;
}
return index;
}

public static int getDoubleHashKey(String word) {

int jump = 1;

for(int i = 0; i<word.length(); i++){

jump = jump * word.charAt(i);
jump = jump % size;
}
return jump;

}


This is giving me 127,000 collisions. I also have a 2 fold prime hash table size and it cannot be changed.

Is there any way the Double Hashing algorithm can be improved? (Either of the 2 methods above).

I know it depends on what we are storing in the hash table etc. but is there any intuitive method or some tips that apply more generally so I can avoid a couple more collisions.

Answer

I ran a little Scala program on a dictionary of about 336 531 entries. There are significantly less collisions for the version 2 (118 142) than for the version 1 (305 431). Notice that the version 2 is close to an optimal number of collisions because 118 142 + 216 555 = 334 697, so 334 697/336 531 = 99,46% of values used in the 0-216555 range. Using the modulo outside the loop does improve your hash method.

import scala.io.Source

object Hash extends App {
    val size = 216555
    def doubleHashKey1(word: String) = {
        var jump = 1;
        for (ch <- word) {
            jump = jump * ch;
            jump = jump % size;
        }
        jump
    }

    def doubleHashKey2(word: String) = {
        var jump = 1;
        for (ch <- word) jump = jump * ch;
        jump % size;
    }

    def countCollisions(words: Set[String], hashFun: String => Int) = words.size - words.map(hashFun).size
    def readDictionary(path: String) = Source.fromFile(path).getLines.toSet

    val dict = readDictionary("words.txt")
    println(countCollisions(dict,doubleHashKey1))
    println(countCollisions(dict,doubleHashKey2))
}

For handling the integer overflow, you must use a different (but trivial to implement) way to compute the modulo in order to return positive values. Another test to do would be to see if the collisions are statically well distributed.