Anandan - 15 days ago 6

Java Question

I am trying to find similar hashes (hexadecimal hash) using hamming and Levenshtein distance. Lets say two hashes are similar if their hamming distance is less than 10 (number of differing bits).

`Hash 1= ffffff (base 16)`

Hash 2= fffff0 (base 16)

The hamming distance between two hashes is 4. They are similar. Because,

`Hash 1= 11111111 11111111 11111111 (base 2)`

Hash 2= 11111111 11111111 11110000 (base 2)

I have 8 million such hashes. I am wondering what will be a suitable data structure for storing the 8 million hashes. I initially tried "Trie" but consider the following scenario,

`Hash 1 = 0fabde (00001111 10101011 11011110)`

Hash 2 = adcbfe (10101010 11001011 11111110)

The hamming distance is 7. So I cannot do prefix search.

I know that i can use XOR and Integer.bitCount() to get the number of differing bits, but I have one target hash and 8 million hashes to search against i.e Given a hash i have to find all the similar hashes in 8 million hashes that we have in repository.

Is there any way store the hashes effectively so that my search base is reduced?

Answer

If the hashes are as small as shown, you can index them "directly" - that is, put them in a big array and do just do some math on the index.

It's fairly simple to generate only the indexes that may correspond to hashes that are within the requested hamming distance `d`

, just XOR the key with all masks that contain up to `d`

set bits (see below). Since there are 8 million hashes but only 16 million could exist, about half of the visited indexes are expected to be "useful" ie there will be something there to find.

To generate the masks, you can use the old NextBitPermutation trick, which has been posted on StackOverflow several times before, for example here. For java, just use the logical right shift and replace `__builtin_ctz`

by `numberOfTrailingZeros`

to get (not tested)

```
int t = v | (v - 1);
int w = (t + 1) | (((~t & -~t) - 1) >>> (Integer.numberOfTrailingZeros(v) + 1));
```

Here `w`

would be the bit permutation after `v`

.

The global structure would be something like (not tested)

```
for (int k = 1; k <= d; k++) {
int diff = (1 << k) - 1;
while (diff <= 0xFFFFFF) {
if (hashes[key ^ diff])
// do something with it
diff = nextBitPermutation(diff);
}
}
```