superche superche - 1 month ago 21
Python Question

Same consistent-hashing algorithm implementation for Java and Python program

We have an app that the Python module will write data to redis shards and the Java module will read data from redis shards, so I need to implement the exact same consistent hashing algorithm for Java and Python to make sure the data can be found.




I googled around and tried several implementations, but found the Java and Python implementations are always different, can't be used togather. Need your help.

Edit, online implementations I have tried:

Java: http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html

Python: http://techspot.zzzeek.org/2012/07/07/the-absolutely-simplest-consistent-hashing-example/

http://amix.dk/blog/post/19367

Edit, attached Java (Google Guava lib used) and Python code I wrote. Code are based on the above articles.

import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;
import com.google.common.hash.HashFunction;

public class ConsistentHash<T> {
private final HashFunction hashFunction;
private final int numberOfReplicas;
private final SortedMap<Long, T> circle = new TreeMap<Long, T>();

public ConsistentHash(HashFunction hashFunction, int numberOfReplicas,
Collection<T> nodes) {
this.hashFunction = hashFunction;
this.numberOfReplicas = numberOfReplicas;

for (T node : nodes) {
add(node);
}
}

public void add(T node) {
for (int i = 0; i < numberOfReplicas; i++) {
circle.put(hashFunction.hashString(node.toString() + i).asLong(),
node);
}
}

public void remove(T node) {
for (int i = 0; i < numberOfReplicas; i++) {
circle.remove(hashFunction.hashString(node.toString() + i).asLong());
}
}

public T get(Object key) {
if (circle.isEmpty()) {
return null;
}
long hash = hashFunction.hashString(key.toString()).asLong();
if (!circle.containsKey(hash)) {
SortedMap<Long, T> tailMap = circle.tailMap(hash);
hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
}
return circle.get(hash);
}
}


Test code:

ArrayList<String> al = new ArrayList<String>();
al.add("redis1");
al.add("redis2");
al.add("redis3");
al.add("redis4");

String[] userIds =
{"-84942321036308",
"-76029520310209",
"-68343931116147",
"-54921760962352"
};
HashFunction hf = Hashing.md5();

ConsistentHash<String> consistentHash = new ConsistentHash<String>(hf, 100, al);
for (String userId : userIds) {
System.out.println(consistentHash.get(userId));
}


Python code:

import bisect
import md5

class ConsistentHashRing(object):
"""Implement a consistent hashing ring."""

def __init__(self, replicas=100):
"""Create a new ConsistentHashRing.

:param replicas: number of replicas.

"""
self.replicas = replicas
self._keys = []
self._nodes = {}

def _hash(self, key):
"""Given a string key, return a hash value."""

return long(md5.md5(key).hexdigest(), 16)

def _repl_iterator(self, nodename):
"""Given a node name, return an iterable of replica hashes."""

return (self._hash("%s%s" % (nodename, i))
for i in xrange(self.replicas))

def __setitem__(self, nodename, node):
"""Add a node, given its name.

The given nodename is hashed
among the number of replicas.

"""
for hash_ in self._repl_iterator(nodename):
if hash_ in self._nodes:
raise ValueError("Node name %r is "
"already present" % nodename)
self._nodes[hash_] = node
bisect.insort(self._keys, hash_)

def __delitem__(self, nodename):
"""Remove a node, given its name."""

for hash_ in self._repl_iterator(nodename):
# will raise KeyError for nonexistent node name
del self._nodes[hash_]
index = bisect.bisect_left(self._keys, hash_)
del self._keys[index]

def __getitem__(self, key):
"""Return a node, given a key.

The node replica with a hash value nearest
but not less than that of the given
name is returned. If the hash of the
given name is greater than the greatest
hash, returns the lowest hashed node.

"""
hash_ = self._hash(key)
start = bisect.bisect(self._keys, hash_)
if start == len(self._keys):
start = 0
return self._nodes[self._keys[start]]


Test code:

import ConsistentHashRing

if __name__ == '__main__':
server_infos = ["redis1", "redis2", "redis3", "redis4"];
hash_ring = ConsistentHashRing()
test_keys = ["-84942321036308",
"-76029520310209",
"-68343931116147",
"-54921760962352",
"-53401599829545"
];

for server in server_infos:
hash_ring[server] = server

for key in test_keys:
print str(hash_ring[key])

lvc lvc
Answer

You seem to be running into two issues simultaneously: encoding issues and representation issues.

Encoding issues come about particularly since you appear to be using Python 2 - Python 2's str type is not at all like Java's String type, and is actually more like a Java array of byte. But Java's String.getBytes() isn't guaranteed to give you a byte array with the same contents as a Python str (they probably use compatible encodings, but aren't guaranteed to - even if this fix doesn't change things, it's a good idea in general to avoid problems in the future).

So, the way around this is to use a Python type that behaves like Java's String, and convert the corresponding objects from both languages to bytes specifying the same encoding. From the Python side, this means you want to use the unicode type, which is the default string literal type if you are using Python 3, or put this near the top of your .py file:

from __future__ import unicode_literals

If neither of those is an option, specify your string literals this way:

u'text'

The u at the front forces it to unicode. This can then be converted to bytes using its encode method, which takes (unsurprisingly) an encoding:

u'text'.encode('utf-8')

From the Java side, there is an overloaded version of String.getBytes that takes an encoding - but it takes it as a java.nio.Charset rather than a string - so, you'll want to do:

"text".getBytes(java.nio.charset.Charset.forName("UTF-8"))

These will give you equivalent sequences of bytes in both languages, so that the hashes have the same input and will give you the same answer.

The other issue you may have is representation, depending on which hash function you use. Python's hashlib (which is the preferred implementation of md5 and other cryptographic hashes since Python 2.5) is exactly compatible with Java's MessageDigest in this - they both give bytes, so their output should be equivalent.

Python's zlib.crc32 and Java's java.util.zip.CRC32, on the other hand, both give numeric results - but Java's is always an unsigned 64 bit number, while Python's (in Python 2) is a signed 32 bit number (in Python 3, its now an unsigned 32-bit number, so this problem goes away). To convert a signed result to an unsigned one, do: result & 0xffffffff, and the result should be comparable to the Java one.