A geohash string is a feature in my sparse logistic regression model, so I used Java's String.hashCode() to turn each geohash string into an int feature id. But I found that hashCode() performs badly on similar geohash strings: different features end up with the same feature id, which may hurt model optimization even when the features are similar.
For example, these pairs of similar geohash strings have the same hashCode:
"wws8vw".hashCode() = -774715770
"wws8x9".hashCode() = -774715770
"wmxy0".hashCode() = 113265337
"wmxwn".hashCode() = 113265337
I think that you are misunderstanding the purpose of the Object.hashCode() method - not hashing in general, but the reason why Java objects have this method:
This method is supported for the benefit of hash tables such as those provided by HashMap.
So if you are trying to use this method as an input to a machine learning model, you're not using it for its intended purpose.
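To see why such similar strings collide, recall that String.hashCode() is documented as the polynomial s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] in int arithmetic. Two strings with the same prefix collide whenever their last two characters give the same value of 31*c1 + c2. The sketch below reproduces that formula by hand; the class and method names are made up for illustration.

```java
public class HashCodeCollision {
    // Hand-written version of the polynomial documented for String.hashCode().
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // int overflow wraps, as in the JDK
        }
        return h;
    }

    public static void main(String[] args) {
        // Same prefix "wws8", so only the last two characters matter.
        System.out.println(31 * 'v' + 'w'); // contribution of "vw"
        System.out.println(31 * 'x' + '9'); // contribution of "x9"
        System.out.println(polyHash("wws8vw") == polyHash("wws8x9"));
    }
}
```

Because the multiplier is only 31, a change of +2 in one character can be cancelled by a change of -62 in the next, which is well within the range of the geohash alphabet.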
The answer is reasonably obvious: you need to design your own hashing method - or select a pre-existing one - which gives you the desired collision profile for your expected inputs. The one used by String.hashCode() can't be changed by you.
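As one possible illustration of "select a pre-existing one", here is a minimal sketch that uses java.util.zip.CRC32 from the JDK instead of String.hashCode(). This is only an assumption about what might suit your data, not a recommendation of CRC32 as the best choice. The class name, method names, and the bucket count are hypothetical. CRC32 does have the property that two equal-length inputs differing only within their last few bytes never get the same checksum, so the pairs from the question no longer collide before the bucket reduction.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class GeohashFeatureId {
    // 32-bit hash of the string's UTF-8 bytes, via the JDK's CRC32.
    static int hash32(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return (int) crc.getValue();
    }

    // Map the hash into [0, numBuckets); numBuckets is a hypothetical
    // feature-space size chosen by the caller. Distinct hashes can still
    // share a bucket after this reduction.
    static int featureId(String s, int numBuckets) {
        return Math.floorMod(hash32(s), numBuckets);
    }

    public static void main(String[] args) {
        String[] geohashes = {"wws8vw", "wws8x9", "wmxy0", "wmxwn"};
        for (String g : geohashes) {
            System.out.println(g + " -> " + hash32(g)
                    + " -> bucket " + featureId(g, 1 << 20));
        }
    }
}
```

Whatever function you pick, it is worth measuring the collision rate on your actual set of geohash strings and bucket count before relying on it.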