I am trying to understand how HashMap is implemented in Java. I decided that I will try to understand every line (of code and comments) from that class and obviously I faced resistance very soon. The following snippet is from HashMap class and talks about Poisson Distribution:
- Ideally, under random hashCodes, the frequency of
- nodes in bins follows a Poisson distribution
- (http://en.wikipedia.org/wiki/Poisson_distribution) with a
- parameter of about 0.5 on average for the default resizing
- threshold of 0.75, although with a large variance because of
- resizing granularity. Ignoring variance, the expected
- occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
- factorial(k)). The first values are:
- 0: 0.60653066
- 1: 0.30326533
- 2: 0.07581633
- 3: 0.01263606
- 4: 0.00157952
- 5: 0.00015795
- 6: 0.00001316
- 7: 0.00000094
- 8: 0.00000006
- more: less than 1 in ten million
A HashMap is organized as an array of "buckets" based on the hashCode of the elements being inserted. Each bucket is (by default) a linked list of elements. Each bucket would have very few elements (ideally, at most one) so that finding a particular element requires very little searching down a linked list.
To take a simple example, let's say we have a HashMap of capacity 4 and a load factor of 0.75 (the default) which means that it can hold up to 3 elements before being resized. An ideal distribution of elements into buckets would look something like this:
bucket | elements -------+--------- 0 | Z 1 | X 2 | 3 | Y
so any element can be found immediately without any searching within a bucket. On the other hand, a very poor distribution of elements would look like this:
bucket | elements -------+--------- 0 | 1 | Z -> X -> Y 2 | 3 |
This will occur if all of the elements happen to hash into the same bucket, so searching for element Y will require traversing down the linked list.
This might not seem like a big deal, but if you have a HashMap with a capacity of 10,000 elements and there are 7,500 elements in a single bucket on a linked list, searching for a particular element will degrade to linear search time -- which is what using a HashMap is trying to avoid.
One issue is that the hashCode for distributing elements into buckets is determined by the objects themselves, and objects' hashCode implementations aren't always very good. If the hashCode isn't very good, then elements can bunch up in certain buckets, and the HashMap will begin to perform poorly.
The comment from the code is talking about the likelihood of different lengths of linked lists appearing in each bucket. First, it assumes the hashCodes are randomly distributed -- which isn't always the case! -- and I think it also assumes that the number of elements in the HashMap is 50% of the number of buckets. Under these assumptions, according to that Poisson distribution, 60.6% of the buckets will be empty, 30.3% will have one element, 7.5% will have two elements, 1.2% will have three elements, and so forth.
In other words, given those (ideal) assumptions, the linked lists within each bucket will usually be very short.
In JDK 8 there is an optimization to turn a linked list into a tree above a certain threshold size, so that at least performance degrades to O(log n) instead of O(n) in the worst case. The question is, what value should be chosen as the threshold? That's what this discussion is all about. The current threshold value TREEIFY_THRESHOLD is 8. Again, under these ideal assumptions, a bucket with a linked list of length 8 will occur only 0.000006% of the time. So if we get a linked list that long, something is clearly not ideal!! It may mean, for instance, that the objects being stored have exceptionally bad hashCodes, so the HashMap has to switch from a linked list to a tree in order to avoid excessive performance degradation.
The link to the source file with the comment in question is here: