Kelvin Kelvin - 8 months ago 32
Python Question

Hashing a generator expression

Apparently python will allow me to hash a generator expression like

(i for i in [1, 2, 3, 4, 5])

>>> hash(i for i in [1, 2, 3, 4, 5])

On closer inspection however, this hash value is always the same no matter what generator I put into it!

>>> hash(i for i in range(2))
>>> hash(i for i in [1, 2, 3, 4, 5, 6])
>>> hash(i for i in [0, 1, 2, 3, 4, 5, 6])

why is this happening? Why is hashing a generator even allowed?

I need to do this because I’m storing stuff in a dictionary cache. You access entries in the cache by passing lists of objects. But certain groups of lists with different items should still point to the same entry, since the only thing that matters here is the integer value of one attribute in each list item. So they should hash only based on those integer values, not anything relating to the items themselves, to avoid unnecessary cache misses.

I'm aware you can always just convert the expression to a tuple, but I was asking if you can bypass that by simply using the output of the generator without the tuple container, similar to how
could be used for such a thing.


So there are two questions here actually:

  1. Why do generators have a hash, when e.g. lists do not?
  2. Why do you always get the same hash?

For 2, the answer is simple: In this case, hash is based on the object's id. Since you don't actually store the object, its memory gets reused. That means the next generator has the same id and thus hash.

For 1, the answer is "because they can". hash is primarily meant for use in dict, set and other situations where it allows identifying an object. These situations set the constraint that a == b also implies hash(a) == hash(b) (the reverse is not constrained).

Now, for list, dict and other collections, equality is based on content. [1,2,3] == [1,2,3] regardless whether both are the same objects, for example. This means if something is added to them, their equality changes and thus their hash would change as well. Thus, hash is undefined, as it must be a constant for it to work in dict etc.

In contrast, a generator can have any content. Consider for example a generator providing random values. Thus, it makes no sense to compare generators by content. They are only ever compared by identity. So, a == b equals id(a) == id(b) for generators. In turn, this means basing hash(a) on id(a) will always satisfy the constraint by equality on hash.