
Interning immutable objects

I have a program with objects of the form

class MyObj(object):
    def __init__(self, a, b):
        self.__a = a
        self.__b = b
        self.cache = very_expensive_computation(a, b)


where a and b are immutable and the space of possible parameters (a, b) is very small. These objects are created and go out of scope constantly during the execution of the program, so unsurprisingly a lot of time is spent recomputing self.cache for the same values of a and b. Since very_expensive_computation is very expensive, it seems better to avoid garbage-collecting these objects and have the constructor return references to already existing instances where possible, much like string interning.

The obvious way to do this seems to be to add a dictionary to the class and override __new__ and __init__ so that they check the dictionary and return an already existing instance when possible. At the same time this feels unsatisfactory, since it would have to be done for each class separately, and detecting in __init__ whether you are a genuinely new object would probably be pretty hacky.
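
For concreteness, here's a rough sketch of the __new__-based interning I have in mind (names illustrative, not code I'm committed to):

class MyObj(object):
    _interned = {}  # (a, b) -> already constructed instance

    def __new__(cls, a, b):
        key = (a, b)
        if key in cls._interned:
            return cls._interned[key]
        self = super(MyObj, cls).__new__(cls)
        self.__a = a
        self.__b = b
        self.cache = very_expensive_computation(a, b)
        cls._interned[key] = self
        return self

    def __init__(self, a, b):
        # __init__ still runs on every MyObj(a, b) call, even for interned
        # instances, so all the real work has to live in __new__.
        pass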

Any other suggestions?

Answer

I'd memoize very_expensive_computation, storing the results in an LRU cache to guarantee an upper bound on the amount of memory used:

from unstdlib.standard.collections_ import RecentlyUsedContainer

# An LRU mapping: the least recently used entries are evicted once it
# holds 100 results.
_very_expensive_computation_cache = RecentlyUsedContainer(100)

def cached_very_expensive_computation(a, b):
    if (a, b) not in _very_expensive_computation_cache:
        _very_expensive_computation_cache[(a, b)] = very_expensive_computation(a, b)
    return _very_expensive_computation_cache[(a, b)]

Where the RecentlyUsedContainer could be this one: https://github.com/shazow/unstdlib.py/blob/master/unstdlib/standard/collections_.py#L12
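
MyObj would then call the cached version from its constructor; a minimal sketch, reusing the class from the question:

class MyObj(object):
    def __init__(self, a, b):
        self.__a = a
        self.__b = b
        # Repeated (a, b) pairs now hit the LRU cache instead of
        # recomputing the expensive result.
        self.cache = cached_very_expensive_computation(a, b)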

You could also simplify the code with a decorator:

from unstdlib.standard.functools_ import memoized

@memoized(cache=RecentlyUsedContainer(100))
def cached_very_expensive_computation(a, b):
    return very_expensive_computation(a, b)

See: https://github.com/shazow/unstdlib.py/blob/master/unstdlib/standard/functools_.py#L59

I prefer to keep the memoized versions of functions separate from the "real" versions so callers can explicitly see that they could be getting a cached result, and it can make testing easier. This is largely personal preference, though.

Edit: as pointed out in the comments, Python 3 ships with functools.lru_cache.
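
For completeness, the standard-library version would look something like this (Python 3.2+; a and b must be hashable, which they are here since they're immutable):

import functools

@functools.lru_cache(maxsize=100)
def cached_very_expensive_computation(a, b):
    # functools.lru_cache keeps at most the 100 most recently used results.
    return very_expensive_computation(a, b)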