cmishra cmishra - 14 days ago 5
Python Question

Persisting state when implementing MinHeap with heapq

I'm trying to implement a

StreamingMedian
object to maintain the median through successive calls of
get_median()
. To do so, I've also implemented a
MinHeap
and
MaxHeap
class via the
heapq
module.

I've run into a very odd bug though. For some reason when I run the following commands:

print("Before streaming medians", MinHeap(), sep="\t") # is empty

b = StreamingMedian()
b.add_val(5)
b.add_val(100)

assert b.get_median() == 52.5

print("After streaming medians, for MaxHeap", MaxHeap(), sep='\t') # is empty
print("After streaming medians, for MinHeap", MinHeap(), sep='\t') # should be empty
print("After streaming medians, for MinHeap with input",
MinHeap([]), sep="\t") # is empty


I get the following outputs:

Before streaming medians []
After streaming medians, for MaxHeap []
After streaming medians, for MinHeap [100]
After streaming medians, for MinHeap with input []


The class implementations can be found below. I'm running this on Python 3.5.2 :: Anaconda custom (64-bit).

import heapq

class MinHeap(object):

def __init__(self, l=[]):
self.heap = l
heapq.heapify(l)

def peek(self):
return self.heap[0]

def pop(self):
return heapq.heappop(self.heap)

def push(self, x):
heapq.heappush(self.heap, x)

def pushpop(self, x):
return heapq.heappushpop(self.heap, x)

def replace(self, x):
return heapq.heapreplace(self.heap, x)

def __len__(self):
return len(self.heap)

def __str__(self):
return str(self.heap)

class MaxHeap(MinHeap):

def _invert_sign(self, l):
return [-1 * a for a in l]

def __init__(self, l=[]):
super().__init__(self._invert_sign(l))

def push(self, x):
super().push(-1 * x)

def pushpop(self, x):
return super().pushpop(-1 * x)
def replace(self, x):
return super().replace(-1 * x)

def pop(self):
return -1 * super().pop()

def peek(self):
return -1 * super().peek()

def __str__(self):
return str(self._invert_sign(self.heap))


class StreamingMedian():

def __init__(self):
self.min_heap = MinHeap()
self.max_heap = MaxHeap()

def get_median(self):
min_heap_has_x_more = len(self.min_heap) - len(self.max_heap)
if min_heap_has_x_more > 0:
return self.min_heap.peek()
elif min_heap_has_x_more < 0:
return self.max_heap.peek()
else:
return (self.min_heap.peek() + self.max_heap.peek())/2

def add_val(self, x):
if len(self.min_heap) + len(self.max_heap) == 0:
self.max_heap.push(x)
else:
med = self.get_median()
if x > med:
self.min_heap.push(x)
self._ensure_balance()
elif x < med:
self.max_heap.push(x)
self._ensure_balance()
else:
self.max_heap.push(x)
self._ensure_balance()

def _ensure_balance(self):
size_diff = len(self.min_heap) - len(self.max_heap)
if abs(size_diff) > 1:
if size_diff > 0: # min_heap has 2 more elements
self.max_heap.push(self.min_heap.pop())
else: # max_heap has 2 more elements
self.min_heap.push(self.max_heap.pop())
assert abs(len(self.min_heap) - len(self.max_heap)) < 2

print("Before streaming medians", MinHeap(), sep="\t")

b = StreamingMedian()
b.add_val(5)
b.add_val(100)

assert b.get_median() == 52.5

print("After streaming medians, for MaxHeap", MaxHeap(), sep='\t') # is empty
print("After streaming medians, for MinHeap", MinHeap(), sep='\t') # should be empty
print("After streaming medians, for MinHeap with input", MinHeap([]), sep="\t") # is empty

Answer

The Problem

The problem was occurring because of how default arguments are calculated in python. They are computed the first time a function is called and afterwards the function is used with the originally calculated value. If the default value is mutable (as a list is) then this poses a problem.

So what was happening for the MinHeap was:

  1. MinHeap initially created and the l parameter default argument was assigned to a memory address.
  2. StreamingMedian modifies MinHeap's self.heap which is the same as what l points to.
  3. When MinHeap is called again, l already has a memory address and this memory address is used again and binded with self.heap.

This doesn't happen for MaxHeap because:

  1. MaxHeap initially created and the l parameter default argument was assigned to a memory address.
  2. _invert_sign creates a new list which is assigned to self.heap. The default argument l is never modified.
  3. When MaxHeap is initialized again, l already has a memory address and is used again, but it was never modified and another copy is constructed so it will never be modified.

The Solution

Instead of:

class MinHeap(object):

def __init__(self, l=[]):
    self.heap = l
    heapq.heapify(l)

We should use:

class MinHeap(object):

def __init__(self, l=None):
    if l is None:
        l = []
    self.heap = l
    heapq.heapify(l)

A similar change should be made for MaxHeap