soshial soshial - 1 year ago 78
Python Question

Python memory leak in big data structures (lists, dicts) -- what could be the reason?

The code is extremely simple. It shouldn't leak, since everything is done inside the function and nothing is returned.
I have a function which goes over all lines in a file (~20 MiB) and puts them all into a list.

Mentioned function:

def read_art_file(filename, path_to_dir):
    import codecs
    corpus = []
    corpus_file = codecs.open(path_to_dir + filename, 'r', 'iso-8859-15')
    newline = corpus_file.readline().strip()
    while newline != '':
        # we put into @article a @newline of file and some other info
        # (i left those lists blank for readability)
        article = [newline, [], [], [], [], [], [], [], [], [], [], [], []]
        corpus.append(article)
        del newline
        del article
        newline = corpus_file.readline().strip()
    memory_usage('inside function')
    for article in corpus:
        for word in article:
            del word
        del article
    del corpus
    memory_usage('inside: after corp deleted')

Here is the main code:

path_to_dir = '/home/soshial/internship/training_data/parser_output/'
read_art_file('', path_to_dir)
memory_usage('outside func')

memory_usage just prints the amount of memory (in KiB) allocated by the script.

Executing the script

If I run the script, it gives me:

START memory: 6088 KiB

inside memory: 393752 KiB (20 MiB file + lists occupy 400 MiB)

inside: after corp deleted memory: 43360 KiB

outside func memory: 34300 KiB (34300-6088= 28 MiB leaked)

FINISH memory: 34300 KiB

Executing without lists

And if I do exactly the same thing, but with the line that appends article to corpus commented out:

article = [newline, [], [], [], [], [], ...] # we still assign data to `article`
# corpus.append(article) # we don't have this line during the second execution

This way output gives me:

START memory: 6076 KiB

inside memory: 6076 KiB

inside: after corp deleted memory: 6076 KiB

outside func memory: 6076 KiB

FINISH memory: 6076 KiB


Hence, this way all memory is being freed. I need to have all memory freed since I'm going to process hundreds of such files.

Am I doing something wrong, or is this a bug in the CPython interpreter?

UPD. This is how I check memory consumption (taken from some other stackoverflow question):

def memory_usage(text=''):
    """Memory usage of the current process in kilobytes."""
    status = None
    result = {'peak': 0, 'rss': 0}
    try:
        # This will only work on systems with a /proc file system
        # (like Linux).
        status = open('/proc/self/status')
        for line in status:
            parts = line.split()
            key = parts[0][2:-1].lower()
            if key in result:
                result[key] = int(parts[1])
    finally:
        if status is not None:
            status.close()
    print('>', text, 'memory:', result['rss'], 'KiB ')

Answer Source

Please note that Python never guarantees that any memory your code uses will actually be returned to the OS. All that garbage collection guarantees is that the memory used by a collected object is free to be reused by another object at some future time.
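One way to check whether the objects themselves were released, independently of what the OS reports, is the standard-library tracemalloc module, which only counts memory held by live Python allocations. The following is a minimal sketch, not the asker's code: build_corpus, traced_kib and the size of 20000 are made-up names and numbers for illustration, and the data only imitates the shape of the list in the question.

```python
import tracemalloc


def build_corpus(n_lines):
    """Build a list shaped like the question's corpus (hypothetical data)."""
    return [["line %d" % i] + [[] for _ in range(12)] for i in range(n_lines)]


def traced_kib():
    """KiB currently held by live Python allocations, per tracemalloc."""
    current, _peak = tracemalloc.get_traced_memory()
    return current // 1024


tracemalloc.start()
before = traced_kib()
corpus = build_corpus(20000)   # 20000 is an arbitrary illustrative size
during = traced_kib()
del corpus                     # drops the last reference to every article
after = traced_kib()
tracemalloc.stop()

print('before:', before, 'KiB')
print('during:', during, 'KiB')
print('after: ', after, 'KiB')
```

If the "after" figure falls back close to the "before" figure while the RSS reported in /proc/self/status stays high, the objects were collected and the remaining memory is being held by the allocator for reuse rather than leaked.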

From what I've read [1] about the CPython implementation of the memory allocator, memory gets allocated in "pools" for efficiency. When a pool is full, Python allocates a new pool. If a pool contains only dead objects, CPython actually frees the memory associated with that pool, but otherwise it doesn't. This can lead to multiple partially full pools hanging around after a function returns. However, this doesn't really mean it is a "memory leak": CPython still knows about the memory and could potentially free it at some later time.

[1] I'm not a Python dev, so these details are likely to be incorrect or at least incomplete.
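To make the point about partially full pools concrete, here is a toy model. It is not CPython's allocator: POOL_SLOTS and releasable_pools are hypothetical, and the real allocator uses different sizes and more than one level of grouping. It only assumes that objects fill pools in allocation order and that a pool can be handed back only when nothing in it is alive.

```python
POOL_SLOTS = 64  # hypothetical pool capacity, chosen only for illustration


def releasable_pools(n_objects, survivors):
    """Return (pools that could be released, total pools).

    Objects are placed into pools in allocation order. `survivors` is the
    set of object indices that are still alive after everything else died.
    """
    n_pools = -(-n_objects // POOL_SLOTS)  # ceiling division
    occupied = {index // POOL_SLOTS for index in survivors}
    return n_pools - len(occupied), n_pools


# Everything is dead: all 100 pools are empty and could be released.
print(releasable_pools(6400, set()))                      # (100, 100)

# One survivor per pool (100 of 6400 objects): no pool can be released.
print(releasable_pools(6400, set(range(0, 6400, 64))))    # (0, 100)
```

In this model a handful of long-lived objects scattered among short-lived ones is enough to keep every pool pinned, which matches the picture of memory that is free for reuse but not returned to the OS.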
