hallizh - 7 months ago
Python Question

File data to array is using a lot of memory

I'm reading a large text file of tab-separated values and adding each line's values to an array.

When I run my code on a 32 MB file, Python's memory consumption goes through the roof, using around 500 MB of RAM.

I need to be able to run this code for a 2 GB file, and possibly even larger files.

My current code is:

markers = []

def parseZeroIndex():
    with open('chromosomedata') as zeroIndexes:
        for line in zeroIndexes:
            markers.append(line.split('\t'))

Running this code against my 2 GB file is not possible as is. The files look like this:

per1 1029292 string1 euqye
per1 1029292 string2 euqys

My questions are:

What is using all this memory?

What is a more efficient way to do this memory wise?


"What is using all this memory?"

There's overhead for Python objects. See how many bytes some strings actually take:

Python 2:

>>> import sys
>>> map(sys.getsizeof, ('', 'a', u'ä'))
[21, 22, 28]

Python 3:

>>> import sys
>>> list(map(sys.getsizeof, ('', 'a', 'ä')))
[25, 26, 38]
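To see how that overhead adds up for a single row of the file, here is a rough sketch for Python 3. The helper name `row_footprint` is made up for illustration, the sample line assumes the fields shown in the question are separated by tabs, and the exact byte counts depend on the Python version and build:

```python
import sys

def row_footprint(line):
    """Approximate bytes used by one parsed row: the list object plus each field string."""
    fields = line.rstrip('\n').split('\t')
    return sys.getsizeof(fields) + sum(sys.getsizeof(f) for f in fields)

# One line in the format shown in the question (tab-separated, assumed)
sample = 'per1\t1029292\tstring1\teuqye\n'

print(len(sample), 'characters in the file')
print(row_footprint(sample), 'bytes in memory (approximate)')
```

The in-memory figure should come out several times larger than the line itself, which is consistent with a 32 MB file growing to hundreds of MB.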

"What is a more efficient way to do this memory wise?"

In comments you said there are lots of duplicate values, so string interning (storing only one copy of each distinct string value) might help a lot. Try this:

Python 2:

            markers.append(map(intern, line.rstrip().split('\t')))

Python 3:

            markers.append(list(map(sys.intern, line.rstrip().split('\t'))))

Note I also used line.rstrip() to remove the trailing \n from the line.
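Put together, the whole parser might look like the sketch below (Python 3). The function name `parse_interned` and the `path` parameter are my own choices, not from the question, and I strip only the newline here so that trailing empty fields would survive:

```python
import sys

def parse_interned(path):
    """Read a tab-separated file into a list of rows, interning every field.

    Interning means equal strings share one object, so duplicate values
    cost only a reference per occurrence instead of a full string each.
    """
    markers = []
    with open(path) as f:
        for line in f:
            fields = line.rstrip('\n').split('\t')
            markers.append(list(map(sys.intern, fields)))
    return markers

# Usage (file name taken from the question):
# markers = parse_interned('chromosomedata')
```

Returning the list instead of appending to a global is just a style preference; the memory behaviour is the same.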


I tried

>>> x = [str(i % 1000) for i in range(10**7)]

and

>>> import sys
>>> x = [sys.intern(str(i % 1000)) for i in range(10**7)]

in Python 3. The first one takes 355 MB (looking at the process in Windows Task Manager). The second one takes only 47 MB. Furthermore:

>>> sys.getsizeof(x)
>>> sum(map(sys.getsizeof, x[:1000]))

So 40 MB is for the list referencing the strings (no surprise, as there are ten million references of four bytes each on a 32-bit build), and the strings themselves total only 27 KB.
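If you'd rather measure this from inside Python than watch Task Manager, something like the following sketch should work in Python 3. It uses the standard `tracemalloc` module, which is a different measurement from the process size above, so the numbers won't match mine. The helper `peak_bytes` is a made-up name, and I use a smaller `n` so it runs quickly:

```python
import sys
import tracemalloc

def peak_bytes(build):
    """Return the peak traced memory (in bytes) while build() runs."""
    tracemalloc.start()
    data = build()                      # keep the result alive while measuring
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del data
    return peak

n = 10**5  # smaller than the 10**7 used above, to keep the run short

plain = peak_bytes(lambda: [str(i % 1000) for i in range(n)])
interned = peak_bytes(lambda: [sys.intern(str(i % 1000)) for i in range(n)])

print('without interning:', plain, 'bytes')
print('with interning:   ', interned, 'bytes')
```

With only 1000 distinct values, the interned version should need little more than the list of references itself.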