
Memory Usage, Filling Pandas DataFrame using Dict vs using key and value Lists

I am making a package that reads a binary file and returns data that can be used to initialize a DataFrame. I am now wondering whether it is best to return a dict or two lists (one that holds the keys and one that holds the values).

The package I am making is not supposed to be entirely reliant on a DataFrame object, which is why it currently outputs the data as a dict (for easy access). If there could be some memory and speed savings (which is paramount for my application, as I am dealing with millions of data points), I would like to output the key and value lists instead. These iterables would then be used to initialize a DataFrame.

Here is a simple example:

In [1]: d = {(1,1,1): '111',
...: (2,2,2): '222',
...: (3,3,3): '333',
...: (4,4,4): '444'}

In [2]: keyslist=[(1,1,1),(2,2,2),(3,3,3),(4,4,4)]

In [3]: valslist=['111','222','333','444']

In [4]: import pandas as pd

In [5]: dfdict=pd.DataFrame(d.values(), index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))

In [6]: dfdict
Out[6]:
         0
a b c
3 3 3  333
2 2 2  222
1 1 1  111
4 4 4  444

In [7]: dflist=pd.DataFrame(valslist, index=pd.MultiIndex.from_tuples(keyslist, names=['a','b','c']))

In [8]: dflist
Out[8]:
         0
a b c
1 1 1  111
2 2 2  222
3 3 3  333
4 4 4  444


It is my understanding that d.values() and d.keys() create a new copy of the data. If we disregard the fact that a dict takes more memory than a list, does using d.values() and d.keys() lead to more memory usage than the list-pair implementation?
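
One rough way to compare just the container overhead is sys.getsizeof; note that this measures only the dict and list structures themselves, not the key and value objects they share, so it is a quick sanity check rather than a substitute for profiling the full DataFrame construction:

import sys

d = {(i, i, i): str(i) for i in range(1000000)}
keyslist = list(d.keys())
valslist = list(d.values())

# Container sizes only; the tuple keys and string values are shared objects
print(sys.getsizeof(d))                                   # dict hash table
print(sys.getsizeof(keyslist) + sys.getsizeof(valslist))  # the two lists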

Answer

I profiled memory usage with 1M rows. The winning structure is to use array.array for every numerical index and a list for the strings (147 MB for the data and 310 MB peak during the conversion to pandas).

According to the Python manual:

Arrays are sequence types and behave very much like lists, except that the type of objects stored in them is constrained.

They even have an append method, and appending to them is most likely very fast.

Second place goes to two separate lists (308 MB for the data and 460 MB peak).

The other two options, using a dict and using a list of 4-tuples, were the worst. Dict: 339 MB and 524 MB; list of 4-tuples: 308 MB and 514 MB.

Here is the use of array.array:

In [1]: from array import array
In [2]: import gc
In [3]: import pandas as pd
In [4]: %load_ext memory_profiler
In [5]: a1=array("l",range(1000000))
In [6]: a2=array("l",range(1000000))
In [7]: a3=array("l",range(1000000))
In [8]: b=[str(x*111) for x in list(range(1000000))]
In [9]: gc.collect()
Out[9]: 0
In [10]: %memit a1,a2,a3,b
peak memory: 147.64 MiB, increment: 0.32 MiB
In [11]: %memit dfpair=pd.DataFrame(b,  index=pd.MultiIndex.from_arrays([a1,a2,a3], names=['a','b','c']))
peak memory: 310.60 MiB, increment: 162.91 MiB
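
Since arrays support append, a package that parses a binary file record by record could fill these columns incrementally and hand them straight to pandas. Here is a minimal sketch of that idea; the read_binary name and the record layout are made up for illustration:

from array import array
import struct

def read_binary(path):
    # Hypothetical record layout: three 4-byte signed ints followed by an 8-byte string
    rec = struct.Struct("<lll8s")
    a1, a2, a3, vals = array("l"), array("l"), array("l"), []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(rec.size)
            if len(chunk) < rec.size:
                break
            x, y, z, s = rec.unpack(chunk)
            a1.append(x)
            a2.append(y)
            a3.append(z)
            vals.append(s.rstrip(b"\x00").decode("ascii"))
    return a1, a2, a3, vals

# The caller can then build the DataFrame exactly as in the %memit call above:
# a1, a2, a3, vals = read_binary("data.bin")
# df = pd.DataFrame(vals, index=pd.MultiIndex.from_arrays([a1, a2, a3], names=['a', 'b', 'c']))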

Here is the rest of the code (very long):

List of tuples of four:

In [1]: import gc
In [2]: import pandas as pd
In [3]: %load_ext memory_profiler
In [4]: a=list(zip(list(range(1000000)),list(range(1000000)),list(range(1000000))))
In [5]: b=[str(x*111) for x in list(range(1000000))]
In [6]: d2=[x+(b[i],) for i,x in enumerate(a)]
In [7]: del a
In [8]: del b
In [9]: gc.collect()
Out[9]: 0
In [10]: %memit d2
peak memory: 308.40 MiB, increment: 0.28 MiB
In [11]: %memit df = pd.DataFrame(d2, columns=['a','b','c','d']).set_index(['a','b','c'])
peak memory: 514.21 MiB, increment: 205.80 MiB

Dictionary:

In [1]: import gc
In [2]: import pandas as pd
In [3]: %load_ext memory_profiler
In [4]: a=list(zip(list(range(1000000)),list(range(1000000)),list(range(1000000))))
In [5]: b=[str(x*111) for x in list(range(1000000))]
In [6]: d = dict(zip(a, b))
In [7]: del a
In [8]: del b
In [9]: gc.collect()
Out[9]: 0
In [10]: %memit d
peak memory: 339.14 MiB, increment: 0.23 MiB
In [11]: %memit dfdict=pd.DataFrame(list(d.values()),  index=pd.MultiIndex.from_tuples(d.keys(), names=['a','b','c']))
peak memory: 524.10 MiB, increment: 184.95 MiB

Two separate lists:

In [1]: import gc
In [2]: import pandas as pd
In [3]: %load_ext memory_profiler
In [4]: a=list(zip(list(range(1000000)),list(range(1000000)),list(range(1000000))))
In [5]: b=[str(x*111) for x in list(range(1000000))]
In [6]: gc.collect()
Out[6]: 0
In [7]: %memit a,b
peak memory: 307.75 MiB, increment: 0.19 MiB
In [8]: %memit dfpair=pd.DataFrame(b,  index=pd.MultiIndex.from_tuples(a, names=['a','b','c']))
peak memory: 459.94 MiB, increment: 152.19 MiB