Manoel Ribeiro - 1 month ago
Python Question

Pandas parallel groupBy consumes tons of memory

I have a medium-sized file (~300MB) containing a list of individuals (~300k) and the actions they performed. I'm trying to apply an operation to each individual using `pandas.DataFrame.groupby` and the parallelized version of `apply` described here. It looks something like this:

import multiprocessing

import pandas as pd
from joblib import Parallel, delayed

def applyParallel(dfGrouped, func):
    # Apply func to every group in parallel, then stitch the results back together.
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(
        delayed(func)(group) for name, group in dfGrouped
    )
    return pd.concat(retLst)

df = pd.read_csv(src)
patients_table_raw = applyParallel(df.groupby('ID'), f)

But unfortunately this consumes a huge amount of memory. I think it is related to the fact that the simple command:

list_groups = list(df.groupby('ID'))

consumes several GB of memory! How should I proceed? My initial thought was to iterate over the groupby in small 'batches', so as not to consume too much memory, but I didn't find a way to do so without casting it to a list.
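One way to do that batching without materializing the whole groupby as a list is to split the distinct IDs into chunks and slice the frame one chunk at a time; a minimal sketch (`iter_group_batches` and `batch_size` are hypothetical names, not part of pandas):

```python
import pandas as pd

def iter_group_batches(df, key, batch_size):
    """Yield sub-DataFrames covering at most `batch_size` distinct key
    values each, so only one batch of groups lives in memory at a time."""
    keys = df[key].unique()
    for start in range(0, len(keys), batch_size):
        batch_keys = keys[start:start + batch_size]
        yield df[df[key].isin(batch_keys)]

# Toy data shaped like the CSV described below.
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'Timestamp': [0, 10, 20, 0, 15],
                   'Action': ['A', 'B', 'C', 'B', 'C']})

batches = list(iter_group_batches(df, 'ID', batch_size=1))
```

Each yielded sub-frame can then be grouped and processed (in parallel or not) while the previous batch is freed.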

More detailed context

I have a simple CSV dataset in the following fashion:

| ID | Timestamp | Action |
|----|-----------|--------|
| 1  | 0         | A      |
| 1  | 10        | B      |
| 1  | 20        | C      |
| 2  | 0         | B      |
| 2  | 15        | C      |

What I'm basically trying to do is create a different table that contains a description of the sequence of actions/timestamps for each individual, keyed by ID. This will help me retrieve the individuals later:

| ID | Description |
|----|-------------|
| 1  | 0A10B20C    |
| 2  | 0B15C       |

In order to do so, and to follow a Pythonic way, my idea was basically to load the first table into a pandas DataFrame, groupBy the ID, and apply a function to each group that returns one row of the table I want (one per ID). However, I have LOTS of individuals in my dataset (around 1 million), and the groupBy operation was extremely expensive (without explicit garbage collection, as I mentioned in my own answer). Also, parallelizing the groupBy led to significant memory use, because apparently some things get duplicated.
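For concreteness, the per-group function `f` hinted at above could look something like this (a sketch under my assumptions about the column names; the real `f` isn't shown in the question):

```python
import pandas as pd

def f(group):
    # One output row per individual: concatenate "timestampAction"
    # fragments in row order, e.g. 0A + 10B + 20C -> "0A10B20C".
    desc = ''.join(group['Timestamp'].astype(str) + group['Action'])
    return pd.DataFrame({'ID': [group['ID'].iloc[0]], 'Description': [desc]})

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'Timestamp': [0, 10, 20, 0, 15],
                   'Action': ['A', 'B', 'C', 'B', 'C']})

row = f(df[df['ID'] == 1])
```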

Therefore, the more detailed question is: how can I use groupBy (and thereby make the data processing faster than a hand-rolled loop) without incurring this huge memory overhead?


Try this (without parallelization):

In [87]: df
   ID  Timestamp Action
0   1          0      A
1   1         10      B
2   1         20      C
3   2          0      B
4   2         15      C

In [88]: df.set_index('ID').astype(str).sum(axis=1).groupby(level=0).sum().to_frame('Description').reset_index()
   ID Description
0   1    0A10B20C
1   2       0B15C
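For reference, an equivalent formulation of the same idea that builds the per-row fragments explicitly and joins them per group (column names taken from the sample data above):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'Timestamp': [0, 10, 20, 0, 15],
                   'Action': ['A', 'B', 'C', 'B', 'C']})

# Build each row's "timestampAction" fragment, then join the fragments per ID.
fragments = df['Timestamp'].astype(str) + df['Action']
result = (fragments.groupby(df['ID'])
                   .agg(''.join)
                   .to_frame('Description')
                   .reset_index())
```

This stays entirely inside vectorized pandas operations, so it avoids both the per-group Python function calls and the memory duplication of the parallel approach.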