ilias ilias - 5 months ago 25
Python Question

Python: fast subsetting and looping dataframe

I have the folowing minimal code which is too slow. For the 1000 rows I need, it takes about 2 min. I need it to run faster.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,1000,size=(1000, 4)), columns=list('ABCD'))
start_algorithm = time.time()
myunique = df['D'].unique()
for i in myunique:
itemp = df[df['D'] == i]
for j in myunique:
jtemp = df[df['D'] == j]


I know that numpy can make it run much faster but keep in mind that I want to keep a part of the original dataframe (or array in numpy) for specific values of column 'D'. How can I improve its performance?

Answer

Avoid computing the sub-DataFrame df[df['D'] == i] more than once. The original code computes this len(myunique)**2 times. Instead you can compute this once for each i (that is, len(myunique) times in total), store the results, and then pair them together later. For example,

    groups = [grp for di, grp in df.groupby('D')]
    for itemp, jtemp in IT.product(groups, repeat=2):
        pass

import pandas as pd
import itertools as IT
df = pd.DataFrame(np.random.randint(0,1000,size=(1000, 4)), columns=list('ABCD'))

def using_orig():
    myunique = df['D'].unique()
    for i in myunique:
        itemp = df[df['D'] == i]
        for j in myunique:
            jtemp = df[df['D'] == j]

def using_groupby():
    groups = [grp for di, grp in df.groupby('D')]
    for itemp, jtemp in IT.product(groups, repeat=2):
        pass

In [28]: %timeit using_groupby()
10 loops, best of 3: 63.8 ms per loop
In [31]: %timeit using_orig()
1 loop, best of 3: 2min 22s per loop

Regarding the comment:

I can easily replace itemp and jtemp with a=1 or print "Hello" so ignore that

The answer above addresses how to compute itemp and jtemp more efficiently. If itemp and jtemp are not central to your real calculation, then we would need to better understand what you really want to compute in order to suggest a way to compute it faster.