ilias - 8 months ago 44

Python Question

I have the following minimal code, which is too slow. For the 1000 rows I need, it takes about 2 min. I need it to run faster.

```
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,1000,size=(1000, 4)), columns=list('ABCD'))

start_algorithm = time.time()

myunique = df['D'].unique()

for i in myunique:
    itemp = df[df['D'] == i]
    for j in myunique:
        jtemp = df[df['D'] == j]
```

I know that numpy can make it run much faster but keep in mind that I want to keep a part of the original dataframe (or array in numpy) for specific values of column 'D'. How can I improve its performance?

Answer

Avoid computing the sub-DataFrame `df[df['D'] == i]` more than once. The original code computes this `len(myunique)**2` times. Instead you can compute this once for each `i` (that is, `len(myunique)` times in total), store the results, and then pair them together later. For example,

```
# Split df once: one sub-DataFrame per distinct value of D
# (assumes `import itertools as IT`).
groups = [grp for di, grp in df.groupby('D')]
for itemp, jtemp in IT.product(groups, repeat=2):
    pass
```
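If you also need to know which value of `D` each sub-DataFrame belongs to, one option is to store the groups in a dict keyed by that value. The following is a minimal sketch, not part of the original answer: the small random frame and the `pair_count` counter are only there for illustration, and the `assert` checks that each stored group equals what the boolean mask would have returned.

```python
import itertools as IT

import numpy as np
import pandas as pd

# Small illustrative frame (assumption: any frame with a column 'D' works).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(50, 4)), columns=list("ABCD"))

# One pass over the frame: map each distinct value of D to its sub-DataFrame.
groups = {key: grp for key, grp in df.groupby("D")}

# Each stored group matches the result of the original boolean mask.
for key, grp in groups.items():
    assert grp.equals(df[df["D"] == key])

# Pair the groups by key, mirroring the nested loops over myunique.
pair_count = 0  # illustrative counter only
for (i, itemp), (j, jtemp) in IT.product(groups.items(), repeat=2):
    pair_count += 1
```

Here `i` and `j` play the same role as in the nested loops, while `itemp` and `jtemp` are looked up rather than recomputed.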

```
import numpy as np
import pandas as pd
import itertools as IT

df = pd.DataFrame(np.random.randint(0,1000,size=(1000, 4)), columns=list('ABCD'))

def using_orig():
    # Recomputes the boolean mask inside the nested loops.
    myunique = df['D'].unique()
    for i in myunique:
        itemp = df[df['D'] == i]
        for j in myunique:
            jtemp = df[df['D'] == j]

def using_groupby():
    # Builds each sub-DataFrame once, then pairs them.
    groups = [grp for di, grp in df.groupby('D')]
    for itemp, jtemp in IT.product(groups, repeat=2):
        pass
```

```
In [28]: %timeit using_groupby()
10 loops, best of 3: 63.8 ms per loop
In [31]: %timeit using_orig()
1 loop, best of 3: 2min 22s per loop
```
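To see where the gap comes from, it can help to count how many sub-DataFrames each approach builds. The sketch below is an illustration rather than a benchmark. The names `n_masks_orig` and `n_groups` are made up for this example, and the exact counts depend on the random data. With 1000 draws from 1000 possible values you would expect roughly 630 distinct values of `D`, so the nested loops perform on the order of 400,000 boolean-mask selections.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1000, size=(1000, 4)), columns=list("ABCD"))

n_unique = df["D"].nunique()

# Nested loops: one mask per outer iteration plus one per inner iteration.
n_masks_orig = n_unique + n_unique ** 2

# groupby: each sub-DataFrame is built exactly once.
n_groups = df.groupby("D").ngroups

print(n_unique, n_masks_orig, n_groups)
```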

Regarding the comment:

> I can easily replace itemp and jtemp with a=1 or print "Hello" so ignore that

The answer above addresses how to compute `itemp` and `jtemp` more efficiently. If `itemp` and `jtemp` are not central to your real calculation, then we would need to better understand *what you really want to compute* in order to suggest a way to compute it faster.