Khris Khris - 3 months ago 8
Python Question

Is there a faster way of doing full row comparisons on a small pandas dataframe than using loops and iloc?

I have a large number of small pandas dataframes on which I have to do full row comparisons and write the results into new dataframes which will get concatenated later.

For the row comparisons I'm doing a double loop over the length of the dataframe using iloc. I don't know if there is a faster way, the way I'm doing it seems really slow:

# -*- coding: utf-8 -*-
import pandas as pd
import time

def processFrames1(DF):
LL = []
for i in range(len(DF)):
for j in range(len(DF)):
if DF.iloc[i][0] != DF.iloc[j][0]:
T = {u'T1':DF.iloc[i][0]}
T[u'T2'] = DF.iloc[j][0]
T[u'T3'] = 1
if DF.iloc[i][2] > DF.iloc[j][2]:
T[u'T4'] = 1
elif DF.iloc[i][2] < DF.iloc[j][2]:
T[u'T4'] = -1
else:
T[u'T4'] = 0
if DF.iloc[i][1] < DF.iloc[j][1]:
T[u'T5'] = 1
else:
T[u'T5'] = -1
LL.append(T)
return pd.DataFrame.from_dict(LL)

D = [{'A':'XA','B':1,'C':1.4}\
,{'A':'RT','B':2,'C':10}\
,{'A':'HO','B':3,'C':34}\
,{'A':'NJ','B':4,'C':0.41}\
,{'A':'WF','B':5,'C':114}\
,{'A':'DV','B':6,'C':74}\
,{'A':'KP','B':7,'C':2.4}]

P = pd.DataFrame.from_dict(D)
time0 = time.time()
for i in range(10):
X = processFrames1(P)
print time.time()-time0
print X


Yielding the result:

0.836999893188
T1 T2 T3 T4 T5
0 XA RT 1 -1 1
1 XA HO 1 -1 1
2 XA NJ 1 1 1
3 XA WF 1 -1 1
4 XA DV 1 -1 1
5 XA KP 1 -1 1
6 RT XA 1 1 -1
7 RT HO 1 -1 1
8 RT NJ 1 1 1
9 RT WF 1 -1 1
10 RT DV 1 -1 1
11 RT KP 1 1 1
12 HO XA 1 1 -1
13 HO RT 1 1 -1
14 HO NJ 1 1 1
15 HO WF 1 -1 1
16 HO DV 1 -1 1
17 HO KP 1 1 1
18 NJ XA 1 -1 -1
19 NJ RT 1 -1 -1
20 NJ HO 1 -1 -1
21 NJ WF 1 -1 1
22 NJ DV 1 -1 1
23 NJ KP 1 -1 1
24 WF XA 1 1 -1
25 WF RT 1 1 -1
26 WF HO 1 1 -1
27 WF NJ 1 1 -1
28 WF DV 1 1 1
29 WF KP 1 1 1
30 DV XA 1 1 -1
31 DV RT 1 1 -1
32 DV HO 1 1 -1
33 DV NJ 1 1 -1
34 DV WF 1 -1 -1
35 DV KP 1 1 1
36 KP XA 1 1 -1
37 KP RT 1 -1 -1
38 KP HO 1 -1 -1
39 KP NJ 1 1 -1
40 KP WF 1 -1 -1
41 KP DV 1 -1 -1


Working this representative dataframe just 10 times takes almost a full second, and I will have to work with over a million.

Is there a faster way to do those full row comparisons?

EDIT1:
After some modifications I could make Javier's code create the correct output:

def compare_values1(x,y):
if x>y: return 1
elif x<y: return -1
else: return 0

def compare_values2(x,y):
if x<y: return 1
elif x>y: return -1
else: return 0

def processFrames(P):
D = P.to_dict(orient='records')
d_A2B = {d["A"]:d["B"] for d in D}
d_A2C = {d["A"]:d["C"] for d in D}
keys = list(d_A2B.keys())
LL = []
for i in range(len(keys)):
k_i = keys[i]
for j in range(len(keys)):
if i != j:
k_j = keys[j]
LL.append([k_i,k_j,1,compare_values1(\
d_A2C[k_i],d_A2C[k_j]),compare_values2(d_A2B[k_i],d_A2B[k_j])])
return pd.DataFrame(LL,columns=['T1','T2','T3','T4','T5'])


This function works about 60 times faster.

EDIT2:
Final verdict of the four possibilities:

=============== With the small dataframe:

My original function:

%timeit processFrames1(P)
10 loops, best of 3: 85.3 ms per loop


jezrael's solution:

%timeit processFrames2(P)
1 loop, best of 3: 286 ms per loop


Javier's modified code:

%timeit processFrames3(P)
1000 loops, best of 3: 1.24 ms per loop


Divakar's method:

%timeit processFrames4(P)
1000 loops, best of 3: 1.98 ms per loop


=============== For the large dataframe:

My original function:

%timeit processFrames1(P)
1 loop, best of 3: 2.22 s per loop


jezrael's solution:

%timeit processFrames2(P)
1 loop, best of 3: 295 ms per loop


Javier's modified code:

%timeit processFrames3(P)
100 loops, best of 3: 3.13 ms per loop


Divakar's method:

%timeit processFrames4(P)
100 loops, best of 3: 2.19 ms per loop


So it's pretty much a tie between the last two. Thanks to everyone for helping, that speedup was much needed.

Answer

Don't use pandas. Use dictionaries and save it:

def compare_values(x,y):
  if x>y: return 1
  elif x<y: return -1
  else: return 0

def processFrames(P):
  d_A2B = dict(zip(P["A"],P["B"]))
  d_A2C = dict(zip(P["A"],P["C"]))

  keys = list(d_A2B.keys())
  d_ind2key = dict(zip(range(len(keys)),keys))
  LL = []
  for i in range(len(keys)):
    k_i = keys[i]
    for j in range(i+1,len(keys)):
        k_j = keys[j]
        c1 = compare_values(d_A2C[k_i],d_A2C[k_j])
        c2 = -compare_values(d_A2B[k_i],d_A2B[k_j])
        LL.append([k_i,k_j,1,c1,c2])
        LL.append([k_j,k_i,1,-c1,-c2])
  return pd.DataFrame(LL,columns=['T1','T2','T3','T4','T5'])