poPYtheSailor - 5 months ago
Python Question

For loop for finding concordance takes a very long time on large data (14+ hrs for 0.15 million * 36k rows)

I am running this code in Python 3.5 to compute the concordance of a logistic regression model.

for i in (ones2.index):
    for j in (zeros2.index):
        pairs_tested = pairs_tested+1
        if(ones2.iloc[i,1] > zeros2.iloc[j,1]):
            conc = conc+1
        elif(ones2.iloc[i,1]==zeros2.iloc[j,1]):
            ties = ties+1
        else:
            disc = disc+1

# Calculate concordance, discordance and ties
concordance = conc/pairs_tested
discordance = disc/pairs_tested
ties_perc = ties/pairs_tested

print("Concordance = %r" % concordance)
print("Discordance = %r" % discordance)
print("Tied = %r" % ties_perc)
print("Pairs = %r" % pairs_tested)


zeros2 (a pandas DataFrame) has 0.15 million rows and ones2 (also a pandas DataFrame) has 36k rows. Both tables have two columns:

[i] Responders (Responder0 = 0 in zeros2 and Responders1 = 1 in ones2).

[ii] Probabilities (prob0 in zeros2 and prob1 in ones2).

My question: the for loop has been running for 12 hours and is still not finished at the time of writing. How can I perform this operation faster? I am running this on a 64-bit Windows machine with 8 GB of RAM.

Answer

Your code performs 5.4 billion comparisons because of the two nested for loops (0.15 million * 36k).
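To put that number in perspective, here is a rough back-of-the-envelope comparison between the nested loops and a sort-then-binary-search approach. The row counts are the ones quoted in the question; the step counts for the sorted approach are only order-of-magnitude estimates, not exact operation counts.

```python
import math

n_zeros = 150000  # rows in zeros2, as stated in the question
n_ones = 36000    # rows in ones2, as stated in the question

# Nested loops: one comparison per (one, zero) pair.
pairwise_comparisons = n_zeros * n_ones
print(pairwise_comparisons)  # 5400000000

# Sort the zeros once, then do two binary searches per one.
# Rough estimate: n*log2(n) for the sort, log2(n) per search.
sort_steps = n_zeros * math.log2(n_zeros)
search_steps = 2 * n_ones * math.log2(n_zeros)
sorted_estimate = round(sort_steps + search_steps)
print(sorted_estimate)
```

The second figure comes out in the millions rather than the billions, which is why sorting one table first pays off.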

I would sort one of the tables once and use binary search for the lookups, something like this (thanks to @Leon for helping me make this answer better):

from bisect import bisect_left, bisect_right

# Sort the zeros' probabilities once so that each lookup is a binary search.
zeros2_list = sorted([zeros2.iloc[j,1] for j in zeros2.index])
zeros2_length = len(zeros2_list)

conc = disc = ties = 0

for i in ones2.index:
    # Zeros with a strictly smaller probability than this one: concordant pairs.
    cur_conc = bisect_left(zeros2_list, ones2.iloc[i,1])
    # Zeros with exactly the same probability: tied pairs.
    cur_ties = bisect_right(zeros2_list, ones2.iloc[i,1]) - cur_conc
    conc += cur_conc
    ties += cur_ties
    # All remaining zeros have a larger probability: discordant pairs.
    disc += zeros2_length - cur_ties - cur_conc

pairs_tested = zeros2_length * len(ones2.index)

concordance = conc/pairs_tested
discordance = disc/pairs_tested
ties_perc = ties/pairs_tested

print("Concordance = %r" % concordance)
print("Discordance = %r" % discordance)
print("Tied = %r" % ties_perc)
print("Pairs = %r" % pairs_tested)
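As a sanity check, the bisect-based counting can be compared with the nested loops from the question on a few made-up probabilities. This is only a minimal sketch: the function names `count_pairs_bisect` and `count_pairs_brute` and the sample values are hypothetical, and plain lists stand in for the DataFrame columns.

```python
from bisect import bisect_left, bisect_right

def count_pairs_bisect(ones_probs, zeros_probs):
    """Return (conc, disc, ties) using one sort plus binary searches."""
    zeros_sorted = sorted(zeros_probs)
    conc = disc = ties = 0
    for p in ones_probs:
        below = bisect_left(zeros_sorted, p)           # zeros with prob < p
        equal = bisect_right(zeros_sorted, p) - below  # zeros with prob == p
        conc += below
        ties += equal
        disc += len(zeros_sorted) - below - equal      # zeros with prob > p
    return conc, disc, ties

def count_pairs_brute(ones_probs, zeros_probs):
    """Return (conc, disc, ties) by comparing every pair, as in the question."""
    conc = disc = ties = 0
    for p in ones_probs:
        for q in zeros_probs:
            if p > q:
                conc += 1
            elif p == q:
                ties += 1
            else:
                disc += 1
    return conc, disc, ties

# Made-up sample probabilities, for illustration only.
ones_probs = [0.9, 0.6, 0.4]
zeros_probs = [0.1, 0.4, 0.4, 0.7]

print(count_pairs_bisect(ones_probs, zeros_probs))  # (8, 2, 2)
print(count_pairs_brute(ones_probs, zeros_probs))   # (8, 2, 2)
```

Both functions follow the definition in the question: a pair is concordant when the probability of the one is greater than the probability of the zero.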

Or the other way round, sorting the ones and looping over the zeros:

from bisect import bisect_left, bisect_right

zeros2_list = sorted([zeros2.iloc[j,1] for j in zeros2.index])
ones2_list = sorted([ones2.iloc[i,1] for i in ones2.index])
zeros2_length = len(zeros2_list)
ones2_length = len(ones2_list)

conc = disc = ties = 0

for i in zeros2.index:
    # Ones with a strictly smaller probability than this zero: discordant pairs.
    cur_disc = bisect_left(ones2_list, zeros2.iloc[i,1])
    # Ones with exactly the same probability: tied pairs.
    cur_ties = bisect_right(ones2_list, zeros2.iloc[i,1]) - cur_disc
    disc += cur_disc
    ties += cur_ties
    # All remaining ones have a larger probability: concordant pairs.
    conc += ones2_length - cur_ties - cur_disc

# We could also achieve the above like this too:
# for i in zeros2_list:
#    cur_disc = bisect_left(ones2_list, i)
#    cur_ties = bisect_right(ones2_list, i) - cur_disc
#    disc += cur_disc
#    ties += cur_ties
#    conc += ones2_length - cur_ties - cur_disc

pairs_tested = zeros2_length * ones2_length

concordance = conc/pairs_tested
discordance = disc/pairs_tested
ties_perc = ties/pairs_tested

print("Concordance = %r" % concordance)
print("Discordance = %r" % discordance)
print("Tied = %r" % ties_perc)
print("Pairs = %r" % pairs_tested)
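If NumPy is available (it is installed alongside pandas), the Python-level loop can probably be dropped altogether with `np.searchsorted`, which does the same binary searches for a whole array at once. This is a sketch of a swapped-in alternative, not part of the approach above; the function name `concordance_counts` is hypothetical, and it assumes the probabilities contain no NaN values.

```python
import numpy as np

def concordance_counts(ones_probs, zeros_probs):
    """Return (conc, disc, ties) over all (one, zero) pairs."""
    zeros_sorted = np.sort(np.asarray(zeros_probs, dtype=float))
    ones_arr = np.asarray(ones_probs, dtype=float)

    # For each one: number of zeros with a strictly smaller probability.
    below = np.searchsorted(zeros_sorted, ones_arr, side="left")
    # For each one: number of zeros with a smaller or equal probability.
    below_or_equal = np.searchsorted(zeros_sorted, ones_arr, side="right")

    conc = int(below.sum())
    ties = int((below_or_equal - below).sum())
    # Every remaining pair has the zero scoring higher than the one.
    disc = ones_arr.size * zeros_sorted.size - conc - ties
    return conc, disc, ties

# Made-up sample probabilities, for illustration only.
conc, disc, ties = concordance_counts([0.9, 0.6, 0.4], [0.1, 0.4, 0.4, 0.7])
pairs_tested = conc + disc + ties
print("Concordance = %r" % (conc / pairs_tested))
print("Discordance = %r" % (disc / pairs_tested))
print("Tied = %r" % (ties / pairs_tested))
```

With the DataFrames from the question, the call would presumably look like `concordance_counts(ones2.iloc[:, 1], zeros2.iloc[:, 1])`, assuming the probabilities sit in the second column as in the code above.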