ajsp ajsp - 5 months ago 15
Python Question

Find rank and percentage rank in list

I have some very large lists that I am working with (>1M rows), and I am trying to find a fast (the fastest?) way of, given a float, ranking that float compared to the list of floats, and finding it's percentage rank compared to the range of the list. Here is my attempt, but it's extremely slow:

X =[0.595068426145485,
0.613726840488019,
1.1532608695652,
1.92952380952385,
4.44137931034496,
3.46432160804035,
2.20331487122673,
2.54736842105265,
3.57702702702689,
1.93202764976956,
1.34720184204056,
0.824997304105564,
0.765782842381996,
0.615110856990126,
0.622708022872803,
1.03211045820975,
0.997225012974318,
0.496352327702226,
0.67103858866700,
0.452224068868272,
0.441842124852685,
0.447584524952608,
0.4645525042246]

val = 1.5
arr = np.array(X) #X is actually a pandas column, hence the conversion
arr = np.insert(arr,1,val, axis=None) #insert the val into arr, to then be ranked
st = np.sort(arr)

RANK = float([i for i,k in enumerate(st) if k == val][0])+1 #Find position
PCNT_RANK = (1-(1-round(RANK/len(st),6)))*100 #Find percentage of value compared to range


print RANK, PCNT_RANK
>>> 17.0 70.8333


For the percentage ranks I could probably build a distribution and sample from it, not quite sure yet, any suggestions welcome...it's going to be used heavily so any speed-up will be advantageous.

Thanks.

Answer

The two slow parts of your code are:

  • st = np.sort(arr). Sorting the list is on average O(n log n) time, where n is the size of the list.

  • RANK = float([i for i,k in enumerate(st) if k == val][0])+1. Iterating through the list has O(n) time.

If you don't need to sort the list, then as @ChrisMueller points out, you can just iterate through it once without sorting, which takes O(n) time and will be the fastest option.

If you do need to sort the list (or have access to it pre-sorted), then the fastest option for the second step is RANK = np.searchsorted(st, val) + 1. Since the list is already sorted, finding the index will only take O(log n) time by binary search instead of having to iterate through the whole list. This will still be a lot faster than your original code.

Comments