I have some very large lists that I am working with (>1M rows). Given a float, I am trying to find a fast (the fastest?) way of ranking it against a list of floats and finding its percentage rank within the range of the list. Here is my attempt, but it's extremely slow:
import numpy as np

val = 1.5
arr = np.array(X)  # X is actually a pandas column, hence the conversion
arr = np.insert(arr, 1, val, axis=None)  # insert val into arr, to then be ranked
st = np.sort(arr)
RANK = float([i for i, k in enumerate(st) if k == val][0]) + 1  # find position
PCNT_RANK = round(RANK / len(st), 6) * 100  # percentage of value compared to range
print(RANK, PCNT_RANK)
>>> 17.0 70.8333
The two slow parts of your code are:
- `st = np.sort(arr)` — sorting the list takes O(n log n) time on average, where n is the size of the list.
- `RANK = float([i for i,k in enumerate(st) if k == val])+1` — iterating through the whole list takes O(n) time.
If you don't need to sort the list, then as @ChrisMueller points out, you can just iterate through it once without sorting, which takes O(n) time and will be the fastest option.
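That single-pass idea can be sketched as follows — count how many elements are less than or equal to the value, which a vectorized comparison does in one O(n) pass with no sort. The `percentile_rank` helper name and the sample values are mine, not from the question; the denominator uses `len(arr) + 1` to mirror the insert-then-sort approach above, where `val` itself is part of the ranked list.

```python
import numpy as np

def percentile_rank(arr, val):
    # Rank = number of elements <= val, plus one for val itself,
    # matching the question's insert-then-sort convention.
    rank = int(np.count_nonzero(arr <= val)) + 1
    # Percentage rank over the list including the inserted value.
    pcnt = round(rank / (len(arr) + 1), 6) * 100
    return rank, pcnt

arr = np.asarray([0.2, 1.1, 1.4, 1.9, 2.3])  # stand-in for the pandas column
print(percentile_rank(arr, 1.5))  # rank is 4 (three elements are <= 1.5)
```

The comparison `arr <= val` builds a boolean mask in C speed, so this avoids both the Python-level list comprehension and the sort entirely.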
If you do need to sort the list (or have access to it pre-sorted), then the fastest option for the second step is `RANK = np.searchsorted(st, val) + 1`. Since the list is already sorted, finding the index takes only O(log n) time by binary search instead of iterating through the whole list. The O(n log n) sort still dominates if you have to do it yourself, but the lookup itself becomes negligible, so this will still be a lot faster than your original code.