ajsp - 11 months ago 48

Python Question

I have some very large lists that I am working with (>1M rows), and I am trying to find a fast (the fastest?) way of, given a float, ranking that float compared to the list of floats, and finding it's percentage rank compared to the range of the list. Here is my attempt, but it's extremely slow:

`X =[0.595068426145485,`

0.613726840488019,

1.1532608695652,

1.92952380952385,

4.44137931034496,

3.46432160804035,

2.20331487122673,

2.54736842105265,

3.57702702702689,

1.93202764976956,

1.34720184204056,

0.824997304105564,

0.765782842381996,

0.615110856990126,

0.622708022872803,

1.03211045820975,

0.997225012974318,

0.496352327702226,

0.67103858866700,

0.452224068868272,

0.441842124852685,

0.447584524952608,

0.4645525042246]

val = 1.5

arr = np.array(X) #X is actually a pandas column, hence the conversion

arr = np.insert(arr,1,val, axis=None) #insert the val into arr, to then be ranked

st = np.sort(arr)

RANK = float([i for i,k in enumerate(st) if k == val][0])+1 #Find position

PCNT_RANK = (1-(1-round(RANK/len(st),6)))*100 #Find percentage of value compared to range

print RANK, PCNT_RANK

>>> 17.0 70.8333

For the percentage ranks I could probably build a distribution and sample from it, not quite sure yet, any suggestions welcome...it's going to be used heavily so any speed-up will be advantageous.

Thanks.

Answer

The two slow parts of your code are:

`st = np.sort(arr)`

. Sorting the list is on average O(n log n) time, where n is the size of the list.`RANK = float([i for i,k in enumerate(st) if k == val][0])+1`

. Iterating through the list has O(n) time.

**If you don't need to sort the list**, then as @ChrisMueller points out, you can just iterate through it once without sorting, which takes O(n) time and will be the fastest option.

**If you do need to sort the list** (or have access to it pre-sorted), then the fastest option for the second step is `RANK = np.searchsorted(st, val) + 1`

. Since the list is already sorted, finding the index will only take O(log n) time by binary search instead of having to iterate through the whole list. This will still be a lot faster than your original code.