Parth Bhoiwala - 6 months ago 31

Python Question

I am generating 100 random integers and I want to store them in a sorted array. The first approach that I tried was using a binary search to find the proper index to store each number at and then insert the number at that index. This way, after 100th random number, I will have a sorted array. Binary search has a time complexity of

`log(N)`

`insert()`

`N`

`O(Nlog(N))`

Below is the code for this approach:

`def binary_search(start, end, item):`

mid = (start + end)/2

if item > mlist[end]:

return end+1

elif item > mlist[mid]:

return binary_search(mid+1, end, item)

elif item < mlist[mid]:

return binary_search(start, mid-1, item)

else:

return int(math.ceil(mid))

begin = time.time()

for i in range(100):

rand = randint(0,100)

index = binary_search(0,len(mlist)-1,rand)

mlist.insert(index,rand)

elapsed = time.time()

print((elapsed-begin)*(10**4))

When I printed the difference between elapsed and begin time, I got

My second approach was to simply add all random numbers in an array/list and then use the built in

`sort()`

`sort()`

`Nlog(N)`

`begin = time.time()`

mlist=[]

for i in range(100):

rand = randint(0,100)

mlist.append(rand)

mlist = sorted(mlist)

elapsed = time.time()

print((elapsed-begin)*(10**4))

The elapsed time for this approach was

I don't understand that if the time complexities for both methods are same then what makes the second approach so much faster?

Answer Source

Your binary search insertion is O(N^2); *each insertion* has to move up to O(N) elements up one step to the right, and you do this N times. However, even if it was O(NlogN), the constant cost of the sorting code is *far lower* than your Python code could match.

If you do want to stick to bisect insertion sorting, rather than re-invent the (admittedly simple) `bisect`

wheel, do use the `bisect`

module. This module comes with a C-optimised implementation.

It even has `bisect.insort*()`

functions, which note:

Keep in mind that the O(log n) search is dominated by the slow O(n) insertion step.

Another tip: don't use wall-clock time to measure algorithms. Use the `timeit`

module, which disables the garbage collector, uses the most accurate clock available, and runs your test *multiple times* to eliminate external factors.

Next, don't include creating the random values, you don't want to time how fast those can be produced; produce one list, *up front*, and re-use it for all timings.

Next, use a proper `bisect()`

function, yours is broken for any `len(mlist) < 2`

; there is *no need* to test for `mlist[end]`

for example. The following avoids an off-by-one error and uses `mlist`

as a parameter rather than a global:

```
def binary_search(mlist, item, start=0, end=None):
if end is None:
end = len(mlist)
if start >= end:
return start
mid = (start + end) // 2
if item >= mlist[mid]:
return binary_search(mlist, item, mid + 1, end)
else:
return binary_search(mlist, item, start, mid)
```

Use `binary_search(mlist, value)`

to call it, the start and end values are filled in for you.

Now you can conduct a proper timed test:

```
>>> import random, timeit
>>> values = [random.randint(0, 100) for _ in range(100)]
>>> def binsort(l):
... mlist = []
... for i in l:
... index = binary_search(0, len(mlist), i, mlist)
... mlist.insert(index, i)
... return mlist
...
>>> count, time = timeit.Timer('binsort(values)', 'from __main__ import values, binsort').autorange()
>>> format(time / count, '.15f')
'0.000146628010299'
>>> count, time = timeit.Timer('sorted(values)', 'from __main__ import values').autorange()
>>> format(time / count, '.15f')
'0.000008379445840'
>>> values = [random.randint(0, 100) for _ in range(1000)]
>>> count, time = timeit.Timer('binsort(values)', 'from __main__ import values, binsort').autorange()
>>> format(time / count, '.15f')
'0.002460538140149'
>>> count, time = timeit.Timer('sorted(values)', 'from __main__ import values').autorange()
>>> format(time / count, '.15f')
'0.000144566002200'
```