Parth Bhoiwala Parth Bhoiwala - 6 months ago 31
Python Question

Why sorting in the end is faster than inserting in sorting order?

I am generating 100 random integers and I want to store them in a sorted array. The first approach that I tried was using a binary search to find the proper index to store each number at and then insert the number at that index. This way, after 100th random number, I will have a sorted array. Binary search has a time complexity of

log(N)
and
insert()
method has a time complexity of
N
so the final Big-O should be
O(Nlog(N))
right?

Below is the code for this approach:

def binary_search(start, end, item):
mid = (start + end)/2
if item > mlist[end]:
return end+1
elif item > mlist[mid]:
return binary_search(mid+1, end, item)
elif item < mlist[mid]:
return binary_search(start, mid-1, item)
else:
return int(math.ceil(mid))

begin = time.time()
for i in range(100):
rand = randint(0,100)
index = binary_search(0,len(mlist)-1,rand)
mlist.insert(index,rand)
elapsed = time.time()
print((elapsed-begin)*(10**4))


When I printed the difference between elapsed and begin time, I got 4.2414 microseconds.

My second approach was to simply add all random numbers in an array/list and then use the built in
sort()
method to sort it. The time complexity for
sort()
method is
Nlog(N)
.

begin = time.time()
mlist=[]
for i in range(100):
rand = randint(0,100)
mlist.append(rand)
mlist = sorted(mlist)
elapsed = time.time()
print((elapsed-begin)*(10**4))


The elapsed time for this approach was 1.9407 microseconds.

I don't understand that if the time complexities for both methods are same then what makes the second approach so much faster?

Answer Source

Your binary search insertion is O(N^2); each insertion has to move up to O(N) elements up one step to the right, and you do this N times. However, even if it was O(NlogN), the constant cost of the sorting code is far lower than your Python code could match.

If you do want to stick to bisect insertion sorting, rather than re-invent the (admittedly simple) bisect wheel, do use the bisect module. This module comes with a C-optimised implementation.

It even has bisect.insort*() functions, which note:

Keep in mind that the O(log n) search is dominated by the slow O(n) insertion step.

Another tip: don't use wall-clock time to measure algorithms. Use the timeit module, which disables the garbage collector, uses the most accurate clock available, and runs your test multiple times to eliminate external factors.

Next, don't include creating the random values, you don't want to time how fast those can be produced; produce one list, up front, and re-use it for all timings.

Next, use a proper bisect() function, yours is broken for any len(mlist) < 2; there is no need to test for mlist[end] for example. The following avoids an off-by-one error and uses mlist as a parameter rather than a global:

def binary_search(mlist, item, start=0, end=None):
    if end is None:
        end = len(mlist)
    if start >= end:
        return start
    mid = (start + end) // 2
    if item >= mlist[mid]:
        return binary_search(mlist, item, mid + 1, end)
    else:
        return binary_search(mlist, item, start, mid)

Use binary_search(mlist, value) to call it, the start and end values are filled in for you.

Now you can conduct a proper timed test:

>>> import random, timeit
>>> values = [random.randint(0, 100) for _ in range(100)]
>>> def binsort(l):
...     mlist = []
...     for i in l:
...         index = binary_search(0, len(mlist), i, mlist)
...         mlist.insert(index, i)
...     return mlist
...
>>> count, time = timeit.Timer('binsort(values)', 'from __main__ import values, binsort').autorange()
>>> format(time / count, '.15f')
'0.000146628010299'
>>> count, time = timeit.Timer('sorted(values)', 'from __main__ import values').autorange()
>>> format(time / count, '.15f')
'0.000008379445840'
>>> values = [random.randint(0, 100) for _ in range(1000)]
>>> count, time = timeit.Timer('binsort(values)', 'from __main__ import values, binsort').autorange()
>>> format(time / count, '.15f')
'0.002460538140149'
>>> count, time = timeit.Timer('sorted(values)', 'from __main__ import values').autorange()
>>> format(time / count, '.15f')
'0.000144566002200'