bean bean - 1 month ago 10
Python Question

Why is my implementation of binary search very inefficient?

I am doing a Python exercise to search a

word
from a given sorted
wordlist
, containing more than 100,000 words.

When using
bisect_left
from the Python
bisect
module
, it is very efficient, but using the binary method created by myself is very inefficient. Could anyone please clarify why?

This is the searching method using the Python
bisect
module:

def in_bisect(word_list, word):
"""Checks whether a word is in a list using bisection search.

Precondition: the words in the list are sorted

word_list: list of strings
word: string
"""
i = bisect_left(word_list, word)
if i != len(word_list) and word_list[i] == word:
return True
else:
return False


My implementation is really very inefficient (don't know why):

def my_bisect(wordlist,word):
"""search the given word in a wordlist using
bisection search, also known as binary search
"""
if len(wordlist) == 0:
return False
if len(wordlist) == 1:
if wordlist[0] == word:
return True
else:
return False

if word in wordlist[len(wordlist)/2:]:
return True

return my_bisect(wordlist[len(wordlist)/2:],word)

Answer
if word in wordlist[len(wordlist)/2:] 

will make Python search through half of your wordlist, which is kinda defeating the purpose of writing a binary search in the first place. Also, you are not splitting the list in half correctly. The strategy for binary search is to cut the search space in half each step, and then only apply the same strategy to the half which your word could be in. In order to know which half is the right one to search, it is critical that the wordlist is sorted. Here's a sample implementation which keeps track of the number of calls needed to verify whether a word is in wordlist.

import random

numcalls = 0
def bs(wordlist, word):
    # increment numcalls
    print('wordlist',wordlist)
    global numcalls
    numcalls += 1

    # base cases
    if not wordlist:
        return False
    length = len(wordlist)
    if length == 1:
        return wordlist[0] == word

    # split the list in half
    mid = int(length/2) # mid index
    leftlist = wordlist[:mid]
    rightlist = wordlist[mid:]
    print('leftlist',leftlist)
    print('rightlist',rightlist)
    print()

    # recursion
    if word < rightlist[0]:
        return bs(leftlist, word) # word can only be in left list
    return bs(rightlist, word) # word can only be in right list

alphabet = 'abcdefghijklmnopqrstuvwxyz'
wl = sorted(random.sample(alphabet, 10))
print(bs(wl, 'm'))
print(numcalls)

I included some print statements so you can see what is going on. Here are two sample outputs. First: word is in the wordlist:

wordlist ['b', 'c', 'g', 'i', 'l', 'm', 'n', 'r', 's', 'v']
leftlist ['b', 'c', 'g', 'i', 'l']
rightlist ['m', 'n', 'r', 's', 'v']

wordlist ['m', 'n', 'r', 's', 'v']
leftlist ['m', 'n']
rightlist ['r', 's', 'v']

wordlist ['m', 'n']
leftlist ['m']
rightlist ['n']

wordlist ['m']
True
4

Second: word is not in the wordlist:

wordlist ['a', 'c', 'd', 'e', 'g', 'l', 'o', 'q', 't', 'x']
leftlist ['a', 'c', 'd', 'e', 'g']
rightlist ['l', 'o', 'q', 't', 'x']

wordlist ['l', 'o', 'q', 't', 'x']
leftlist ['l', 'o']
rightlist ['q', 't', 'x']

wordlist ['l', 'o']
leftlist ['l']
rightlist ['o']

wordlist ['l']
False
4

Note that if you double the size of the wordlist, i.e. use

wl = sorted(random.sample(alphabet, 20))

numcalls on average will be only one higher than for a wordlist of length 10, because wordlist has to be split in half only once more.