sparsity sparsity - 2 months ago 13
Python Question

Implementing an algorithm to determine if a string has all unique characters

Context: I'm a CS n00b working my way through "Cracking the Coding Interview." The first problem asks to "implement an algorithm to determine if a string has all unique characters." My (likely naive) implementation is as follows:

def isUniqueChars2(string):
uchars = []
for c in string:
if c in uchars:
return False
else:
uchars.append(c)
return True


The author suggests the following implementation:

def isUniqueChars(string):
checker = 0
for c in string:
val = ord(c) - ord('a')
if (checker & (1 << val) > 0):
return False
else:
checker |= (1 << val)
return True


What makes the author's implementation better than mine (FWIW, the author's solution was in Java and I converted it to Python -- is my solution one that is not possible to implement in Java)? Or, more generally, what is desirable in a solution to this problem? What is wrong with the approach I've taken? I'm assuming there are some fundamental CS concepts (that I'm not familiar with) that are important and help inform the choice of which approach to take to this problem.

Answer

Here is how I would write this:

def unique(s):
    return len(set(s)) == len(s)

Strings are iterable so you can pass your argument directly to set() to get a set of the characters from the string (which by definition will not contain any duplicates). If the length of that set is the same as the length of the original string then you have entirely unique characters.

Your current approach is fine and in my opinion it is much more Pythonic and readable than the version proposed by the author, but you should change uchars to be a set instead of a list. Sets have O(1) membership test so c in uchars will be considerably faster on average if uchars is a set rather than a list. So your code could be written as follows:

def unique(s):
    uchars = set()
    for c in s:
        if c in uchars:
            return False
        uchars.add(c)
    return True

This will actually be more efficient than my version if the string is large and there are duplicates early, because it will short-circuit (exit as soon as the first duplicate is found).