PrimuS PrimuS - 1 month ago 8
Python Question

Inconsistent sorting with sort()

I have the following function to count words from a string and extract the top "n":

Function

def count_words(s, n):
"""Return the n most frequently occuring words in s."""

#Split words into list
wordlist = s.split()

#Count words
counts = Counter(wordlist)

#Get top n words
top_n = counts.most_common(n)

#Sort by first element, if tie by second
top_n.sort(key=lambda x: (-x[1], x[0]))

return top_n


So it sorts by occurance and if tied, alphabetically.
Following examples:

print count_words("cat bat mat cat cat mat mat mat bat bat cat", 3)


works (shows
[('cat', 4), ('mat', 4), ('bat', 3)]
)

print count_words("betty bought a bit of butter but the butter was bitter", 3)


does not work (shows
[('butter', 2), ('a', 1), ('bitter', 1)]
but should have
betty
instead of
bitter
as it they are tied and
be...
is before
bi...
)

print count_words("betty bought a bit of butter but the butter was bitter", 6)


works (shows
[('butter', 2), ('a', 1), ('betty', 1), ('bitter', 1), ('but', 1), ('of', 1)]
with
betty
before
bitter
as intended)

What could cause that (word-length maybe?) and how could I fix that?

Answer

You are asking for the top 3, and thus you cut of the data before you could have picked items in your specific sorting order.

Rather than have most_common() pre-sort then re-sort, use a heapq to sort by your custom criteria (provided n is smaller than the number of actual buckets):

import heapq

def count_words(s, n):
    """Return the n most frequently occuring words in s."""
    counts = Counter(s.split())
    key = lambda kv: (-kv[1], kv[0])
    if n >= len(counts):
        return sorted(counts.items(), key=key)
    return heapq.nsmallest(n, counts.items(), key=key)

On Python 2, you probably want to use iteritems() rather than items() for the above calls.

This re-creates the Counter.most_common() method, but with the updated key. Like the original, using a heapq makes sure this is bound to O(NlogK) performance rather than O(NlogN) (with N the number of buckets, and K the top element count you want to see).

Demo:

>>> count_words("cat bat mat cat cat mat mat mat bat bat cat", 3)
[('cat', 4), ('mat', 4), ('bat', 3)]
>>> count_words("betty bought a bit of butter but the butter was bitter", 3)
[('butter', 2), ('a', 1), ('betty', 1)]
>>> count_words("betty bought a bit of butter but the butter was bitter", 6)
[('butter', 2), ('a', 1), ('betty', 1), ('bit', 1), ('bitter', 1), ('bought', 1)]