Gabriele Picco Gabriele Picco - 1 year ago 136
Python Question

Most frequently occuring n words in a string

I have a problem with the following problem:


Implement a function count_words() in Python that takes as input a string s and a number n, and returns the n most frequently-occuring words in s. The return value should be a list of tuples - the top n words paired with their respective counts [(, ), (, ), ...], sorted in descending count order.

You can assume that all input will be in lowercase and that there will be no punctuations or other characters (only letters and single separating spaces). In case of a tie (equal count), order the tied words alphabetically.


print count_words("betty bought a bit of butter but the butter was bitter",3)

[('butter', 2), ('a', 1), ('betty', 1)]

This is my solution:

"""Count words."""

from operator import itemgetter
from collections import Counter

def count_words(s, n):
"""Return the n most frequently occuring words in s."""

# TODO: Count the number of occurences of each word in s
words = s.split(" ");
words = Counter(words)
# TODO: Sort the occurences in descending order (alphabetically in case of ties)
# TODO: Return the top n words as a list of tuples (<word>, <count>)
top_n = words.most_common(n)
return top_n

def test_run()

"""Test count_words() with some inputs."""
print(count_words("cat bat mat cat bat cat", 3))
print(count_words("betty bought a bit of butter but the butter was bitter", 3))

if __name__ == '__main__':

The problem is that elements with equal counts are ordered arbitrarily, how can i order that elements by alphabetical order ??

Answer Source

You can sort them using the number of occurrence (in reverse order) and then the lexicographical order:

>>> lst = [('meat', 2), ('butter', 2), ('a', 1), ('betty', 1)]
>>> sorted(lst, key=lambda x: (-x[1], x[0]))
#                              ^ reverse order 
[('butter', 2), ('meat', 2), ('a', 1), ('betty', 1)]

The number of occurrence takes precedence over the lex. order.

In your case, use words.items() in place of the list of the list I have used. You will no longer need to use most_common as sorted already does the same.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download