host_255 - 4 months ago
Python Question

Google's Python Course wordcount.py

I am taking Google's Python Course, which uses Python 2.7. I am running 3.5.2.

The script below works; it was one of my exercises.

#!/usr/bin/python -tt
# Copyright 2010 Google Inc.
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0

# Google's Python Class
# http://code.google.com/edu/languages/google-python-class/

"""Wordcount exercise
Google's Python class

The main() below is already defined and complete. It calls print_words()
and print_top() functions which you write.

1. For the --count flag, implement a print_words(filename) function that counts
how often each word appears in the text and prints:
word1 count1
word2 count2
...

Print the above list in order sorted by word (python will sort punctuation to
come before letters -- that's fine). Store all the words as lowercase,
so 'The' and 'the' count as the same word.

2. For the --topcount flag, implement a print_top(filename) which is similar
to print_words() but which prints just the top 20 most common words sorted
so the most common word is first, then the next most common, and so on.

Use str.split() (no arguments) to split on all whitespace.

Workflow: don't build the whole program at once. Get it to an intermediate
milestone and print your data structure and sys.exit(0).
When that's working, try for the next milestone.

Optional: define a helper function to avoid code duplication inside
print_words() and print_top().

"""

import sys

# +++your code here+++
# Define print_words(filename) and print_top(filename) functions.
# You could write a helper utility function that reads a file
# and builds and returns a word/count dict for it.
# Then print_words() and print_top() can just call the utility function.

###

def word_count_dict(filename):
    """Returns a word/count dict for this filename."""
    # Utility used by print_words() and print_top().
    word_count={} #Map each word to its count
    input_file=open(filename, 'r')
    for line in input_file:
        words=line.split()
        for word in words:
            word=word.lower()
            # Special case if we're seeing this word for the first time.
            if not word in word_count:
                word_count[word]=1
            else:
                word_count[word]=word_count[word] + 1
    input_file.close() # Not strictly required, but good form.
    return word_count

def print_words(filename):
    """Prints one per line '<word> <count>' sorted by word for the given file."""
    word_count=word_count_dict(filename)
    words=sorted(word_count.keys())
    for word in words:
        print(word,word_count[word])

def get_count(word_count_tuple):
    """Returns the count from a dict word/count tuple -- used for custom sort."""
    return word_count_tuple[1]

def print_top(filename):
    """Prints the top count listing for the given file."""
    word_count=word_count_dict(filename)

    # Each item is a (word, count) tuple.
    # Sort them so the big counts are first, using key=get_count to extract the count.
    items=sorted(word_count.items(), key=get_count, reverse=True)

    # Print the first 20
    for item in items[:20]:
        print(item[0], item[1])

# This basic command line argument parsing code is provided and
# calls the print_words() and print_top() functions which you must define.
def main():
    if len(sys.argv) != 3:
        print('usage: ./wordcount.py {--count | --topcount} file')
        sys.exit(1)

    option = sys.argv[1]
    filename = sys.argv[2]
    if option == '--count':
        print_words(filename)
    elif option == '--topcount':
        print_top(filename)
    else:
        print('unknown option: ' + option)
        sys.exit(1)

if __name__ == '__main__':
    main()


Here are my questions that the course is not answering:


  1. Where it says the following, I am unsure what the 1 and the + 1 mean. Does word_count[word]=1 mean "if the word is not in the list, add it to the list"? And I don't understand what each part of word_count[word]=word_count[word] + 1 means.

    if not word in word_count:
        word_count[word]=1
    else:
        word_count[word]=word_count[word] + 1

  2. When it says word_count.keys(), I am not sure what that does, other than that it refers to the keys in the dictionary we defined and loaded keys and values into. I just want to understand why word_count.keys() is there.

    words=sorted(word_count.keys())

  3. word_count is redefined in a couple of locations, and I would like to know why that is done instead of creating a new variable name such as word_count1.

    word_count={}
    word_count=word_count_dict(filename)
    ...and also in the places outlined in my 1st question.

  4. Does if len(sys.argv) != 3: mean that my arguments are not 3, or that my characters are not 3 (e.g. sys.argv[1], sys.argv[2], sys.argv[3])?



Thank you for your help!

Answer
  1. If word is not already in the dictionary, we create a new entry in the dictionary for it and set its value to the number 1, since so far we've found just 1 occurrence of the word. Otherwise, we retrieve the old value from the dictionary, use + 1 to add 1 to that value, and then put the result back in the dictionary entry by assigning it to word_count[word]. The increment in the else branch could also be written as:

    word_count[word] += 1
    
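  2. word_count.keys() returns all the keys in the word_count dictionary (a list in Python 2, a view object in Python 3). It is used so that the contents of the dictionary can be printed in alphabetical order: sorted() takes those keys and returns a new, alphabetically sorted list. If you just printed the dictionary the way it is, the words would not come out in alphabetical order; in Python 3.5 their order is effectively unpredictable.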

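  3. The variable is not being redefined. Variables are local to each function, so each word_count is a different variable. They just happen to use the same name in each function, because it's a good name for what the variable contains. The assignments from your 1st question don't redefine it either: word_count[word]=... changes one entry inside the dictionary, not the variable itself.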

  4. List indexes start at 0, so if len(sys.argv) != 3: checks whether sys.argv holds anything other than exactly three items: argv[0], argv[1], and argv[2] (there is no argv[3]). argv[0] always contains the script name, so this is checking that you gave 2 arguments to the script. The first argument must be either --count or --topcount and the second argument must be the filename to count the words in.
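
A minimal sketch of points 1, 2, and 4, in case seeing them run helps. The sample string, the count_words helper, and the fake_argv list are made up for illustration and are not part of the course code; collections.Counter is the standard-library shortcut for the same tally.

```python
from collections import Counter

def count_words(text):
    """Return a word -> count dict for a string (hypothetical helper)."""
    word_count = {}
    for word in text.split():
        word = word.lower()
        if word not in word_count:
            word_count[word] = 1    # first sighting: create the entry
        else:
            word_count[word] += 1   # same as word_count[word] = word_count[word] + 1
    return word_count

sample = "the cat and The dog and the bird"   # made-up input
counts = count_words(sample)

# Point 2: sorted() takes the keys and returns a new alphabetical list.
for word in sorted(counts.keys()):
    print(word, counts[word])

# Counter builds the same tally in one call.
same_tally = Counter(sample.lower().split())

# Point 4: a simulated sys.argv for "./wordcount.py --count file.txt".
# Index 0 is the script name, so two real arguments give a length of 3.
fake_argv = ["./wordcount.py", "--count", "file.txt"]
print(len(fake_argv))
```

Running it prints each word with its count in alphabetical order, then 3 for the length of the simulated argument list.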