Vibes Vibes - 1 month ago 10
Python Question

Returning list of words counted on different list

Good afternoon guys,

Today I have been asked to write the following function:

def compareurl(url1,url2,enc,n)


This function compares two URLs and returns a list containing:

[word,occ_in_url1,occ_in_url2]


where:

word ---> a word of length n

occ_in_url1 ---> number of times the word occurs in url1

occ_in_url2 ---> number of times the word occurs in url2

So I started writing the function. This is what I have written so far:

def compare_url(url1,url2,enc,n):
    from urllib.request import urlopen
    with urlopen('url1') as f1:
        readpage1 = f1.read()
        decodepage1 = readpage1.decode('enc')
    with urlopen('url2') as f2:
        readpage2 = f2.read()
        decodepage2 = readpage2.decode('enc')
    all_lower1 = decodepage1.lower()
    all_lower2 = decodepage2.lower()
    import string
    all_lower1nopunctuation = "".join(l for l in all_lower1 if l not in string.punctuation)
    all_lower2nopunctuation = "".join(l for l in all_lower2 if l not in string.punctuation)
    for word1 in all_lower1nopunctuation:
        if len(word1) == k:
            all_lower1nopunctuation.count(word1)
    for word2 in all_lower2nopunctuation:
        if len(word2) == k:
            all_lower2opunctuation.count(word2)
    return(word1,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))
    return(word2,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))


But this code doesn't work the way I expected; in fact, it doesn't work at all.

I would also like to:


  1. sort the returned list in decreasing order (starting with the word that
    occurs the most times)

  2. if two words occur the same number of times, they must be returned in
    alphabetical order


Answer

There are some typos in your code (watch out for those in the future), but there are also some Python problems (and things that can be improved).


First of all, your imports should go at the top of the file:

from urllib.request import urlopen
import string

urlopen expects a string, and you are passing one, but that string is the literal text 'url1' and not the address stored in the variable url1 ('http://...'). Variable names must not be wrapped in quotes. The same applies to .decode('enc'):

with urlopen(url1) as f1: #remove quotes
    readpage1 = f1.read()
    decodepage1 = readpage1.decode(enc) #remove quotes
with urlopen(url2) as f2: #remove quotes
    readpage2 = f2.read()
    decodepage2 = readpage2.decode(enc) #remove quotes
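
To see why the quotes matter without touching the network, here is a small sketch. The value of enc and the sample text are made up for illustration; the point is that 'enc' in quotes is just a three-letter string, which Python does not recognise as a codec name.

```python
enc = 'utf-8'                 # example value, standing in for the function argument
raw = 'café'.encode(enc)      # some bytes, as f1.read() would return

# Passing the variable: the bytes are decoded with the codec named by enc.
decoded = raw.decode(enc)

# Passing the quoted name: Python looks for a codec literally called 'enc'.
try:
    raw.decode('enc')
    quoted_error = None
except LookupError as exc:
    quoted_error = exc

print(decoded)
print(type(quoted_error).__name__)
```

urlopen('url1') fails for the same kind of reason: the literal text 'url1' is not a valid address.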

You also need to change how all_lower1nopunctuation is built. Deleting punctuation turns stackoverflow.com into stackoverflowcom, when it should become stackoverflow com. Replace each punctuation character with a space instead:

#all_lower1nopunctuation = "".join(l for l in all_lower1 if l not in string.punctuation)
#the if statement should be after 'l' and before 'for'
#you should include 'else' to replace the punctuation with a space
all_lower1nopunctuation = ''.join(l if l not in string.punctuation
else ' ' for l in all_lower1)
all_lower2nopunctuation = ''.join(l if l not in string.punctuation
else ' ' for l in all_lower2)
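
A quick way to check the difference, using the same stackoverflow.com example. The str.translate variant is an alternative added here for comparison and is not part of the original code; it should give the same result as the join.

```python
import string

text = 'stackoverflow.com'

# Original approach: punctuation is dropped, so the two words fuse together.
dropped = ''.join(ch for ch in text if ch not in string.punctuation)

# Corrected approach: punctuation becomes a space, so the words stay separate.
spaced = ''.join(ch if ch not in string.punctuation else ' ' for ch in text)

# Alternative sketch: a translation table that maps every punctuation
# character to a space.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
translated = text.translate(table)

print(dropped)     # stackoverflowcom
print(spaced)      # stackoverflow com
print(translated)  # stackoverflow com
```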

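Merge both for loops into one, and collect the words you find in a set (an unordered collection of unique elements).

all_lower1nopunctuation.count(word1) returns the number of times word1 appears in all_lower1nopunctuation. It doesn't increment a counter, and because the result is never stored, the call has no effect.

for word1 in all_lower1nopunctuation doesn't iterate over words, because all_lower1nopunctuation is a string (and not a list), so the loop visits one character at a time. Transform it into a list of words with .split().

Called with no argument, .split() splits on any run of whitespace, including line breaks, and never produces empty strings. Deleting line breaks with .replace('\n', '') would instead glue the last word of one line to the first word of the next.

The membership test must also run against the words of the second page, not the raw string: word in some_string is a substring test, so 'other' would be found inside 'another'.

#for word1 in all_lower1nopunctuation:
#    if len(word1) == k: #also, this should be == n, not == k
#        all_lower1nopunctuation.count(word1)
#for word2 in all_lower2nopunctuation:
#    if len(word2) == k:
#        all_lower2opunctuation.count(word2)

page2_words = set(all_lower2nopunctuation.split())
word_set = set()
for word in all_lower1nopunctuation.split():
    if len(word) == n and word in page2_words:
        word_set.add(word) #set uses .add() instead of .append()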
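
The behaviour of strings and split described above can be checked with a short sketch. The sample text is invented; it contains a line break and a double space on purpose.

```python
text = 'hello world\nfoo  bar'

# Iterating over a string yields single characters, not words.
first_items = [item for item in text][:3]

# split(' ') only splits on single spaces: the line break stays inside a
# token and the double space produces an empty string.
by_space = text.split(' ')

# Deleting line breaks first glues 'world' and 'foo' together.
glued = text.replace('\n', '').split(' ')

# split() with no argument splits on any run of whitespace.
by_whitespace = text.split()

print(first_items)    # ['h', 'e', 'l']
print(by_space)       # ['hello', 'world\nfoo', '', 'bar']
print(glued)          # ['hello', 'worldfoo', '', 'bar']
print(by_whitespace)  # ['hello', 'world', 'foo', 'bar']
```

For word counting, the last form is usually the one you want.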

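Now that you have the set of words that appear at both URLs, you need to count how many times each word occurs on each page. Count against the split word lists rather than the raw strings, because str.count counts substrings. The following code builds the list of tuples you asked for:

count_list = []
word_list1 = all_lower1nopunctuation.split()
word_list2 = all_lower2nopunctuation.split()
for final_word in word_set:
    count_list.append((final_word,
                       word_list1.count(final_word),
                       word_list2.count(final_word)))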
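
One thing worth checking when counting words: .count() on a string counts substrings, while .count() on a list counts whole items. A small sketch with an invented sentence shows how far apart the two can be.

```python
text = 'another other mother other'
words = text.split()

# str.count looks for the substring anywhere, even inside longer words.
substring_hits = text.count('other')

# list.count only matches items that are exactly equal to 'other'.
word_hits = words.count('other')

# The in operator has the same split personality.
the_in_text = 'the' in text     # substring test
the_in_words = 'the' in words   # whole-item test

print(substring_hits, word_hits)   # 4 2
print(the_in_text, the_in_words)   # True False
```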

Returning means the function is finished and the interpreter continues wherever it was before the function was called, so whatever comes after the return is irrelevant.

As said by RemcoGerlich.

Your code will only ever execute the first return, so you need to merge both returns into one.

#return(word1,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))
#return(word2,all_lower1nopunctuation.count(word1),all_lower2opunctuation.count(word2))
return count_list # a list of tuples, each holding a word and its two counts
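
A tiny illustration of the point about return. The function names are made up for this sketch; the second function shows the usual fix of collecting everything first and returning once.

```python
def two_returns():
    return 'first'
    return 'second'  # never reached: the function has already finished


def one_return():
    results = []
    results.append('first')
    results.append('second')
    return results  # a single return that carries both values


print(two_returns())  # first
print(one_return())   # ['first', 'second']
```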

TL;DR

from urllib.request import urlopen
import string

def compare_url(url1,url2,enc,n):
    with urlopen(url1) as f1:
        readpage1 = f1.read()
        decodepage1 = readpage1.decode(enc)
    with urlopen(url2) as f2:
        readpage2 = f2.read()
        decodepage2 = readpage2.decode(enc)

    all_lower1 = decodepage1.lower()
    all_lower2 = decodepage2.lower()

    all_lower1nopunctuation = ''.join(l if l not in string.punctuation
    else ' ' for l in all_lower1)
    all_lower2nopunctuation = ''.join(l if l not in string.punctuation
    else ' ' for l in all_lower2)

    # split() with no argument splits on any whitespace, line breaks included
    words1 = all_lower1nopunctuation.split()
    words2 = all_lower2nopunctuation.split()

    # words of length n that occur on both pages
    word_set = set()
    for word in words1:
        if len(word) == n and word in words2:
            word_set.add(word)

    # count whole words, not substrings
    count_list = []
    for final_word in word_set:
        count_list.append((final_word,
                           words1.count(final_word),
                           words2.count(final_word)))

    return count_list

url1 = 'https://www.tutorialspoint.com/python/list_count.htm'
url2 = 'http://stackoverflow.com/a/128577/7067541'

for word_count in compare_url(url1, url2, 'utf-8', 5):
    print(word_count)
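
The question also asks for the result sorted by decreasing count, with ties in alphabetical order, which the code above does not do. Below is a minimal sketch of one way to handle it. It works on plain strings so it can be tried offline. The names count_words and compare_texts are hypothetical, the use of collections.Counter is my own substitution for the manual counting, and I am assuming "most times" means the combined count across both pages.

```python
import string
from collections import Counter


def count_words(text, n):
    """Count the words of length n in text, ignoring case and punctuation."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(word for word in words if len(word) == n)


def compare_texts(text1, text2, n):
    """Return (word, count1, count2) for words of length n found in both texts."""
    counts1 = count_words(text1, n)
    counts2 = count_words(text2, n)
    common = counts1.keys() & counts2.keys()
    rows = [(word, counts1[word], counts2[word]) for word in common]
    # Assumption: rank by combined count, highest first.
    # Ties fall back to alphabetical order on the word itself.
    return sorted(rows, key=lambda row: (-(row[1] + row[2]), row[0]))


page1 = 'Apple pears, apple plums. Grape apple grape!'
page2 = 'Grape juice and apple juice; plums, grape too.'

result = compare_texts(page1, page2, 5)
for row in result:
    print(row)
# ('apple', 3, 1)
# ('grape', 2, 2)
# ('plums', 1, 1)
```

Here apple and grape both total 4, so the alphabetical rule puts apple first. To use this with real pages, the decoded page text (decodepage1 and decodepage2 above) could be passed in place of the sample strings.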