user2988577 user2988577 - 1 month ago 12
Python Question

Building up a counting function

I need to build up a counting function starting from a dictionary. The dictionary is a classical Bag_of_Words and looks like as follows:

D={'the':5, 'pow':2, 'poo':2, 'row':2, 'bub':1, 'bob':1}


I need the function that for a given integer returns the number of words with at least that number of occurrences. In the example F(2)=4, all words but 'bub' and 'bob'.

First of all I build up the inverse dictionary of D:

ID={5:1, 2:3, 1:2}


I think I'm fine with that. Then here is the code:

values=list(ID.keys())
values.sort(reverse=True)
Lk=[]
Nw=0
for val in values:
Nw=Nw+ID[val]
Lk.append([Nw, val])


The code works fine but I do not like it. The point is that I would prefer to use a list comprehension to build up Lk; also I really ate the Nw variable I have used. It does not seems pythonic at all

Answer

you can create a sorted array of your word counts then find the insertion point with np.searchsorted to get how many items are to either side of it... np.searchsorted is very efficient and fast. If your dictionary doesn't change often this call is basically free compared to other methods

import numpy as np

def F(n, D):
    #creating the array each time would be slow if it doesn't change move this 
    #outside the function
    arr = np.array(D.values())
    arr.sort()
    L = len(arr)
    return L - np.searchsorted(arr, n) #this line does all the work...

what's going on....

first we take just the word counts (and convert to a sorted array)...

D = {"I'm": 12, "pretty": 3, "sure":12, "the": 45, "Donald": 12, "is": 3, "on": 90, "crack": 11}
vals = np.arrau(D.values())
#vals = array([90, 12, 12,  3, 11, 12, 45,  3])

vals.sort()
#vals = array([ 3,  3, 11, 12, 12, 12, 45, 90])

then if we want to know how many values are greater than or equal to n, we simply find the length of the list beyond the first number greater than or equal to n. We do this by determining the leftmost index where n would be inserted (insertion sort) and subtracting that from the total number of positions (len)

# how many are >= 10?

# insertion point for value of 10..
#
#               | index: 2
#               v
# array([ 3,  3, 11, 12, 12, 12, 45, 90])

#find how many elements there are
#len(arr) = 8

#subtract.. 2-8 = 6 elements that are >= 10