I want to filter some tokens from a list by the following conditions.
1) token length greater than 5
2) frequency of appearance (in the original text) greater than 100
I used the following code:
#token_list is a list object containing tokenized words from raw text
from collections import Counter
c = Counter(token_list)
selected_tokens = [word for word in token_list if len(word) > 5 and c.item > 100]
Did someone say filter?
selected_tokens = list(filter(lambda x: len(x) > 5 and c[x] > 100, token_list))
Also, you access a token's count in the Counter with c[...], not c.item. You might also want to be wary of case issues (the same word appearing in different cases).
If you want speed, use a list comprehension instead:
selected_tokens = [x for x in token_list if len(x) > 5 and c[x] > 100]
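On the case issue mentioned above, here is a minimal sketch of one way to handle it: lowercase every token before counting, so that variants such as "Python" and "python" share a single count. The sample token_list and the name normalized are made up for illustration; whether lowercasing is appropriate depends on your text.

```python
from collections import Counter

# Hypothetical sample data; in practice token_list comes from your tokenizer.
token_list = ["Python"] * 60 + ["python"] * 50 + ["token"] * 200 + ["counter"] * 30

# Lowercase every token first so case variants are counted together.
normalized = [t.lower() for t in token_list]
c = Counter(normalized)

# Same filter as before, applied to the normalized tokens.
selected_tokens = [x for x in normalized if len(x) > 5 and c[x] > 100]
```

Without the normalization step, "Python" (60) and "python" (50) would each fall below the threshold and neither would be selected.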
If you want the words satisfying your condition without unwanted duplicates, work on a set instead of a list:
token_set = set(token_list)
selected_tokens = [x for x in token_set if len(x) > 5 and c[x] > 100]
Beware, order is lost. If you want order without duplicates, use an OrderedDict (Python < 3.6) or a plain dict (Python >= 3.6; insertion order is only a language guarantee from Python 3.7):
from collections import OrderedDict

dict_ = OrderedDict()
for t in token_list:
    dict_[t] = None
selected_tokens = [x for x in dict_ if len(x) > 5 and c[x] > 100]
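For the plain dict variant, a possible shortcut is dict.fromkeys, which builds the de-duplicated, insertion-ordered keys in one call. This is only a sketch, assuming Python 3.7+; the sample token_list and the name unique_tokens are made up for illustration.

```python
from collections import Counter

# Hypothetical sample data with a repeated word appearing in two runs.
token_list = ["banana"] * 101 + ["cherry"] * 150 + ["banana"] * 5 + ["apple"] * 300
c = Counter(token_list)

# dict.fromkeys keeps one key per distinct token, in first-seen order.
unique_tokens = dict.fromkeys(token_list)

# Each qualifying word appears once, in order of first appearance.
selected_tokens = [x for x in unique_tokens if len(x) > 5 and c[x] > 100]
```

Here "apple" is frequent enough but too short, so only the two longer words survive the filter.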
If a dict doesn't do it, you can look at the OrderedSet recipe and implement something to the same effect:
token_set = OrderedSet(token_list)
selected_tokens = [x for x in token_set if ...]  # as usual
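If you do not want to pull in the recipe, something to the same effect can be sketched on top of a dict. The class below is a hypothetical stand-in, not the OrderedSet recipe itself: it only assumes Python 3.7+ dict ordering and the collections.abc.MutableSet base class, and the sample token_list is made up.

```python
from collections import Counter
from collections.abc import MutableSet


class OrderedSet(MutableSet):
    """Minimal insertion-ordered set backed by a dict (illustrative only)."""

    def __init__(self, iterable=()):
        # Keys keep first-seen order; values are unused placeholders.
        self._items = dict.fromkeys(iterable)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def add(self, item):
        self._items[item] = None

    def discard(self, item):
        self._items.pop(item, None)


# Hypothetical sample data.
token_list = ["orange"] * 120 + ["papaya"] * 101 + ["orange"] * 3 + ["kiwi"] * 500
c = Counter(token_list)

token_set = OrderedSet(token_list)
selected_tokens = [x for x in token_set if len(x) > 5 and c[x] > 100]
```

Inheriting from MutableSet supplies the usual set operators (|, &, -) for free, at the cost of being slower than a built-in set.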