Chris T. - 5 months ago
Python Question

Filtering tokens from a list by multiple conditions

I want to filter some tokens from a list by the following conditions:
1) token length greater than 5
2) frequency of appearance (in the original text) greater than 100

I used the following code:

#token_list is a list object containing tokenized words from raw text

from collections import Counter
c = Counter(token_list)
selected_tokens = [word for word in token_list if len(word) > 5 and c.item[2] > 100]

selected_tokens


But I can't seem to get it to work. I believe the error comes from 'c.item[2]', but I don't quite understand the mechanics behind 'Counter()'.

I would really appreciate it if someone could enlighten me on this.

Thank you.

Answer Source

Did someone say filter?

selected_tokens = list(filter(lambda x: len(x) > 5 and c[x] > 100, token_list))

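Note that you read a token's count by indexing the counter with the token itself, c[...], not with c.item[2]; a token that never appears gives a count of 0. You might also want to be wary of case issues (the same word present in different cases).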
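To make the Counter lookup and the case issue concrete, here is a minimal sketch. The sample token_list is made up for illustration, and lowercasing is only one possible way to normalize case; it may not suit every text.

```python
from collections import Counter

# Hypothetical sample data, standing in for the real token_list.
token_list = ["Python", "python", "parser", "parser", "parser", "code"]

c = Counter(token_list)
parser_count = c["parser"]    # count of a token that is present
missing_count = c["absent"]   # a token that is absent gives 0, not a KeyError

# Lowercasing before counting merges "Python" and "python" into one entry.
c_lower = Counter(token.lower() for token in token_list)
python_count = c_lower["python"]
```

If you count on lowercased tokens, remember to lowercase the token again when you look it up in the filter condition.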


If you want speed, use a list comprehension instead:

selected_tokens = [x for x in token_list if len(x) > 5 and c[x] > 100]

If you are looking to obtain words satisfying your condition without unwanted duplicates, work on a set instead of a list:

token_set = set(token_list)
selected_tokens = [x for x in token_set if len(x) > 5 and c[x] > 100]

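Beware, order is lost. If you want order without duplicates, use an OrderedDict (Python < 3.6) or a plain dict (Python >= 3.6; insertion order is an implementation detail in CPython 3.6 and a language guarantee from 3.7).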

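from collections import OrderedDict
# Keys keep insertion order, so each token is stored once, in first-seen order.
dict_ = OrderedDict()
for t in token_list:
    dict_[t] = None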

selected_tokens = [x for x in dict_ if len(x) > 5 and c[x] > 100]
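On Python 3.7 and later, the same ordered de-duplication can be sketched with a plain dict via dict.fromkeys. The sample token_list below is invented, and the frequency threshold is lowered from 100 to 1 only so that the tiny example selects something.

```python
from collections import Counter

# Hypothetical sample data; the real token_list comes from the tokenized text.
token_list = ["filter", "tokens", "filter", "list", "tokens", "counter"]
c = Counter(token_list)

# dict.fromkeys keeps one key per token, in first-seen order (guaranteed on 3.7+).
unique_tokens = dict.fromkeys(token_list)

# Threshold of 1 instead of 100, purely for this small illustration.
selected_tokens = [x for x in unique_tokens if len(x) > 5 and c[x] > 1]
```

Here "list" is dropped for being too short and "counter" for appearing only once, while the remaining tokens keep their original order.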

If a dict doesn't do it, you can look at the OrderedSet recipe (OrderedSet is not part of the standard library) and implement something to the same effect:

token_set = OrderedSet(token_list)
selected_tokens = [x for x in token_set if ...] # as usual
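For reference, something "to the same effect" could look like the sketch below. This is not the recipe mentioned above, just a hypothetical minimal class backed by a dict, assuming Python 3.7+ for the ordering guarantee. The sample data and the threshold of 1 are again made up for illustration.

```python
from collections import Counter


class OrderedSet:
    """Hypothetical minimal ordered set: unique items in first-seen order."""

    def __init__(self, iterable=()):
        # A dict with dummy values stores each item once and keeps insertion order.
        self._items = dict.fromkeys(iterable)

    def add(self, item):
        self._items[item] = None

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


# Hypothetical sample data; threshold lowered from 100 to 1 for the example.
token_list = ["ordered", "tokens", "ordered", "set", "tokens", "python"]
c = Counter(token_list)

token_set = OrderedSet(token_list)
selected_tokens = [x for x in token_set if len(x) > 5 and c[x] > 1]
```

A full set implementation would also provide removal and set operations such as union and intersection; this sketch covers only what the filtering step needs.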