Gonçalo Correia - 5 months ago
Python Question

Finding one or more strings of a certain kind in a list of tuples

Let's say I have the following tuples:

tagged = [('They', 'PRP'),
('refuse', 'VBP'),
('to', 'TO'),
('permit', 'VB'),
('us', 'PRP'),
('to', 'TO'),
('obtain', 'VB'),
('the', 'DT'),
('refuse', 'NN'),
('permit', 'NN')]


I want to extract every run of one or more consecutive nouns: each noun on its own, plus the joined phrase when nouns are adjacent. The output would therefore be:

['refuse','permit','refuse permit']


I'm able to get the first two like this:

filtered = [x[0] for x in tagged if x[1]=='NN']


But I'm currently unable to find a way to get sequences of 'NN' in the list.

EDIT:

This list is a better example:

[('If', 'IN'),
('the', 'DT'),
('company', 'NN'),
('name', 'NN'),
('or', 'CC'),
('job', 'NN'),
('title', 'NN'),
('includes', 'VBZ'),
('multiple', 'JJ'),
('words', 'NNS'),
(',', ','),
('use', 'NN'),
('double', 'JJ'),
('quotation', 'NN'),
('marks', 'NNS'),
('.', '.')]


Should return:

['company', 'name', 'company name', 'job', 'title', 'job title', 'use', 'quotation']

Answer

This is a fairly simple itertools.groupby operation with a little post-processing. groupby collects consecutive items that share a key, so if we group by the tags and keep only the 'NN' groups, we get exactly the runs of adjacent nouns. All that remains is to yield each noun on its own and, for groups with more than one item, the space-joined phrase as well, in that order:

from itertools import groupby

def group_nouns(iterable):
    for key, group in groupby(iterable, key=lambda t: t[1]):
        if key == 'NN':  # only worry about groups of nouns.
            seq = [t[0] for t in group]  # drop tags.
            if len(seq) == 1:
                yield seq[0]
            else:
                for noun in seq:
                    yield noun
                yield ' '.join(seq)
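
For runs of three or more nouns, group_nouns above yields the single nouns and the whole run, but not the shorter phrases in between. If "all combinations" is meant to include every contiguous sub-sequence, something along these lines might work. This is only a sketch under that assumption: the name noun_ngrams and its tag parameter are hypothetical, not part of the original answer. For runs of one or two nouns it produces the same output as group_nouns.

```python
from itertools import groupby

def noun_ngrams(tagged, tag='NN'):
    """Yield every contiguous sub-sequence of each run of `tag` words."""
    for key, group in groupby(tagged, key=lambda t: t[1]):
        if key != tag:
            continue  # skip runs of any other tag
        run = [word for word, _ in group]  # drop tags
        # Shorter n-grams first, so single nouns come before joined phrases.
        for n in range(1, len(run) + 1):
            for start in range(len(run) - n + 1):
                yield ' '.join(run[start:start + n])

tagged = [('If', 'IN'), ('the', 'DT'), ('company', 'NN'), ('name', 'NN'),
          ('or', 'CC'), ('job', 'NN'), ('title', 'NN'), ('includes', 'VBZ'),
          ('multiple', 'JJ'), ('words', 'NNS'), (',', ','), ('use', 'NN'),
          ('double', 'JJ'), ('quotation', 'NN'), ('marks', 'NNS'), ('.', '.')]

result = list(noun_ngrams(tagged))
print(result)
# ['company', 'name', 'company name', 'job', 'title', 'job title', 'use', 'quotation']
```

Because the function is a generator, wrap the call in list() when an actual list is needed, as in the question's expected output.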