
Training IOB Chunker using nltk.tag.brill_trainer (Transformation-Based Learning)

I'm trying to train a specific chunker (let's say a noun chunker for simplicity) using NLTK's brill module. I'd like to use three features: the word, the POS tag, and the IOB tag.


  • Ramshaw and Marcus (1995:7) show 100 templates generated from combinations of those three features, for example,

    W0, P0, T0 # current word, pos tag, iob tag
    W-1, P0, T-1 # prev word, pos tag, prev iob tag
    ...



I want to incorporate them into nltk.tbl.feature, but there are only two kinds of feature objects, i.e. brill.Word and brill.Pos. Constrained by this design, I could only put the word and POS features together as one feature, (word, pos), and thus used ((word, pos), iob) pairs as features for training. For example,

from nltk.tbl import Template
from nltk.tag import brill, brill_trainer, untag
from nltk.corpus import treebank_chunk
from nltk.chunk.util import tree2conlltags, conlltags2tree

# Code from (Perkins, 2013)
def train_brill_tagger(initial_tagger, train_sents, **kwargs):
    templates = [
        brill.Template(brill.Word([0])),
        brill.Template(brill.Pos([-1])),
        brill.Template(brill.Word([-1])),
        brill.Template(brill.Word([0]), brill.Pos([-1])),
        ]
    trainer = brill_trainer.BrillTaggerTrainer(
        initial_tagger, templates, trace=3)
    return trainer.train(train_sents, **kwargs)

# generating ((word, pos), iob) pairs as features
def chunk_trees2train_chunks(chunk_sents):
    tag_sents = [tree2conlltags(sent) for sent in chunk_sents]
    return [[((w, t), c) for (w, t, c) in sent] for sent in tag_sents]

>>> from nltk.tag import DefaultTagger
>>> tagger = DefaultTagger('NN')
>>> train = treebank_chunk.chunked_sents()[:2]
>>> t = chunk_trees2train_chunks(train)
>>> bt = train_brill_tagger(tagger, t)
TBL train (fast) (seqs: 2; tokens: 31; tpls: 4; min score: 2; min acc: None)
Finding initial useful rules...
Found 79 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  12  12   0  17  | NN->I-NP if Pos:NN@[-1]
   3   3   0   0  | I-NP->O if Word:(',', ',')@[0]
   2   2   0   0  | I-NP->B-NP if Word:('the', 'DT')@[0]
   2   2   0   0  | I-NP->O if Word:('.', '.')@[0]


As shown above, (word, pos) is treated as a single feature, which does not properly capture the three features (word, POS tag, IOB tag) separately.


  • Is there any other way to implement the word, POS, and IOB features separately in nltk.tbl.feature?

  • If it is impossible in NLTK, are there other implementations in Python? I was only able to find C++ and Java implementations on the internet.


Answer

The nltk3 brill trainer api (I wrote it) does handle training on sequences of tokens described with multidimensional features, of which your data is an example. However, the practical limits may be severe: the number of possible templates in multidimensional learning increases drastically, and the current nltk implementation of the brill trainer trades memory for speed, similar to Ramshaw and Marcus 1994, "Exploring the statistical derivation of transformation-rule sequences...". Memory consumption may be HUGE, and it is very easy to give the system more data and/or templates than it can handle. A useful strategy is to rank templates according to how often they produce good rules (see print_template_statistics() in the example below). Usually, you can discard the lowest-scoring fraction (say 50-90%) with little or no loss in performance and a major decrease in training time.

Another or additional possibility is to use the nltk implementation of Brill's original algorithm, which has very different memory-speed tradeoffs: it does no indexing and so will use much less memory. It applies some optimizations and is actually rather quick at finding the very best rules, but is generally extremely slow towards the end of training, when there are many competing, low-scoring candidates. Sometimes you don't need those, anyway. For some reason this implementation seems to have been omitted from newer nltks, but here is the source (I just tested it): http://www.nltk.org/_modules/nltk/tag/brill_trainer_orig.html.

There are other algorithms with other tradeoffs, and in particular the fast memory-efficient indexing algorithms of Florian and Ngai 2000 (http://www.aclweb.org/anthology/N/N01/N01-1006.pdf) and the probabilistic rule sampling of Samuel 1998 (https://www.aaai.org/Papers/FLAIRS/1998/FLAIRS98-045.pdf) would be useful additions. Also, as you noticed, the documentation is not complete and too focused on part-of-speech tagging, and it is not clear how to generalize from it. Fixing the docs is (also) on the todo list.

However, interest in generalized (non-POS-tagging) tbl in nltk has been rather limited (the totally unsuited api of nltk2 was untouched for 10 years), so don't hold your breath. If you get impatient, you may wish to check out more dedicated alternatives, in particular mutbl and fntbl (google them, I only have reputation for two links).

Anyway, here is a quick sketch for nltk:

First, a hardcoded convention in nltk is that tagged sequences ('tags' meaning any label you would like to assign to your data, not necessarily part-of-speech) are represented as sequences of pairs, [(token1, tag1), (token2, tag2), ...]. The tags are strings; in many basic applications, so are the tokens. For instance, the tokens may be words and the strings their POS, as in

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

(As an aside, this sequence-of-token-tag-pairs convention is pervasive in nltk and its documentation, but it should arguably be better expressed as named tuples rather than pairs, so that instead of saying

[token for (token, _tag) in tagged_sequence]

you could say for instance

[x.token for x in tagged_sequence]

The first case fails on non-pairs, but the second exploits duck typing so that tagged_sequence could be any sequence of user-defined instances, as long as they have an attribute "token".)
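To make that aside concrete, here is a minimal sketch of the named-tuple variant; the Token type and its field names are my own invention, not anything nltk defines:

```python
from collections import namedtuple

# A hypothetical token type; nothing in nltk requires this exact shape.
Token = namedtuple("Token", ["token", "tag"])

tagged_sequence = [Token("And", "CC"), Token("now", "RB"), Token("for", "IN")]

# Works for any objects exposing a .token attribute (duck typing),
# not only for this particular namedtuple:
tokens = [x.token for x in tagged_sequence]
print(tokens)  # ['And', 'now', 'for']
```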

Now, you could well have a richer representation of what a token is at your disposal. An existing tagger interface (nltk.tag.api.FeaturesetTaggerI) expects each token to be a featureset rather than a string: a dictionary mapping feature names to feature values for each item in the sequence.

A tagged sequence may then look like

[({'word': 'Pierre', 'tag': 'NNP', 'iob': 'B-NP'}, 'NNP'),
 ({'word': 'Vinken', 'tag': 'NNP', 'iob': 'I-NP'}, 'NNP'),
 ({'word': ',',      'tag': ',',   'iob': 'O'   }, ','),
 ...
]

There are other possibilities (though with less support in the rest of nltk). For instance, you could have a named tuple for each token, or a user-defined class which allows you to add any amount of dynamic calculation to attribute access (perhaps using @property to offer a consistent interface).
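As a sketch of the user-defined-class option (again my own invention, not an nltk interface), @property lets a derived feature be computed on demand while looking like a plain attribute:

```python
# A hypothetical token class; the class and its "shape" feature
# are illustration only, not part of nltk.
class RichToken:
    def __init__(self, word, tag, iob):
        self.word = word
        self.tag = tag
        self.iob = iob

    @property
    def shape(self):
        # a derived feature, computed dynamically from the word
        return "".join("X" if c.isupper() else "x" for c in self.word)

t = RichToken("Pierre", "NNP", "B-NP")
print(t.shape)  # Xxxxxx
```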

The brill tagger doesn't need to know what view you currently provide on your tokens. However, it does require you to provide an initial tagger which can map sequences of tokens-in-your-representation to sequences of tags. You cannot use the existing taggers in nltk.tag.sequential directly, since they expect [(word, tag), ...]. But you may still be able to exploit them. The example below uses this strategy (in MyInitialTagger), and the token-as-featureset-dictionary view.

from __future__ import division, print_function, unicode_literals

import sys

from nltk import tbl, untag
from nltk.tag.brill_trainer import BrillTaggerTrainer
# or: 
# from nltk.tag.brill_trainer_orig import BrillTaggerTrainer
# 100 templates and a tiny 500 sentences (11700
# tokens) produce 420000 rules and use a
# whopping 1.3GB of memory on my system;
# brill_trainer_orig is much slower, but uses 0.43GB

from nltk.corpus import treebank_chunk
from nltk.chunk.util import tree2conlltags
from nltk.tag import DefaultTagger


def get_templates():
    wds10 = [[Word([0])],
             [Word([-1])],
             [Word([1])],
             [Word([-1]), Word([0])],
             [Word([0]), Word([1])],
             [Word([-1]), Word([1])],
             [Word([-2]), Word([-1])],
             [Word([1]), Word([2])],
             [Word([-1,-2,-3])],
             [Word([1,2,3])]]

    pos10 = [[Tag([0])],
             [Tag([-1])],
             [Tag([1])],
             [Tag([-1]), Tag([0])],
             [Tag([0]), Tag([1])],
             [Tag([-1]), Tag([1])],
             [Tag([-2]), Tag([-1])],
             [Tag([1]), Tag([2])],
             [Tag([-1, -2, -3])],
             [Tag([1, 2, 3])]]

    iobs5 = [[IOB([0])],
             [IOB([-1]), IOB([0])],
             [IOB([0]), IOB([1])],
             [IOB([-2]), IOB([-1])],
             [IOB([1]), IOB([2])]]


    # the 5 * (10+10) = 100 3-feature templates 
    # of Ramshaw and Marcus
    templates = [tbl.Template(*wdspos+iob) 
        for wdspos in wds10+pos10 for iob in iobs5]
    # Footnote:
    # any template-generating functions in new code 
    # (as opposed to recreating templates from earlier
    # experiments like Ramshaw and Marcus) might 
    # also consider the mass generating Feature.expand()
    # and Template.expand(). See the docs, or for 
    # some examples the original pull request at
    # https://github.com/nltk/nltk/pull/549 
    # ("Feature- and Template-generating factory functions")

    return templates

def build_multifeature_corpus():
    # We cannot, of course, use truepos for 
    # training, so templates cannot refer to it.
    # But we may wish to keep it for reference.

    def tuple2dict_featureset(sent, tagnames=("word", "truepos", "iob")):
        return (dict(zip(tagnames, t)) for t in sent)

    def tag_tokens(tokens):
        return [(t, t["truepos"]) for t in tokens]
    # connlltagged_sents :: [[(word,tag,iob)]]
    connlltagged_sents = (tree2conlltags(sent) 
        for sent in treebank_chunk.chunked_sents())
    conlltagged_tokenses = (tuple2dict_featureset(sent) 
        for sent in connlltagged_sents)
    conlltagged_sequences = (tag_tokens(sent) 
        for sent in conlltagged_tokenses)
    return conlltagged_sequences

class Word(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]["word"]

class IOB(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][0]["iob"]

class Tag(tbl.Feature):
    @staticmethod
    def extract_property(tokens, index):
        return tokens[index][1]


class MyInitialTagger(DefaultTagger):
    def choose_tag(self, tokens, index, history):
        tokens_ = [t["word"] for t in tokens]
        # explicit super() arguments keep this compatible with Python 2,
        # which the __future__ imports above suggest is intended
        return super(MyInitialTagger, self).choose_tag(tokens_, index, history)


def main(argv):
    templates = get_templates()
    trainon = 100

    corpus = list(build_multifeature_corpus())
    train, test = corpus[:trainon], corpus[trainon:]

    print(train[0], "\n")

    initial_tagger = MyInitialTagger('NN')
    print(initial_tagger.tag(untag(train[0])), "\n")

    trainer = BrillTaggerTrainer(initial_tagger, templates, trace=3)
    tagger = trainer.train(train)

    taggedtest = tagger.tag_sents([untag(t) for t in test])
    print(test[0])
    print(initial_tagger.tag(untag(test[0])))
    print(taggedtest[0])
    print()

    tagger.print_template_statistics()

if __name__ == '__main__':
    sys.exit(main(sys.argv))