Nik P - 4 months ago
Python Question

HMM loaded from pickle looks untrained

I am trying to serialise nltk.tag.hmm.HiddenMarkovModelTagger into a pickle to use it when needed without re-training. However, after loading from .pkl my HMM looks untrained. My two questions here are:


  1. What am I doing wrong?

  2. Is it a good idea at all to serialise an HMM when one has a big dataset?



Here's the code:

In [1]: import nltk

In [2]: from nltk.probability import *

In [3]: from nltk.util import unique_list

In [4]: import json

In [5]: with open('data.json') as data_file:
   ...:     corpus = json.load(data_file)
   ...:

In [6]: corpus = [[tuple(l) for l in sentence] for sentence in corpus]

In [7]: tag_set = unique_list(tag for sent in corpus for (word,tag) in sent)

In [8]: symbols = unique_list(word for sent in corpus for (word,tag) in sent)

In [9]: trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols)

In [10]: train_corpus = corpus[:4]

In [11]: test_corpus = [corpus[4]]

In [12]: hmm = trainer.train_supervised(train_corpus, estimator=LaplaceProbDist)

In [13]: print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))
100.00%


As you can see, the HMM is trained. Now I pickle it:

In [14]: import pickle

In [16]: output = open('hmm.pkl', 'wb')

In [17]: pickle.dump(hmm, output)

In [18]: output.close()


After a reset and reload, the model looks dumber than a box of rocks:

In [19]: %reset
Once deleted, variables cannot be recovered. Proceed (y/[n])? y

In [20]: import pickle

In [21]: import json

In [22]: with open('data.json') as data_file:
   ....:     corpus = json.load(data_file)
   ....:

In [23]: test_corpus = [corpus[4]]

In [24]: pkl_file = open('hmm.pkl', 'rb')

In [25]: hmm = pickle.load(pkl_file)

In [26]: pkl_file.close()

In [27]: type(hmm)
Out[27]: nltk.tag.hmm.HiddenMarkovModelTagger

In [28]: print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))
0.00%

Answer

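1) The pickled model is fine; the problem is the test data. After In [22] you skipped the tuple conversion you did in In [6]. json.load returns each (word, tag) pair as a list, while the tagger outputs tuples, so no pair compares equal during evaluation and you get 0.00%. After In [22], you need to add: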

corpus = [[tuple(l) for l in sentence] for sentence in corpus]

2) Re-training the model every time you want to test it is time-consuming, so it is a good idea to pickle.dump your model once and load it when needed.
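
To illustrate point 1, here is a minimal sketch using only the standard library. It assumes that accuracy is computed by comparing gold and predicted (word, tag) pairs for equality. The accuracy helper and the toy sentence are hypothetical stand-ins, not NLTK code.

```python
import json

# Toy tagged sentence; the words and tags are made up for illustration.
gold = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]

# A JSON round trip turns every tuple into a list.
loaded = json.loads(json.dumps(gold))


def accuracy(reference, predicted):
    """Fraction of positions where the two sequences hold equal items."""
    matches = sum(1 for r, p in zip(reference, predicted) if r == p)
    return matches / len(reference)


# Pretend the tagger is perfect: it outputs the gold pairs as tuples.
predicted = list(gold)

# A list never compares equal to a tuple, so every pair counts as a miss.
print(accuracy(loaded, predicted))    # 0.0

# Converting the loaded pairs back to tuples restores the match.
restored = [tuple(pair) for pair in loaded]
print(accuracy(restored, predicted))  # 1.0
```

The same thing happens to a perfectly good model when the test corpus comes straight from json.load.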

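For point 2, a small sketch of saving and loading with context managers, so the files are closed even if an error occurs. The save_model and load_model helpers are hypothetical names, and a plain dict stands in for the trained tagger so the example runs without NLTK or a corpus.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Pickle any picklable object to the given path."""
    with open(path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load a previously pickled object from the given path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a trained tagger (assumption: your real model is picklable,
# as the question's In [27] output suggests for the HMM tagger).
model = {"tags": ["DT", "NN"], "transitions": {("DT", "NN"): 0.9}}

path = os.path.join(tempfile.mkdtemp(), "hmm.pkl")
save_model(model, path)
restored = load_model(path)

print(restored == model)  # True
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.
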