I am doing some natural language processing with Python (2.7.9) and NLTK (3.2.1). The way I am currently doing things, every time I run my program I do part-of-speech tagging on a large corpus.
The resulting tagged corpus looks like a larger version of this:
[('a', 'DT'), ('better', 'JJR'), ('widower', 'JJR'), ('than', 'IN'),
('my', 'PRP$'), ('father', 'NN'), ('.', '.'), ('Aunt', 'NNP'),
('Sybil', 'NNP'), ('had', 'VBD'), ('pink-rimmed', 'JJ'), ('azure',
'JJ'), ('eyes', 'NNS'), ('and', 'CC'), ('a', 'DT'), ('waxen', 'JJ'),
('complexion', 'NN'), ('.', '.'), ('She', 'PRP'), ('wrote', 'VBD'),
('poetry', 'NN'), ('.', '.'), ('She', 'PRP'), ('was', 'VBD'),
('poetically', 'RB'), ('superstitious', 'JJ')]
from nltk import pos_tag

POScorpus = pos_tag(words)
# I convert this to a string so I can write it to a file.
POScorpus_string = str(POScorpus)
# I then write it to a file.
f = open(r'C:\Desktop\POScorpus.txt', 'w')
f.write(POScorpus_string)
f.close()
A string can be converted back into a list with the eval() function. That said, this is neither the most efficient nor the most memory-friendly solution to the problem, and eval() will execute whatever code the file happens to contain.
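If you do stay with the string approach, the standard library's ast.literal_eval() is a safer stand-in for eval(), because it only accepts Python literals (strings, numbers, tuples, lists, dicts and so on). A minimal sketch, assuming a small made-up sample in place of your real tagged corpus:

```python
import ast

# Hypothetical stand-in for the output of pos_tag(words)
tagged = [('She', 'PRP'), ('wrote', 'VBD'), ('poetry', 'NN'), ('.', '.')]

# This is what str(POScorpus) would write to the text file
as_text = str(tagged)

# Parse the text back into a list of tuples without executing arbitrary code
restored = ast.literal_eval(as_text)
```

This still has to parse the whole file as text on every run, so it does not address the efficiency concern.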
A better option is to use Python's cPickle module. "Pickling" refers to saving a Python object (for example, a list or dictionary) as a byte stream that can be loaded back into a variable later, with its type and structure intact. The general term for this is "serialization" (also called "marshalling").
Here is an example:
# HOW TO PICKLE THE POS-TAGGED CORPUS
# Pickling involves saving a Python object as a file (without first converting
# it to a string).
# Let's pickle TaggedCorpus so we can use it efficiently later:
import cPickle  # imports fast pickle module (written in C)
f = open(r'C:\Desktop\TaggedCorpus.p', 'wb')  # creates pickle file f (binary mode)
cPickle.dump(TaggedCorpus, f)  # dumps data of TaggedCorpus object to f
f.close()
# To unpickle the object, simply load the file into a variable:
f = open(r'C:\Desktop\TaggedCorpus.p', 'rb')  # opens the pickle file for reading (binary mode)
TaggedCorpus = cPickle.load(f)  # loads the content of f as TaggedCorpus
f.close()
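Since the goal is to avoid re-tagging on every run, the example above could be wrapped in a small cache helper. This is only a sketch under a few assumptions: save_tagged, load_tagged and load_or_compute are hypothetical names of my own, the sample list stands in for your real corpus, and a temporary directory replaces your actual path. The try/except import lets the same code run on Python 3, where cPickle no longer exists as a separate module.

```python
import os
import tempfile

try:
    import cPickle as pickle  # Python 2: fast C implementation
except ImportError:
    import pickle  # Python 3: plain pickle already uses the C accelerator


def save_tagged(tagged, path):
    """Pickle the tagged corpus to path (binary mode, newest protocol)."""
    with open(path, 'wb') as f:
        pickle.dump(tagged, f, pickle.HIGHEST_PROTOCOL)


def load_tagged(path):
    """Load a previously pickled tagged corpus from path."""
    with open(path, 'rb') as f:
        return pickle.load(f)


def load_or_compute(path, compute):
    """Return the cached corpus if path exists; otherwise build and cache it."""
    if os.path.exists(path):
        return load_tagged(path)
    tagged = compute()
    save_tagged(tagged, path)
    return tagged


# Hypothetical stand-in for the output of pos_tag(words)
sample = [('She', 'PRP'), ('wrote', 'VBD'), ('poetry', 'NN'), ('.', '.')]

path = os.path.join(tempfile.mkdtemp(), 'TaggedCorpus.p')
save_tagged(sample, path)
restored = load_tagged(path)
```

In your program the call would look something like load_or_compute(path, lambda: pos_tag(words)), so the tagger only runs when no pickle file exists yet. Keep in mind that unpickling can execute arbitrary code, so only load pickle files you created yourself.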