For my PhD project I am evaluating all existing Named Entity Recognition taggers for Dutch. To check the precision and recall of those taggers, I want to manually annotate all named entities in a random sample from my corpus. That manually annotated sample will function as the 'gold standard' against which I will compare the results of the different taggers.
My corpus consists of 170 Dutch novels. I am writing a Python script to generate a random sample of a specific number of words for each novel (which I will annotate afterwards). All novels are stored in the same directory. The following script is meant to generate, for each novel in that directory, a random sample of n lines:
```python
import errno
import glob
import random

path = '/Users/roelsmeets/Desktop/libris_corpus_clean/*.txt'
files = glob.glob(path)

for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
            # number of lines from txt file
            random_sample_input = random.sample(f.readlines(), 100)
    except IOError as exc:
        # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR:
            raise

# This block of code writes the result of the previous to a new file
random_sample_output = open("randomsample", "w", encoding='utf-8')
random_sample_input = map(lambda x: x + "\n", random_sample_input)
random_sample_output.writelines(random_sample_input)
random_sample_output.close()
```
A few suggestions:
Take random sentences, not words or lines. NE taggers will work much better if the input consists of grammatical sentences, so you need to use a sentence splitter.
After you iterate over the files, random_sample_input contains lines from only the last file. You should move the block of code that writes the selected content to a file inside the for loop. You can then write the selected sentences either to one file or to separate files. E.g.:
```python
out = open("selected-sentences.txt", "w", encoding='utf-8')

for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
            sentences = sentence_splitter.tokenize(f.read())
            for sentence in random.sample(sentences, 100):
                print(sentence, file=out)
    except IOError as exc:
        # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR:
            raise

out.close()
```
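If you prefer the other option, one sample file per novel, a minimal sketch could look like the following. The helper name `write_samples` and the `novel.txt -> novel-sample.txt` naming scheme are my own choices, not from your script; the sentence splitter is passed in as a callable so you can plug in the Punkt tokenizer shown above:

```python
import errno
import os
import random


def write_samples(files, split, n=100):
    """For each input file, write up to n randomly sampled sentences
    to a separate output file (novel.txt -> novel-sample.txt)."""
    for text in files:
        try:
            with open(text, 'rt', encoding='utf-8') as f:
                # split() turns the raw text into a list of sentences
                sentences = split(f.read())
        except IOError as exc:
            # Do not fail if a directory is found, just ignore it.
            if exc.errno != errno.EISDIR:
                raise
            continue
        # min() avoids a ValueError for novels shorter than n sentences
        sample = random.sample(sentences, min(n, len(sentences)))
        base, _ = os.path.splitext(os.path.basename(text))
        with open(base + "-sample.txt", "w", encoding='utf-8') as out:
            out.write("\n".join(sample))
```

You would call it as, e.g., `write_samples(files, sentence_splitter.tokenize)`, with `files` and `sentence_splitter` defined as in the snippets above.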
Here is how you can use an NLTK sentence splitter:

```python
import nltk.data

sentence_splitter = nltk.data.load("tokenizers/punkt/dutch.pickle")
text = "Dit is de eerste zin. Dit is de tweede zin."
print(sentence_splitter.tokenize(text))
# ['Dit is de eerste zin.', 'Dit is de tweede zin.']
```
Note that you'd need to download the Dutch tokenizer first, using nltk.download() from the interactive console.