For my PhD project I am evaluating all existing Named Entity Recognition taggers for Dutch. To check the precision and recall of those taggers, I want to manually annotate all named entities in a random sample from my corpus. That manually annotated sample will serve as the 'gold standard' against which I will compare the results of the different taggers.
My corpus consists of 170 Dutch novels. I am writing a Python script to generate a random sample of a specific number of words for each novel (which I will then annotate). All novels are stored in the same directory. The following script is meant to generate, for each novel in that directory, a random sample of n lines:
import random
import os
import glob
import sys
import errno
path = '/Users/roelsmeets/Desktop/libris_corpus_clean/*.txt'
files = glob.glob(path)
for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
            # number of lines from txt file
            random_sample_input = random.sample(f.readlines(), 100)
    except IOError as exc:
        # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR:
            raise
# This block of code writes the result of the previous to a new file
random_sample_output = open("randomsample", "w", encoding='utf-8')
random_sample_input = map(lambda x: x+"\n", random_sample_input)
random_sample_output.writelines(random_sample_input)
random_sample_output.close()
A few suggestions:
Take random sentences, not words or lines. NE taggers will work much better if the input consists of grammatical sentences. So you need to use a sentence splitter.
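A related pitfall, whether you sample lines or sentences: random.sample raises a ValueError when the requested sample size exceeds the population, which a short novel can easily trigger with a fixed size of 100. A small guard caps the sample at the available size (safe_sample is a hypothetical helper name, just to illustrate):

```python
import random

def safe_sample(items, n):
    # Take at most n items, so short inputs don't raise ValueError.
    return random.sample(items, min(n, len(items)))

sentences = ["Zin %d." % i for i in range(10)]
print(len(safe_sample(sentences, 100)))  # 10: capped at the population size
print(len(safe_sample(sentences, 3)))    # 3
```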
When you iterate over the files, random_sample_input
ends up containing lines from only the last file, because it is overwritten on every iteration. You should move the block of code that writes the selected content to a file inside the for-loop. You can then write the selected sentences either to one file or to separate files. E.g.:
out = open("selected-sentences.txt", "w", encoding='utf-8')

for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
            # The punkt tokenizer expects a single string, not a list of lines
            sentences = sentence_splitter.tokenize(f.read())
            for sentence in random.sample(sentences, 100):
                print(sentence, file=out)
    except IOError as exc:
        # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR:
            raise

out.close()
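If you prefer the separate-files variant, here is a minimal sketch; sample_per_novel is a hypothetical helper, and naive_split (splitting on '. ') is only a crude stand-in for a real sentence splitter such as the NLTK one:

```python
import os
import random
import tempfile

def naive_split(text):
    # Placeholder for a real sentence splitter; too crude for production use.
    return [s.strip() for s in text.split('. ') if s.strip()]

def sample_per_novel(in_dir, out_dir, n):
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(in_dir):
        path = os.path.join(in_dir, name)
        if not name.endswith('.txt') or not os.path.isfile(path):
            continue  # skip subdirectories and non-text files
        with open(path, encoding='utf-8') as f:
            sentences = naive_split(f.read())
        # Cap at the available number of sentences to avoid ValueError
        chosen = random.sample(sentences, min(n, len(sentences)))
        out_path = os.path.join(out_dir, name.replace('.txt', '-sample.txt'))
        with open(out_path, 'w', encoding='utf-8') as out:
            out.write('\n'.join(chosen))

# Demo with two tiny fake "novels" in a temporary directory
src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(src, 'novel%d.txt' % i), 'w', encoding='utf-8') as f:
        f.write('Zin een. Zin twee. Zin drie. Zin vier.')

sample_per_novel(src, dst, 3)
print(sorted(os.listdir(dst)))  # ['novel0-sample.txt', 'novel1-sample.txt']
```

Each output file then holds the sampled sentences for exactly one novel, which keeps the annotation work traceable per book.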
[edit] Here is how you should be able to use an NLTK sentence splitter:
import nltk.data
sentence_splitter = nltk.data.load("tokenizers/punkt/dutch.pickle")
text = "Dit is de eerste zin. Dit is de tweede zin."
print(sentence_splitter.tokenize(text))
Prints:
['Dit is de eerste zin.', 'Dit is de tweede zin.']
Note that you'd need to download the Dutch tokenizer models first, using nltk.download()
from the interactive console.