Roel Smeets - 1 month ago
Python Question

Using Python to create a (random) sample of n-words from text files

For my PhD project I am evaluating all existing Named Entity Recognition taggers for Dutch. In order to check the precision and recall of those taggers, I want to manually annotate all named entities in a random sample from my corpus. That manually annotated sample will function as the 'gold standard' against which I will compare the results of the different taggers.

My corpus consists of 170 Dutch novels. I am writing a Python script to generate a random sample of a specific number of words for each novel (which I will then annotate). All novels are stored in the same directory. The following script is meant to generate, for each novel in that directory, a random sample of n lines:

import random
import os
import glob
import sys
import errno

path = '/Users/roelsmeets/Desktop/libris_corpus_clean/*.txt'
files = glob.glob(path)

for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
            # number of lines from txt file
            random_sample_input = random.sample(f.readlines(), 100)

    except IOError as exc:
        # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR:
            raise


# This block of code writes the result of the previous to a new file
random_sample_output = open("randomsample", "w", encoding='utf-8')
random_sample_input = map(lambda x: x + "\n", random_sample_input)
random_sample_output.writelines(random_sample_input)
random_sample_output.close()


There are two problems with this code:


  1. Currently, I have put two novels (.txt files) in the directory, but the code only outputs a random sample for one of the two novels.

  2. Currently, the code samples a random number of LINES from each .txt file, but I would prefer to sample a random number of WORDS from each .txt file. Ideally, I would like to generate a sample of, say, the first or last 100 words of each of the 170 .txt files. In that case, the sample won't be random at all; but so far, I haven't found a way to create a sample without using the random library.



Could anyone suggest how to solve both problems? I am still new to Python and programming in general (I am a literary scholar), so I would be pleased to learn different approaches. Many thanks in advance!

Answer

A few suggestions:

Take random sentences, not words or lines. NE taggers work much better if the input consists of grammatical sentences, so you need to use a sentence splitter.

When you iterate over the files, random_sample_input ends up containing lines from only the last file, because it is overwritten on every iteration. You should move the block of code that writes the selected content to a file inside the for loop. You can then write the selected sentences either to one file or to separate files. E.g.:

out = open("selected-sentences.txt", "w", encoding='utf-8')

for text in files:
    try:
        with open(text, 'rt', encoding='utf-8') as f:
            sentences = sentence_splitter.tokenize(f.read())
            for sentence in random.sample(sentences, 100):
                print(sentence, file=out)

    except IOError as exc:
        # Do not fail if a directory is found, just ignore it.
        if exc.errno != errno.EISDIR:
            raise

out.close()

[edit] Here is how you should be able to use an NLTK sentence splitter:

import nltk.data
sentence_splitter = nltk.data.load("tokenizers/punkt/dutch.pickle")
text = "Dit is de eerste zin. Dit is de tweede zin."
print(sentence_splitter.tokenize(text))

Prints:

['Dit is de eerste zin.', 'Dit is de tweede zin.']

Note you'd need to download the Dutch tokenizer first, using nltk.download() from the interactive console.
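As for your second problem: if you want the first (or last) 100 words of each novel rather than a random sample, you don't need the random library at all; plain whitespace splitting is enough. A minimal sketch (first_n_words is just an illustrative helper name, not part of any library):

```python
def first_n_words(text, n=100):
    # Split on any whitespace and keep only the first n words.
    words = text.split()
    return " ".join(words[:n])

# For the last n words, use words[-n:] instead of words[:n].
```

You could call this on f.read() inside the same for loop over files as above, writing each result to its own output file. Note that splitting on whitespace is a rough notion of "word"; punctuation stays attached, which may or may not matter for your annotation.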
