itsme itsme - 1 month ago 16
Java Question

Read annotated data from GATE datastore

I use GATE to manually annotate a large number of texts with the emotions they contain. To process these texts further, I would like to export them from the datastore into my own Java application. I couldn't find any documentation on how to do that. I have already written a program that imports data into the datastore, but I have no idea how to get the annotated text back out. I also tried to open the Lucene-based datastore with Luke (https://code.google.com/p/luke/), a tool that can read a Lucene index, but it could not open the GATE Lucene datastore :( Does anyone have an idea how to read the annotated text from the datastore?

Answer

You can use GATE APIs to load the documents from the datastore and then export them as GATE XML in the normal way (imports and exception handling omitted):

Gate.init();
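// gate.persist.LuceneDataStoreImpl is the concrete Lucene-backed datastore class;
// gate.creole.annic.SearchableDataStore is only the interface it implements
DataStore ds = Factory.openDataStore("gate.persist.LuceneDataStoreImpl", "file:/path/to/datastore");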
List docIds = ds.getLrIds("gate.corpora.DocumentImpl");
for(Object id : docIds) {
  Document d = (Document)Factory.createResource("gate.corpora.DocumentImpl",
            gate.Utils.featureMap(DataStore.DATASTORE_FEATURE_NAME, ds,
                                  DataStore.LR_ID_FEATURE_NAME, id));
  try {
    File outputFile = new File(...); // based on doc name, sequential number, etc.
    DocumentStaxUtils.writeDocument(d, outputFile);
  } finally {
    Factory.deleteResource(d);
  }
}
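The output file name is left open above ("based on doc name, sequential number, etc."). One possible way to build it, as a minimal sketch: the class `OutputNames`, the method `outputFileFor`, the `export` directory and the `NNNN_name.xml` pattern are all hypothetical choices for illustration, not part of GATE.

```java
import java.io.File;
import java.util.Locale;

final class OutputNames {

    /**
     * Builds a file such as "0007_my_doc_1.txt.xml" inside outputDir.
     * The index keeps names unique even if two documents share a name;
     * characters that are unsafe in file names are replaced with '_'.
     */
    static File outputFileFor(File outputDir, String docName, int index) {
        String base = (docName == null || docName.isEmpty()) ? "document" : docName;
        String safe = base.replaceAll("[^A-Za-z0-9._-]", "_");
        return new File(outputDir, String.format(Locale.ROOT, "%04d_%s.xml", index, safe));
    }

    public static void main(String[] args) {
        File f = outputFileFor(new File("export"), "my doc/1.txt", 7);
        System.out.println(f.getName()); // prints 0007_my_doc_1.txt.xml
    }
}
```

Inside the loop this could be called as `outputFileFor(dir, d.getName(), i)`, with `dir` and a counter `i` declared before the loop.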

If you want to write the annotations as inline XML instead, replace the DocumentStaxUtils.writeDocument call with something like:

Set<String> types = new HashSet<String>();
types.add("Person");
types.add("Location"); // and whatever others you're interested in
// pass the encoding explicitly rather than relying on the platform default
FileUtils.write(outputFile, d.toXml(d.getAnnotations().get(types), true), "UTF-8");

(I'm using FileUtils from Apache commons-io for convenience, but you could equally handle opening and closing the file yourself.)
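If you would rather not depend on commons-io, handling the file yourself might look like the sketch below, which uses only the JDK (Java 7 or later). The class `PlainFileWrite` and the method `writeString` are hypothetical names, and the sample string only stands in for whatever `d.toXml(...)` returns.

```java
import java.io.File;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class PlainFileWrite {

    /** Writes content to outputFile as UTF-8; try-with-resources closes the writer. */
    static void writeString(File outputFile, String content) throws IOException {
        try (Writer w = Files.newBufferedWriter(outputFile.toPath(), StandardCharsets.UTF_8)) {
            w.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("gate-export", ".xml");
        tmp.deleteOnExit();
        // Placeholder text; in the loop above this would be the result of d.toXml(...)
        writeString(tmp, "<Person>Ada</Person> visited <Location>Turin</Location>");
        System.out.println(new String(Files.readAllBytes(tmp.toPath()), StandardCharsets.UTF_8));
    }
}
```

The writer is closed even if the write throws, which mirrors the try/finally around `Factory.deleteResource(d)` in the export loop.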