AlmightyR AlmightyR - 1 month ago 14
Java Question

Java: Apache POI: Can I get clean text from MS Word (.doc) files?

The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.

When using the following code:

File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);

the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "
", "
HYPERLINK \l "_Toc##########"
" ('#' being numeric digits), "
PAGEREF _Toc########## \h 4
", etc.

The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:

File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){

I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?

If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?


There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).

The first option is to use WordExtractor, but wrap it in a call to stripFields(String) when calling it. That will remove the text based fields included in the text, things like HYPERLINK that you've seen. Your code would become:

NPOIFSFileSystem fs = new NPOIFSFileSytem(file);
WordExtractor extractor = new WordExtractor(fs.getRoot());

for(String rawText : extractor.getParagraphText()) {
String text = extractor.stripFields(rawText);

The other option is to use Apache Tika. Tika provides text extraction, and metadata, for a wide variety of files, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text of your word document (you can also get XHTML if you'd rather), you'd do something like:

TikaConfig tika = TikaConfig.getDefaultConfig();
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
tika.getParser().parse(input, handler, metadata, new ParseContext());
String text = handler.toString();