Aleksey Bykov - 3 years ago
Java Question

Apache POI - converting *.doc to *.html with images

I have a .doc document that contains an image. How can I convert it to *.html so that the image is preserved?

I used the example from this topic: "Convert Word doc to HTML programmatically in Java".

But the image is lost.
Here is the converter I use:

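import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.swing.JEditorPane;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.w3c.dom.Document;

public class Converter {
    private final File docFile;
    private File file;

    public Converter(File docFile) {
        this.docFile = docFile;
    }

    public void convert(File file) {
        this.file = file;

        try (FileInputStream finStream = new FileInputStream(docFile)) {
            HWPFDocument doc = new HWPFDocument(finStream);
            Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument);
            // Walk the Word document and build the HTML DOM (images are skipped by default).
            wordToHtmlConverter.processDocument(doc);

            // Serialize the DOM to an HTML string.
            StringWriter stringWriter = new StringWriter();
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.transform(
                    new DOMSource(wordToHtmlConverter.getDocument()),
                    new StreamResult(stringWriter));

            String html = stringWriter.toString();

            // Write the HTML to the target file as UTF-8.
            try (BufferedWriter out = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
                out.write(html);
            } catch (IOException e) {
                e.printStackTrace();
            }

            // Show the generated file in a simple Swing viewer.
            JEditorPane editorPane = new JEditorPane();
            editorPane.setEditable(false);
            editorPane.setPage(file.toURI().toURL());

            JScrollPane scrollPane = new JScrollPane(editorPane);
            JFrame f = new JFrame("Display Html File");
            f.getContentPane().add(scrollPane);
            f.setSize(512, 342);
            f.setVisible(true);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}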

The WordToHtmlConverter Javadoc says:

"This implementation doesn't create images or links to them. This can be changed by overriding AbstractWordConverter.processImage(Element, boolean, Picture) method"

Are there alternatives, or examples of converters that support images?

Answer

Your best bet in this case is to use Apache Tika, and let it wrap Apache POI for you. Apache Tika will generate HTML for your document (or plain text, but you want the HTML for your case). Along with that, it'll put in placeholders for embedded resources, img tags for embedded images, and provide you with a way to get at the contents of the embedded resources and images.
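To make the last point a bit more concrete: as far as I understand it, Tika hands each embedded resource to a callback as a stream together with metadata such as a file name. The sketch below leaves the Tika wiring out entirely and only shows what such a callback might do with one image, using plain JDK classes. EmbeddedImageStore, its save method and the "images" directory are hypothetical names of mine, not part of Tika or POI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not Tika API): stores embedded images next to the HTML output.
class EmbeddedImageStore {
    private final Path imageDir;

    EmbeddedImageStore(Path imageDir) {
        this.imageDir = imageDir;
    }

    /**
     * Writes one embedded resource into imageDir and returns a relative
     * path ("<dir name>/<file name>") that an img src attribute could use.
     */
    String save(String resourceName, InputStream content) {
        // Keep only the last path segment so a name like "../x.png" cannot leave imageDir.
        String safeName = Paths.get(resourceName).getFileName().toString();
        try {
            Files.createDirectories(imageDir);
            Files.copy(content, imageDir.resolve(safeName), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return imageDir.getFileName() + "/" + safeName;
    }
}
```

Usage would be something like new EmbeddedImageStore(outputDir.resolve("images")).save(name, stream), where outputDir is the directory of the generated HTML file and name and stream come from whatever Tika passes to your callback.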

There's a very good example of doing this included in Alfresco: HTMLRenderingEngine. You'll likely want to review the code there, then write your own to do something very similar. The code there includes a custom ContentHandler which allows editing of the img tags to re-write their src attributes. You may or may not need that, depending on where you're going to write the images out to.
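For illustration only, here is a minimal sketch of that kind of src-rewriting handler, written against the JDK's SAX classes rather than copied from Alfresco. ImgSrcRewriter and its rewrite helper are hypothetical names, and the "embedded:" placeholder prefix is my assumption about what the generated img tags look like, so check your actual output before relying on it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical handler: forwards all SAX events downstream, rewriting img src on the way.
class ImgSrcRewriter extends XMLFilterImpl {
    private final String prefix;    // assumed placeholder prefix, e.g. "embedded:"
    private final String imageBase; // where the images were written, e.g. "images/"

    ImgSrcRewriter(ContentHandler downstream, String prefix, String imageBase) {
        setContentHandler(downstream);
        this.prefix = prefix;
        this.imageBase = imageBase;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        String name = localName.isEmpty() ? qName : localName;
        if ("img".equalsIgnoreCase(name)) {
            AttributesImpl copy = new AttributesImpl(atts);
            int i = copy.getIndex("src");
            if (i >= 0 && copy.getValue(i).startsWith(prefix)) {
                // Replace the placeholder prefix with the real image location.
                copy.setValue(i, imageBase + copy.getValue(i).substring(prefix.length()));
            }
            atts = copy;
        }
        super.startElement(uri, localName, qName, atts);
    }

    /** Demo helper: runs well-formed XHTML through the rewriter and returns the result. */
    static String rewrite(String xhtml, String prefix, String imageBase) {
        try {
            SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler serializer = tf.newTransformerHandler();
            serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.setResult(new StreamResult(out));

            SAXParserFactory pf = SAXParserFactory.newInstance();
            pf.setNamespaceAware(true);
            XMLReader reader = pf.newSAXParser().getXMLReader();
            reader.setContentHandler(new ImgSrcRewriter(serializer, prefix, imageBase));
            reader.parse(new InputSource(new StringReader(xhtml)));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not rewrite markup", e);
        }
    }
}
```

In a real setup you would not re-parse a string as the rewrite helper does. Instead you would pass an ImgSrcRewriter wrapping your HTML-producing handler as the ContentHandler that receives the parser's events, so the src values are fixed as they stream through.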
