Tinus Jackson - 2 years ago
HTML Question

Parsing PDF to HTML using PDF2DOM returns null

I am using Pdf2Dom and following the basic example from its documentation. According to that documentation, Pdf2Dom is based on the Apache PDFBox™ library.

File file = new File("file.pdf");
PDDocument pdf = PDDocument.load(file);
PDFDomTree parser = new PDFDomTree();
Document dom = parser.createDOM(pdf);
// Printing the DOM object directly produces the output shown below
System.out.println(dom);

What gets printed out: [#document: null]

I tried the same code with three different PDFs and got the same result each time.

When I strip the same PDF to text, it returns valid text, so the file is not empty. Am I doing something wrong, or is the problem in the library itself?

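Here is the stripper code, in case it helps:

PDDocument pdf = PDDocument.load(pFile);
PDFTextStripper stripper = new PDFTextStripper();
// Pass the document loaded above (the original snippet referenced an undeclared "pd")
String text = stripper.getText(pdf);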

Any advice would be appreciated, thanks in advance.

Answer Source

There is nothing wrong with your code. That output is simply what the toString() method of the Document object returns.

The output [#document: null] consists of two parts. The first part, #document, is the node name: when you parse XML, the top-level node is always a #document node. The second part, null, is the node's value. A document node has no value of its own, so null is printed; it does not mean the document is empty.
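To see where the two parts come from, here is a minimal sketch that uses only the JDK's built-in DOM classes, so no PDF or Pdf2Dom is needed. The class name NodeNameDemo and its helper methods are made up for illustration. The exact text produced by toString() depends on the DOM implementation in use, so the sketch builds the same shape explicitly from getNodeName() and getNodeValue().

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical demo class, not part of Pdf2Dom or PDFBox.
public class NodeNameDemo {

    // Creates an empty DOM document with the JDK's default parser factory.
    static Document newEmptyDocument() throws ParserConfigurationException {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    }

    // Formats a node as "[name: value]", mirroring the output seen in the question.
    static String describe(Node node) {
        return "[" + node.getNodeName() + ": " + node.getNodeValue() + "]";
    }

    public static void main(String[] args) throws Exception {
        Document doc = newEmptyDocument();
        doc.appendChild(doc.createElement("html"));

        // The document node is named "#document" and has no value.
        System.out.println(describe(doc));
        // Element nodes have no value either; their name is the tag name.
        System.out.println(describe(doc.getDocumentElement()));
    }
}
```

Both lines end in null even though the document has a root element, which shows that a null node value says nothing about whether the tree has content.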

If you print dom.getDocumentElement().getTextContent() instead, you should see the text contained in the document.
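If you want the markup rather than just the text, one option is to serialize the DOM with the JDK's javax.xml.transform API. The following is a sketch under assumptions: the class DomToStringDemo and the small hand-built document are hypothetical stand-ins for the Document returned by createDOM, and the output method is set to "xml" explicitly so the result does not depend on the transformer's defaults.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of Pdf2Dom or PDFBox.
public class DomToStringDemo {

    // Builds a tiny document standing in for the one a PDF parser would return.
    static Document buildSampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setTextContent("Hello");
        body.appendChild(p);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    // Serializes any DOM document to a string using an identity transform.
    static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildSampleDocument();
        // Text only, as suggested in the answer.
        System.out.println(doc.getDocumentElement().getTextContent());
        // Full markup of the tree.
        System.out.println(serialize(doc));
    }
}
```

The same serialize method should work on the dom variable from the question, since it only relies on the standard org.w3c.dom.Document interface.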
