user3125591 - 6 months ago
Java Question

PDFBox: extracting images from pdf to inputstream

I am using PDFBox to extract the images from my PDF (which contains only JPEGs).

Since I will save those images in my database, I would like to convert each image directly to an InputStream object without temporarily writing a file to my file system. However, I am having difficulties with this. I think the problem is caused by my use of

image.getPDStream().createInputStream()
as in the following example:

while (imageIter.hasNext()) {
    String key = (String) imageIter.next();
    PDXObjectImage image = (PDXObjectImage) images.get(key);

    FileOutputStream output = new FileOutputStream(new File(
            "C:\\Users\\Anton\\Documents\\lol\\test.jpg"));
    InputStream is = image.getPDStream().createInputStream(); //this gives me a corrupt file
    byte[] buffer = new byte[1024];
    while (is.read(buffer) > 0) {
        output.write(buffer);
    }
}


However this works:

while (iter.hasNext()) {
    PDPage page = (PDPage) iter.next();
    PDResources resources = page.getResources();
    Map<String, PDXObject> images = resources.getXObjects();
    if (images != null) {
        Iterator<?> imageIter = images.keySet().iterator();
        while (imageIter.hasNext()) {
            String key = (String) imageIter.next();
            PDXObjectImage image = (PDXObjectImage) images.get(key);
            image.write2file(new File("C:\\Users\\Anton\\Documents\\lol\\test.jpg")); //this works however
        }
    }
}


Any idea how I can convert each PDXObjectImage (or any other object I can get) to an InputStream?

Answer

In PDFBox 1.8, the easiest way is to use write2OutputStream(), so your first code block would now look like this:

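while (imageIter.hasNext()) {
    String key = (String) imageIter.next();
    PDXObjectImage image = (PDXObjectImage) images.get(key);

    FileOutputStream output = new FileOutputStream(new File(
            "C:\\Users\\Anton\\Documents\\lol\\test.jpg"));
    try {
        // let PDFBox write the image data to the stream
        image.write2OutputStream(output);
    } finally {
        output.close(); // release the file handle even if writing fails
    }
}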
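
Since the goal is an InputStream rather than a file, the same call can write into an in-memory buffer instead. Below is a minimal sketch, assuming the images are small enough to hold in memory. `ImageWriter` and `toInputStream` are hypothetical names that are not part of PDFBox; they stand in for a call such as `image.write2OutputStream(out)` so that the snippet runs without the library:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class InMemoryImage {
    /** Hypothetical stand-in for anything that writes image bytes to a stream. */
    public interface ImageWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Collects the written bytes in memory and exposes them as an InputStream. */
    public static InputStream toInputStream(ImageWriter writer) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writer.writeTo(buffer);
        return new ByteArrayInputStream(buffer.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        // With PDFBox 1.8 the lambda body would be: image.write2OutputStream(out)
        InputStream is = toInputStream(out -> out.write(new byte[] {(byte) 0xFF, (byte) 0xD8}));
        System.out.println(is.available() + " bytes available");
    }
}
```

The resulting stream can then be handed to whatever API stores the image in the database, with no temporary file involved.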

An advanced solution, as long as you are really sure that you have only JPEGs that display properly, i.e. that have no unusual colorspace:

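// Assumption: DCT_FILTERS is a List<String> of the DCT filter names that you define yourself,
// e.g. Arrays.asList("DCTDecode", "DCT")
while (imageIter.hasNext()) {
    String key = (String) imageIter.next();
    PDXObjectImage image = (PDXObjectImage) images.get(key);

    FileOutputStream output = new FileOutputStream(new File(
            "C:\\Users\\Anton\\Documents\\lol\\test.jpg"));
    InputStream is = image.getPDStream().getPartiallyFilteredStream(DCT_FILTERS);
    try {
        byte[] buffer = new byte[1024];
        int n;
        while ((n = is.read(buffer)) != -1) {
            output.write(buffer, 0, n); // write only the bytes actually read
        }
    } finally {
        is.close();
        output.close();
    }
}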

The second solution removes all filters except the DCT (= JPEG) filter, so the stream you read still contains the JPEG-encoded data. Some older PDFs use several filters, e.g. ASCII85 and DCT.
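
When copying a stream by hand, note that `read` may fill only part of the buffer, so the loop has to write just the bytes that were read; writing the whole buffer each time leaves stale bytes in the output. A minimal sketch using only the standard library (`copy` is a hypothetical helper name, and a byte array stands in for the image stream):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {
    /** Copies everything from in to out and returns the number of bytes copied. */
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[1024];
        long total = 0;
        int n;
        // read() returns -1 at end of stream, otherwise the number of bytes placed in buffer
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // write only the bytes just read
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1500]; // deliberately not a multiple of the buffer size
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), out);
        System.out.println(copied + " bytes copied, " + out.size() + " bytes written");
    }
}
```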

Even if you created the PDF from JPEGs, you don't know what your PDF creation software did with them. One way to find out what type an image is, is to check its class (use instanceof):

- PDPixelMap => PNG
- PDJpeg => JPEG
- PDCcitt => TIF

Another way is to use image.getSuffix().