WiredC0der - 4 months ago
Java Question

Apache PDFBOX - getting java.lang.OutOfMemoryError when using split(PDDocument document)

I am trying to split a sizeable document of roughly 300 pages using the Apache PDFBox API 2.0.2.
I get an error when splitting the PDF file into single pages with the following code:

PDDocument document = PDDocument.load(inputFile);
Splitter splitter = new Splitter();
List<PDDocument> splittedDocuments = splitter.split(document); //Exception happens here

I receive the following exception:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

This indicates that the GC is spending an excessive share of its time trying to clear the heap while reclaiming very little memory.

There are numerous JVM tuning methods that could work around the situation, but all of them only treat the symptom and not the real issue.

One final note: I am using JDK 6, so the new Java 8 Consumer is not an option in my case. Thanks.


This is not a duplicate of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2 because:

1. I do not have the size problem mentioned in the aforementioned
topic. I am slicing a 270-page, 13.8MB PDF file, and after slicing
the size of each slice is an average of 80KB with total size of
2. The split throws the exception before it even returns the split parts.

I found that the split succeeds as long as I do not pass the whole document at once; instead I pass it in batches of 20-30 pages each, which does the job.
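
For reference, the page ranges for such batches come down to plain arithmetic. The snippet below is only a sketch: the names BatchRanges and batchRanges are made up for illustration, and it assumes 1-based page numbers with an inclusive end page, which, as far as I can tell, is how Splitter's setStartPage and setEndPage behave in PDFBox 2.0. It sticks to Java 6 syntax.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class BatchRanges {

    /**
     * Returns 1-based, inclusive {start, end} page ranges that cover
     * pages 1..totalPages in chunks of at most batchSize pages.
     */
    static List<int[]> batchRanges(int totalPages, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> ranges = new ArrayList<int[]>();
        for (int start = 1; start <= totalPages; start += batchSize) {
            // The last batch may be shorter than batchSize.
            int end = Math.min(start + batchSize - 1, totalPages);
            ranges.add(new int[] { start, end });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 270 pages in batches of 30, matching the sizes mentioned above.
        for (int[] range : batchRanges(270, 30)) {
            System.out.println("pages " + range[0] + " to " + range[1]);
        }
    }
}
```

Each pair could then be handed to a splitter restricted to that range, so that only one batch of page documents is held in memory at a time.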


PDFBox stores the parts resulting from the split operation on the heap as objects of type PDDocument, so the heap fills up fast. Even if you call close() after every iteration of the loop, the GC cannot reclaim heap space as quickly as it gets filled.
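
If you want to check this on your own file, one rough way is to print heap usage before and after each split. This is a sketch using only the standard Runtime API; HeapLog and its methods are hypothetical names, and the allocation in main is just a stand-in for the real split call. The numbers are approximate, since the JVM reports whatever it has not yet collected.

```java
// Hypothetical helper for eyeballing heap growth; Java 6 compatible.
class HeapLog {

    /** Bytes of heap currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Prints used and maximum heap in megabytes with a caller-supplied label. */
    static void log(String label) {
        long mb = 1024L * 1024L;
        long used = usedHeapBytes() / mb;
        long max = Runtime.getRuntime().maxMemory() / mb;
        System.out.println(label + ": used " + used + " MB of " + max + " MB");
    }

    public static void main(String[] args) {
        log("before");
        // Stand-in for real work: hold on to some arrays so usage grows.
        byte[][] hold = new byte[16][];
        for (int i = 0; i < hold.length; i++) {
            hold[i] = new byte[1024 * 1024];
        }
        log("after allocating " + hold.length + " MB");
    }
}
```

Calling HeapLog.log(...) around each batch should show whether usage levels off between batches or keeps climbing.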

One option is to perform the split in batches, where each batch is a relatively manageable chunk (10 to 40 pages):

public void execute() {
    // Placeholder path; point this at the PDF to split.
    File inputFile = new File("path/to/the/file.pdf");
    PDDocument document = null;
    try {
        document = PDDocument.load(inputFile);

        int batchSize = 50;
        int totalPages = document.getNumberOfPages();
        int batch = 1;
        // Splitter page numbers are 1-based and the end page is inclusive,
        // so each batch covers [start, end] with no overlap between batches.
        // The last batch is simply shorter when totalPages is not a multiple
        // of batchSize.
        for (int start = 1; start <= totalPages; start += batchSize) {
            int end = Math.min(start + batchSize - 1, totalPages);
            System.out.println("Batch: " + batch++ + " start: " + start + " end: " + end);
            split(document, start, end);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // close the document
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

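private void split(PDDocument document, int start, int end) throws IOException {
    List<File> fileList = new ArrayList<File>();
    Splitter splitter = new Splitter();
    // Restrict this pass to pages start..end (1-based, inclusive);
    // without these calls every batch would split the whole document again.
    splitter.setStartPage(start);
    splitter.setEndPage(end);
    List<PDDocument> splittedDocuments = splitter.split(document);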
    String outputPath = Config.INSTANCE.getProperty("outputPath");
    PDFTextStripper stripper = new PDFTextStripper();

    for (int index = 0; index < splittedDocuments.size(); index++) {
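        // Separators keep names unique: without them index 1/start 51 and index 15/start 1 both yield "151".
        String pdfFullPath = document.getDocumentInformation().getTitle() + "_" + index + "_" + start + ".pdf";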
        PDDocument splittedDocument = splittedDocuments.get(index);