Dr. Vick - 29 days ago
Java Question

PDFBox 2.0.3 Set cropBox using TextPosition coordinates

I've located a region of interest in the page by tracking TextPosition objects using PDFTextStripper, as shown in the example: https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintTextLocations.java

As shown, the TextPosition has been retrieved from fields like

Starting from this example, I kept everything else the same and only set the cropBox of the target page. All the pages end up white, even though the new crop box is set to cover most of the page.


OLD CROPBOX: [0.0,0.0,595.276,841.89] -> NEW CROPBOX [50.0,42.0,592.0,642.0].
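
A note on reading the two boxes above: assuming they were printed with PDRectangle's toString(), the four numbers are the lower-left and upper-right corners, whereas the four-float PDRectangle constructor takes x, y, width and height. The class and method names below are made up for illustration; this is only a minimal sketch of the conversion in plain Java:

```java
import java.util.Arrays;

final class CropBoxNotation {
    /** Converts (x, y, width, height) into corner form [llx, lly, urx, ury]. */
    static float[] toCorners(float x, float y, float width, float height) {
        return new float[] { x, y, x + width, y + height };
    }

    /** Converts corner form [llx, lly, urx, ury] into (x, y, width, height). */
    static float[] toOriginAndSize(float llx, float lly, float urx, float ury) {
        return new float[] { llx, lly, urx - llx, ury - lly };
    }

    public static void main(String[] args) {
        // The new crop box from the question, given as corners.
        float[] box = toOriginAndSize(50f, 42f, 592f, 642f);
        System.out.println(Arrays.toString(box)); // prints [50.0, 42.0, 542.0, 600.0]
    }
}
```

Read this way, the new crop box [50.0,42.0,592.0,642.0] would correspond to new PDRectangle(50f, 42f, 542f, 600f).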

So how can I use the TextPosition coordinates to set the cropBox correctly?

The output at the console is:

Connecting Through the HDMI Port | java.awt.Rectangle[x=42,y=54,width=307,height=7]
For an HDMI connection, we recommend one of the following HDMI cable types: | java.awt.Rectangle[x=42,y=79,width=442,height=5]
●High-Speed HDMI Cable | java.awt.Rectangle[x=51,y=96,width=189,height=5]
●High-Speed HDMI Cable with Ethernet | java.awt.Rectangle[x=51,y=119,width=258,height=5]

The original pdf file I'm processing can be downloaded from here: http://downloadcenter.samsung.com/content/UM/201504/20150407095631744/ENG-US_NMATSCJ-1.103-0330.pdf


Later edit containing the code that doesn't work when exporting the excerpt to image.

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.tools.imageio.ImageIOUtil;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class CropBoxImage {
    static final String destFolder = "/Users/victor/Downloads/";
    static final Float SCALE = 4f;

    public static void main(String[] args) throws IOException {
        Integer pageNum = 12;
        saveToImg(pageNum);
    }

    public static void saveToImg(Integer pageNum) throws IOException {
        Integer dpi = 300;

        String outFilename = String.format("samsung.crop.page.%d.image.jpg", pageNum - 1);
        PDDocument ppdDocument = PDDocument.load(new File(destFolder, "Samsung_TV_UserManual_ENG-US_NMATSCJ-1.103-0330.pdf"));
        PDPage ppage = ppdDocument.getPage(pageNum - 1);
        //commenting out the setCropBox correctly outputs the page to image
        ppage.setCropBox(new PDRectangle(40f, 680f, 510f, 100f));

        // render the (cropped) page; a scale of 1 corresponds to 72 DPI
        PDFRenderer renderer = new PDFRenderer(ppdDocument);
        BufferedImage img = renderer.renderImage(pageNum - 1, SCALE);
        ImageIOUtil.writeImage(img, destFolder + outFilename, dpi);
        ppdDocument.close();
    }
}

mkl

Cropping the page

In a comment the OP reduced his problem to

Ok. Given a java PDRectangle rect = new PDRectangle(40f, 680f, 510f, 100f) obtained from TextLocation how would a java code snippet, that sets the cropBox of a single page look like ? Or how would you do it? TextLocation based rect --> some transformation --> setCropBox(theRightBox).

To set the crop box of page twelve of the given document to the given PDRectangle, you can use code like this:

PDDocument pdDocument = PDDocument.load(resource);
PDPage page = pdDocument.getPage(12-1);
page.setCropBox(new PDRectangle(40f, 680f, 510f, 100f));
pdDocument.save(new File(RESULT_FOLDER, "ENG-US_NMATSCJ-1.103-0330-page12cropped.pdf"));

(SetCropBox.java test method testSetCropBoxENG_US_NMATSCJ_1_103_0330)

Adobe Reader now shows merely this part of page twelve:


Beware, though: the page in question specifies not only a media box (mandatory) and a crop box but also a bleed box and an art box. Thus, applications which consider those boxes more interesting than the crop box might display the page differently. In particular, some applications might consider the art box (defined as "the extent of the page’s meaningful content") important.
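
As for the "TextLocation based rect --> some transformation --> setCropBox" step from the comment, here is a minimal sketch in plain Java. It rests on several assumptions: the rectangles printed in the question use a top-left origin with y growing downwards, the page is not rotated, and its media box starts at (0, 0). The class name, method name and margin parameter are hypothetical, not part of PDFBox.

```java
import java.awt.Rectangle;
import java.util.Arrays;

final class TextRectToCropBox {
    /**
     * Flips a top-left-origin rectangle into bottom-left-origin PDF user space.
     * Returns {x, y, width, height}, the argument order of
     * PDRectangle(float x, float y, float width, float height).
     */
    static float[] toPdfBox(Rectangle textRect, float pageHeight, float margin) {
        float x = textRect.x - margin;
        float yTop = textRect.y - margin;             // upper edge, measured from the page top
        float width = textRect.width + 2 * margin;
        float height = textRect.height + 2 * margin;
        float y = pageHeight - (yTop + height);       // lower edge, measured from the page bottom
        return new float[] { x, y, width, height };
    }

    public static void main(String[] args) {
        // First rectangle from the console output in the question.
        Rectangle heading = new Rectangle(42, 54, 307, 7);
        // 841.89 is the page height taken from the old crop box in the question.
        float[] box = toPdfBox(heading, 841.89f, 0f);
        System.out.println(Arrays.toString(box));
    }
}
```

The four resulting values could then be passed to new PDRectangle(x, y, width, height) and on to setCropBox. For the heading rectangle above, the lower edge lands roughly 780.89 points above the bottom of the page.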

Rendering the cropped page

In a comment to this answer the OP remarked

This is good and works. It correctly saves the page in the PDF file. I've tried to do the same in JPG and failed.

I reduced the OP's code to the essentials:

PDDocument pdDocument = PDDocument.load(resource);
PDPage page = pdDocument.getPage(12-1);
page.setCropBox(new PDRectangle(40f, 680f, 510f, 100f));

PDFRenderer renderer = new PDFRenderer(pdDocument);
BufferedImage img = renderer.renderImage(12 - 1, 4f);
ImageIOUtil.writeImage(img, new File(RESULT_FOLDER, "ENG-US_NMATSCJ-1.103-0330-page12cropped.jpg").getAbsolutePath(), 300);

(SetCropBox.java test method testSetCropBoxImgENG_US_NMATSCJ_1_103_0330)

The result:

Result image

Thus, I cannot reproduce an issue here.

Possible details to check for:

  • ImageIOUtil is not part of the main PDFBox artifact; it is located in pdfbox-tools. Does the version of that artifact match the version of the core pdfbox artifact?
  • I ran the code in an Oracle Java 8 environment; other Java environments might give rise to different results.
  • There are minor differences in our implementations. E.g. I load the PDF via an InputStream, you load it directly from the file system; I have hardcoded the page number, you have it in some variable, ... None of these differences should cause your problem, but who knows...
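
One further sanity check is to compare the size of the rendered image with what the crop box predicts. PDFRenderer.renderImage(pageIndex, scale) treats a scale of 1 as 72 DPI and sizes the image from the page's crop box, while the dpi argument of ImageIOUtil.writeImage only ends up in the image metadata. The sketch below is plain Java with hypothetical names; since the exact rounding may differ between PDFBox versions, it allows one pixel of tolerance.

```java
final class RenderedSizeCheck {
    /** Expected pixel extent of a length in points at the given scale (1 = 72 DPI). */
    static int expectedPixels(float points, float scale) {
        return Math.round(points * scale);
    }

    /** True if the actual extent is within one pixel of the expected one. */
    static boolean matches(int actualPixels, float points, float scale) {
        return Math.abs(actualPixels - expectedPixels(points, scale)) <= 1;
    }

    public static void main(String[] args) {
        // Crop box of 510 x 100 points rendered at scale 4.
        System.out.println(expectedPixels(510f, 4f)); // prints 2040
        System.out.println(expectedPixels(100f, 4f)); // prints 400
    }
}
```

If the image written to disk has roughly these dimensions, the crop box was applied during rendering; dimensions close to the full page at scale 4 would suggest it was not.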