Reg Reg - 3 years ago 275
Java Question

How do I reconcile these text positions and line positions with PDFBox?

I am working with a large document, but I have extracted the page giving trouble here. The y-coordinates I get back for the lines in the table seem to be stretched beyond the coordinates of the text. There seems to be some transformation going on, but I cannot find it. If possible I would like to fix the problem within the scope of the PDFGraphicsStreamEngine as extended below, and not have to go back to the drawing board with the other input streams available in PDFBox.

I have extended PDFTextStripper to acquire the location of every text glyph on the page:

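import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class MyPDFTextStripper extends PDFTextStripper {

    // Every glyph position seen while the text is extracted
    private List<TextPosition> tps;

    public MyPDFTextStripper() throws IOException {
        tps = new ArrayList<>();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions)
            throws IOException {
        tps.addAll(textPositions);
    }

    List<TextPosition> getTps() {
        return tps;
    }
}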


I have also extended PDFGraphicsStreamEngine to extract every line on the page as a Line2D:

import java.awt.geom.GeneralPath;
import java.awt.geom.Line2D;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.pdmodel.PDPage;

public class LineCatcher extends PDFGraphicsStreamEngine
{
    private final GeneralPath linePath = new GeneralPath();
    private List<Line2D> lines;

    LineCatcher(PDPage page)
    {
        super(page);
        lines = new ArrayList<>();
    }

    List<Line2D> getLines() {
        return lines;
    }

    @Override
    public void strokePath() throws IOException
    {
        // Store the stroked path as the diagonal of its bounding box
        Rectangle2D rect = linePath.getBounds2D();
        Line2D line = new Line2D.Double(rect.getX(), rect.getY(),
                rect.getX() + rect.getWidth(),
                rect.getY() + rect.getHeight());
        lines.add(line);
        linePath.reset();
    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {
        linePath.moveTo(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException
    {
        linePath.lineTo(x, y);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException
    {
        return linePath.getCurrentPoint();
    }

    //all other overridden methods can be left empty for the purposes of this problem.
}


I have written a simple program to demonstrate the problem:

import java.awt.geom.Line2D;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.TextPosition;

public class PageAnalysis {
    public static void main(String[] args) {
        try (PDDocument doc = PDDocument.load(new File("onePage.pdf"))) {
            PDPage page = doc.getPage(0);

            MyPDFTextStripper ts = new MyPDFTextStripper();
            ts.getText(doc);
            List<TextPosition> tps = ts.getTps();

            // Collect the distinct integer y coordinates of all glyphs
            System.out.println("Y coordinates in text:");
            Set<Integer> ySet = new HashSet<>();
            for (TextPosition tp : tps) {
                ySet.add((int) tp.getY());
            }
            List<Integer> yList = new ArrayList<>(ySet);
            Collections.sort(yList);
            for (int y : yList) {
                System.out.print(y + "\t");
            }
            System.out.println();

            // Collect the distinct integer y coordinates of all stroked lines
            System.out.println("Y coordinates in lines:");
            LineCatcher lineCatcher = new LineCatcher(page);
            lineCatcher.processPage(page);
            List<Line2D> lines = lineCatcher.getLines();
            ySet = new HashSet<>();
            for (Line2D line : lines) {
                ySet.add((int) line.getY2());
            }
            yList = new ArrayList<>(ySet);
            Collections.sort(yList);
            for (int y : yList) {
                System.out.print(y + "\t");
            }
            System.out.println();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}


The output from this is:

Y coordinates in text:
66 79 106 118 141 153 171 189 207 225 243 261 279 297 315 333 351 370 388 406 424 442 460 478 496 514 780
Y coordinates in lines:
322 340 358 376 394 412 430 448 466 484 502 520 538 556 574 593 611 629 647 665 683 713


The last number in the text list corresponds to the y-coordinate of the page number at the bottom of the page. I cannot work out what is happening to the y-coordinates of the lines, though they seem to be the ones that have been transformed (the media box is the same here as it was for the text, and it fits in with the text positions). The current transformation matrix also has a y scaling of 1.0.

mkl mkl
Answer Source

Indeed, the PDFTextStripper has the bad habit of transforming coordinates into a very un-PDF'ish coordinate system, one with the origin in the upper left of the page and y coordinates increasing downwards.

For a TextPosition tp, therefore, you should not use

tp.getY()

but instead

tp.getTextMatrix().getTranslateY()

Unfortunately, these coordinates may still be translated, even though they are closer to the actual PDF default coordinate system (cf. this answer): they are still transformed to have the origin in the lower left corner of the crop box.

Thus, you really need something like this:

tp.getTextMatrix().getTranslateY() + cropBox.getLowerLeftY()

where cropBox is the PDRectangle retrieved as

PDRectangle cropBox = doc.getPage(n).getCropBox();

where in turn n is the number of the page with that content.
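
To make the relationship between the two coordinate systems concrete, here is a minimal sketch in plain Java arithmetic, with no PDFBox calls. The class and method names are hypothetical. It assumes an unrotated page, for which tp.getY() is the crop box height minus the text matrix translateY. The crop box origin of 0 and height of 842 (A4) used in main are assumptions as well, since the question does not state the page size.

```java
/**
 * Illustrative helper with hypothetical names; it is not part of PDFBox.
 * All values are in PDF user space units.
 */
final class YCoordinates {

    /**
     * Text y in the PDF default coordinate system: the text matrix
     * translateY shifted back by the crop box lower-left y.
     */
    static double textYInPdfSpace(double textMatrixTranslateY, double cropBoxLowerLeftY) {
        return textMatrixTranslateY + cropBoxLowerLeftY;
    }

    /**
     * Converts a y in the PDF default coordinate system (origin at the
     * bottom, increasing upwards) to a y measured downwards from the top
     * edge of the crop box. Assumes an unrotated page.
     */
    static double toTopDownY(double pdfY, double cropBoxLowerLeftY, double cropBoxHeight) {
        return cropBoxHeight - (pdfY - cropBoxLowerLeftY);
    }

    public static void main(String[] args) {
        double cropBoxLowerLeftY = 0.0; // assumption: crop box starts at the page origin
        double cropBoxHeight = 842.0;   // assumption: A4 page height, not given in the question

        // A few of the line y coordinates printed in the question
        double[] lineYs = {322, 340, 358};
        for (double y : lineYs) {
            System.out.println(y + " -> " + toTopDownY(y, cropBoxLowerLeftY, cropBoxHeight));
        }
    }
}
```

If those assumptions hold, the line at y = 322 lands at 520 in the top-down system, just below the text baseline reported at 514, which is where a rule under the last table row would plausibly sit. Converting the text to PDF coordinates, as suggested above, is the cleaner direction. The reverse conversion is shown only because it can be checked against the numbers in the question.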
