John Q Noob - 7 months ago
Javascript Question

Programmatically capture the location & content of all embedded text items within a .PDF file?

I have a series of .pdf documents that each contain many separate instances of embedded text. I need to loop through each instance and programmatically capture two things: (1) the size and location of the rectangle that bounds each instance of text, and (2) the actual text within each of those rectangles.

The goal is to use JS to automatically insert a button over the top of each text item. Each button needs a shape, size, and location that corresponds (at least roughly) to the existing text rectangle, and each button must be named with the exact text string contained in that rectangle.

Can such a thing be done with JS? It seems like it ought to be possible but I definitely don't know enough JS to actually do it.

The .pdf files I am working with are all building floor plans, and each instance of embedded text is the room number of a specific room on that floor.

I have working JS code that creates an arbitrary number of buttons in a loop (assuming I had an array of rectangle definitions and text/names available to size, position, and name each one), but I don't know how to programmatically refer to each embedded text item: neither the size/location of the text rectangle, nor the content of the text itself.
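For reference, the button-creation loop described above might look roughly like the sketch below. It is not the original code: the helper name addButtons and the item shape { page, name, rect } are assumptions made for illustration. The only Acrobat call used is addField, whose rect argument is [upper-left x, upper-left y, lower-right x, lower-right y].

```javascript
// Hypothetical helper: creates one button field per item.
// Each item is assumed to look like { page: 0, name: "101", rect: [ulx, uly, lrx, lry] }.
function addButtons(doc, items) {
  var created = [];
  for (var i = 0; i < items.length; i++) {
    var item = items[i];
    // addField(name, fieldType, pageNumber, rect) returns the new Field object.
    var field = doc.addField(item.name, "button", item.page, item.rect);
    created.push(field);
  }
  return created;
}
```

Inside Acrobat this would be called as addButtons(this, items), where items is the array of rectangle definitions and names.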

There doesn't seem to be a handy function that loops through all instances of embedded text within the document and captures the relevant information, as I could with other objects (say, hyperlinks, using the getLinks method).

This is the final step in implementing a larger project, and creating each of these buttons by hand would be impractical, since the full set of floor plans requires several thousand of them.


I was able to accomplish what I needed using the getPageNthWord and getPageNthWordQuads methods from native Acrobat JavaScript, without resorting to PDF.js or any other language or library. In my case, it was also necessary to use the Matrix2D object to translate between the "quad" coordinates (which capture the location and dimensions of each word) and the "rect" coordinates (which define the size and location of each new button).
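A minimal sketch of that approach follows, for anyone landing here later. It is not the exact code I used: the helper names quadsToRect and collectWords are made up for illustration, and it assumes unrotated pages, so the Matrix2D translation step is left out and each rect is simply the bounding box of the word's quads. It relies on numPages, getPageNumWords, getPageNthWord, and getPageNthWordQuads.

```javascript
// Hypothetical helper: bounding box of a word's quads.
// Each quad is assumed to be a flat array of eight numbers (four x,y corners).
function quadsToRect(quads) {
  var xs = [], ys = [];
  for (var q = 0; q < quads.length; q++) {
    for (var i = 0; i < quads[q].length; i += 2) {
      xs.push(Number(quads[q][i]));
      ys.push(Number(quads[q][i + 1]));
    }
  }
  // Field rects are [upper-left x, upper-left y, lower-right x, lower-right y];
  // y increases upward, so the upper edge is the largest y.
  return [
    Math.min.apply(null, xs), Math.max.apply(null, ys),
    Math.max.apply(null, xs), Math.min.apply(null, ys)
  ];
}

// Hypothetical helper: walks every word on every page and records
// its page index, text, and bounding rect.
function collectWords(doc) {
  var items = [];
  for (var p = 0; p < doc.numPages; p++) {
    var count = doc.getPageNumWords(p);
    for (var w = 0; w < count; w++) {
      // Third argument true asks Acrobat to strip punctuation and whitespace.
      var text = doc.getPageNthWord(p, w, true);
      if (!text) continue;
      var quads = doc.getPageNthWordQuads(p, w);
      items.push({ page: p, name: text, rect: quadsToRect(quads) });
    }
  }
  return items;
}
```

In Acrobat, collectWords(this) yields an array that can be passed to addField one item at a time. On rotated pages the quads would first need to go through the Matrix2D translation mentioned above.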