Joe Mornin - 5 months ago
Python Question

Add links in PDF

I have several PDFs that were generated with Microsoft Word. I want to:


  1. Use a regex to find matches in the PDF text.

  2. Convert the matching text to a link that points to an external URL.

  3. Save the new version of the PDF.



If I were doing this in HTML, it would look like this:

<!-- before: -->
This is the text to match.

<!-- after: -->
This is the text to <a href="http://www.match.com/" target="_blank">match</a>.


How can I do this to a PDF?

I'd prefer Python, but I'm open to alternatives.

Edit: I don't have access to the original Word documents. I need to manipulate the PDFs themselves. I'm looking for a technique using a Python PDF library (or something similar in another language).

Edit 2: I understand that the source code of a PDF doesn't contain literal strings. I'm wondering if there's an approach that could do something like: (1) extract the text, (2) find matches, and (3) for each match, draw a clickable box around the position of the text in the original PDF. The closest I've come is PyPDF2's addLink(), but that adds internal links in the PDF, not links to external URLs.

Answer

1. 'regex' approach won't work!

What you 'want' ('use a regex to find matches in the PDF') is not possible for the general case! Plain and simple answer.

Reasons:

For the general case, you cannot use regexes in order to find 'matches' in a PDF text. And I will not even talk about Unicode characters here...

I'll only consider the simple string of text from the example in your question: match.

In PDF source code, this string could be present in different incarnations, depending on the PDF-generating software as well as on the exact font and font encoding being used. The following listing is not complete!

(match) Tj                       # you are lucky
<6d61746368> Tj                  # hex representation of characters
<6d 61 74 63 68> Tj              # hex representation of characters, v2
<6d   61 7463   68> Tj           # hex representation of characters, v3
<6d>Tj <61>   Tj<746368>Tj       # hex representation of characters, v4
....                             # skipping version 5-500000000 of all... 
                                         # ...possible hex representations
(\155\141\164\143\150) Tj        # octal representation of characters
(m\141\164ch) Tj                 # octal/ascii mixed representation of chars
(\155a\164ch) Tj                 # octal/ascii mixed representation of chars, v2
<6d 61>Tj (\164c\150) Tj         # hex/octal/ascii mix
....                             # skipping many more possibilities
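
To make the point concrete, here is a minimal Python sketch (not a real PDF parser) that decodes a few variants of the kind listed above back to the same text, and shows that a regex run against the raw source only finds the lucky one. The function names are made up for this illustration. It assumes an uncompressed content stream, a simple one-byte encoding that happens to agree with Latin-1, and only the Tj operator; it ignores TJ arrays and all escapes other than octal ones.

```python
import re

# Matches "(literal) Tj" or "<hex> Tj"; group 1 is the literal body, group 2 the hex body.
TOKEN_RE = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj|<([0-9A-Fa-f\s]*)>\s*Tj")

def decode_literal(body):
    # Replace each 1-3 digit octal escape with the character it encodes.
    return re.sub(r"\\([0-7]{1,3})", lambda m: chr(int(m.group(1), 8)), body)

def decode_hex(body):
    # Whitespace between hex digits is ignored; an odd final digit is padded with 0.
    digits = "".join(body.split())
    if len(digits) % 2:
        digits += "0"
    return bytes.fromhex(digits).decode("latin-1")

def shown_text(content):
    # Concatenate the operands of every Tj operator found in the fragment.
    parts = []
    for m in TOKEN_RE.finditer(content):
        if m.group(1) is not None:
            parts.append(decode_literal(m.group(1)))
        else:
            parts.append(decode_hex(m.group(2)))
    return "".join(parts)

VARIANTS = [
    r"(match) Tj",
    r"<6d61746368> Tj",
    r"<6d 61 74 63 68> Tj",
    r"<6d>Tj <61>   Tj<746368>Tj",
    r"(\155\141\164\143\150) Tj",
    r"(m\141\164ch) Tj",
    r"<6d 61>Tj (\164c\150) Tj",
]

decoded = [shown_text(v) for v in VARIANTS]
# Variants in which a regex applied to the raw source finds the word.
naive_hits = [v for v in VARIANTS if re.search("match", v)]
```

Every entry of decoded is the same word, while naive_hits contains only the first variant. In other words, you have to decode first and match afterwards, and the decoding is the hard part.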

It gets even more complicated if the font used for the string has a custom encoding (as is the case when the font is embedded in the PDF as a subset, containing only those glyphs which are used in the respective text).

This could mean that what was <6d61746368> Tj above could become <2234567111> Tj with the custom encoded font, but it will still display match on the PDF page.
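
A small sketch of why this defeats any byte-level search: the codes only mean something together with the font's own mapping. The table below is invented for this illustration (in a real PDF the equivalent information would come from the font's /ToUnicode CMap, if it has one), and one-byte character codes are assumed.

```python
# Hypothetical code-to-character table for one subset font (made up for this example).
TO_UNICODE = {0x22: "m", 0x34: "a", 0x56: "t", 0x71: "c", 0x11: "h"}

def decode_with_font(hex_body, to_unicode):
    # Turn the hex string into character codes, assuming one byte per code.
    codes = bytes.fromhex("".join(hex_body.split()))
    # Look each code up in the font's table; unknown codes become U+FFFD.
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)

with_mapping = decode_with_font("2234567111", TO_UNICODE)
# The same bytes read as plain Latin-1, i.e. without the font's mapping.
without_mapping = bytes.fromhex("2234567111").decode("latin-1")
```

With the mapping, the codes decode to the word shown on the page; without it, the same bytes are just unrelated characters, so there is nothing for a regex to match.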


2. Workarounds to achieve similar results may work

  1. You can use pdftotext -layout some.pdf some.txt to create a file containing the text from your PDF. (This does not work reliably. Some PDFs, for example those which are missing a valid /ToUnicode table, will not lend themselves readily to text extraction.)

    This can lead you to the page number for a match.

    Using (with some trial and error) pdftotext -f 33 -l 33 -layout -x X -y Y -W WIDTH -H HEIGHT (the crop area's top-left corner, width and height) can narrow down the location of your match on page 33 more exactly.

    Using pdftotext -layout -bbox -f 33 -l 33 will return the coordinates of the bounding boxes for each word on page 33.

  2. You could use TET, the Text Extraction Toolkit, to find the exact coordinates of matching words too. TET can even give you the coordinates of individual glyphs.

  3. Once you have identified the locations of your matches, you may be able to employ PDFlib to add your links.
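
As a rough sketch of how the workaround could be glued together in Python: run the regex over the words in the pdftotext -bbox output, convert each hit to a PDF rectangle, and build the link annotation for it. Several things here are assumptions. SAMPLE_BBOX is hand-written to resemble the word elements that -bbox emits, not output from a real file. The coordinates are assumed to be measured from the top-left corner of the page, so y is flipped to PDF's bottom-left origin. Only a single page is handled. The function names are made up. The annotation string is the standard PDF form of a link with a URI action, but actually writing it into the file's /Annots array is left to whatever PDF library you use.

```python
import html
import re

# Hand-written sample in the shape of `pdftotext -bbox` output (an assumption).
SAMPLE_BBOX = """<doc>
  <page width="612.000000" height="792.000000">
    <word xMin="72.000000" yMin="74.172000" xMax="92.000000" yMax="85.272000">This</word>
    <word xMin="95.000000" yMin="74.172000" xMax="104.000000" yMax="85.272000">is</word>
    <word xMin="171.000000" yMin="74.172000" xMax="205.000000" yMax="85.272000">match.</word>
  </page>
</doc>"""

PAGE_RE = re.compile(r'<page width="([\d.]+)" height="([\d.]+)">')
WORD_RE = re.compile(
    r'<word xMin="([\d.]+)" yMin="([\d.]+)" xMax="([\d.]+)" yMax="([\d.]+)">(.*?)</word>'
)

def find_link_rects(bbox_xhtml, pattern):
    # Return (x0, y0, x1, y1) rectangles in PDF coordinates for matching words.
    page_height = float(PAGE_RE.search(bbox_xhtml).group(2))
    rects = []
    for x_min, y_min, x_max, y_max, text in WORD_RE.findall(bbox_xhtml):
        if re.search(pattern, html.unescape(text)):
            # Flip y: top-left origin (assumed) to PDF's bottom-left origin.
            rects.append((float(x_min), page_height - float(y_max),
                          float(x_max), page_height - float(y_min)))
    return rects

def uri_link_annotation(rect, url):
    # Build the dictionary for an invisible-border link with a URI action.
    # The URL is inserted as-is; parentheses or backslashes would need escaping.
    x0, y0, x1, y1 = rect
    return (f"<< /Type /Annot /Subtype /Link /Rect [{x0:.2f} {y0:.2f} {x1:.2f} {y1:.2f}] "
            f"/Border [0 0 0] /A << /S /URI /URI ({url}) >> >>")

rects = find_link_rects(SAMPLE_BBOX, r"\bmatch\b")
annotations = [uri_link_annotation(r, "http://www.match.com/") for r in rects]
```

Note that the rectangle covers the whole extracted word, so here the trailing period ends up inside the clickable box. For tighter boxes you would need per-glyph coordinates, which is where a tool like TET comes in.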
