I have several PDFs that were generated with Microsoft Word. I want to:
<!-- before: -->
This is the text to match.
<!-- after: -->
This is the text to <a href="http://www.match.com/" target="_blank">match</a>.
addLink()
What you 'want' ('use regex to find matches in PDF') is not possible! Plain and simple answer.
Reasons:
For the general case, you cannot use regexes in order to find 'matches' in a PDF text. And I will not even talk about Unicode characters here...
I'll only consider the simple string of text from the example in your question: match.
In PDF source code, this string could be present in different incarnations, depending on the PDF generating software as well as on the exact font and font encoding being used. The following listing is not complete!
(match) Tj # you are lucky
<6d61746368> Tj # hex representation of characters
<6d 61 74 63 68> Tj # hex representation of characters, v2
<6d 61 7463 68> Tj # hex representation of characters, v3
<6d>Tj <61> Tj<746368>Tj # hex representation of characters, v4
.... # skipping version 5-500000000 of all...
# ...possible hex representations
(\155\141\164\143\150) Tj # octal representation of characters
(m\141\164ch) Tj # octal/ascii mixed representation of chars
(\155a\164ch) Tj # octal/ascii mixed representation of chars, v2
<6d 61>Tj (\164c\150) Tj # hex/octal/ascii mix
.... # skipping many more possibilities
It gets even more complicated if the font the string uses has a custom encoding (as is the case when the font is embedded in the PDF as a subset, containing only those glyphs which are used in the respective text).
This could mean that what was <6d61746368> Tj above could become <2234567111> Tj with the custom encoded font, but it will still display match on the PDF page.
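To make the point concrete, here is a minimal Python sketch that decodes a few of the string operands listed above back to the same text. The helper names (`decode_hex_string`, `decode_literal_string`, `decode_operands`) are made up for this illustration. It assumes a plain single-byte encoding and handles only octal escapes in literal strings, so it is not a real content stream parser.

```python
import re

# Matches a hex string <...> or a literal string (...) without nested parentheses.
STRING_TOKEN = re.compile(r"<[0-9A-Fa-f\s]*>|\((?:\\.|[^\\()])*\)")

def decode_hex_string(token: str) -> bytes:
    """Decode a hex string such as '<6d 61 7463 68>' (whitespace is ignored)."""
    digits = re.sub(r"\s+", "", token[1:-1])
    if len(digits) % 2:
        digits += "0"  # an odd trailing digit is padded with 0
    return bytes.fromhex(digits)

def decode_literal_string(token: str) -> bytes:
    """Decode a literal string such as '(m\\141\\164ch)'; octal escapes only."""
    body = token[1:-1]
    unescaped = re.sub(r"\\([0-7]{1,3})", lambda m: chr(int(m.group(1), 8)), body)
    return unescaped.encode("latin-1")

def decode_operands(content: str) -> str:
    """Concatenate all string operands found in a snippet of content stream."""
    out = b""
    for token in STRING_TOKEN.findall(content):
        if token.startswith("<"):
            out += decode_hex_string(token)
        else:
            out += decode_literal_string(token)
    # Assumption: bytes map 1:1 to characters (no custom font encoding).
    return out.decode("latin-1")

for snippet in [r"(match) Tj",
                r"<6d 61 7463 68> Tj",
                r"<6d>Tj <61> Tj<746368>Tj",
                r"<6d 61>Tj (\164c\150) Tj"]:
    print(snippet, "->", decode_operands(snippet))
```

A regex for match applied to the raw source would hit only the first snippet, although all four display the same word. With a subset font and custom encoding, even this decoding step yields meaningless bytes unless the font's mapping is taken into account.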
You can use pdftotext -layout some.pdf some.txt to create a file containing the text from your PDF. (This does not work reliably. Some PDFs, for example those which are missing a valid /ToUnicode table, will not lend themselves readily to text extraction.)
This can lead you to the page number for a match.
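As a rough sketch of that step in Python: pdftotext separates pages with a form feed character, so the extracted text can be split per page and searched with a regex. The function names `extract_text` and `pages_matching` are invented for this example, and it assumes pdftotext is on your PATH and that the PDF allows text extraction at all.

```python
import re
import subprocess

def extract_text(pdf_path: str) -> str:
    """Run pdftotext -layout and return the text ('-' sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")

def pages_matching(text: str, pattern: str) -> list[int]:
    """Return the 1-based page numbers whose extracted text matches pattern."""
    pages = text.split("\f")  # pdftotext emits a form feed at each page break
    return [number for number, page in enumerate(pages, start=1)
            if re.search(pattern, page)]

# Usage (hypothetical file name):
#   print(pages_matching(extract_text("some.pdf"), r"\bmatch\b"))
```

Here the regex runs on the extracted text, not on the PDF source, which is why it can work where a search over the raw file does not.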
Using (with some trial'n'error) pdftotext -f 33 -l 33 -layout -x NN -y MM -W NN -H MM can narrow down the location of your match on page 33 more exactly.
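The trial and error can be scripted. The sketch below only builds the pdftotext command line for a crop rectangle and runs it; `crop_args` and `text_in_region` are hypothetical helper names, and the coordinates you pass in are whatever integer values you are currently trying.

```python
import subprocess

def crop_args(pdf_path: str, page: int, x: int, y: int,
              width: int, height: int) -> list[str]:
    """Build a pdftotext call limited to one page and one crop rectangle."""
    return [
        "pdftotext", "-layout",
        "-f", str(page), "-l", str(page),    # first and last page
        "-x", str(x), "-y", str(y),          # top-left corner of the crop area
        "-W", str(width), "-H", str(height), # crop area width and height
        pdf_path, "-",                       # '-' writes the text to stdout
    ]

def text_in_region(pdf_path: str, page: int, x: int, y: int,
                   width: int, height: int) -> str:
    """Return the text pdftotext finds inside the given rectangle."""
    result = subprocess.run(crop_args(pdf_path, page, x, y, width, height),
                            check=True, capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")

# Usage (hypothetical values): check whether the match is in the top half
#   "match" in text_in_region("some.pdf", 33, 0, 0, 612, 396)
```

Halving the rectangle on each attempt is one way to close in on the match with few calls.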
Using pdftotext -layout -bbox -f 33 -l 33 will return the coordinates of the bounding boxes for each word on page 33.
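A small Python sketch for picking the matching words out of that output follows. It assumes the -bbox output contains elements of the form `<word xMin="…" yMin="…" xMax="…" yMax="…">text</word>` with the attributes in that order, which is what the poppler versions I have seen emit; check your own output before relying on it. `WordBox` and `find_word_boxes` are names made up for this example.

```python
import html
import re
from typing import NamedTuple

class WordBox(NamedTuple):
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

# Assumed attribute order: xMin, yMin, xMax, yMax.
WORD_RE = re.compile(
    r'<word xMin="([\d.]+)" yMin="([\d.]+)" '
    r'xMax="([\d.]+)" yMax="([\d.]+)">(.*?)</word>'
)

def find_word_boxes(bbox_output: str, pattern: str) -> list[WordBox]:
    """Return the bounding boxes of all words whose text matches pattern."""
    boxes = []
    for x0, y0, x1, y1, raw in WORD_RE.findall(bbox_output):
        text = html.unescape(raw)     # undo &amp;, &lt; and similar escapes
        if re.search(pattern, text):  # words may carry punctuation: "match."
            boxes.append(WordBox(text, float(x0), float(y0),
                                 float(x1), float(y1)))
    return boxes
```

Note that the box covers the whole word as pdftotext sees it, including attached punctuation. The coordinates also appear to be measured from the top-left corner of the page, so they may need flipping before they are used for a link rectangle.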
You could also use TET, the Text Extraction Toolkit, to find the exact coordinates of matching words. TET can even give you the coordinates of individual glyphs.
Once you have identified the locations of your matches, you may be able to employ PDFlib to add your links.