I'm trying to extract text from PDF documents. I've tested several tools.
I can only speak for PDFTextStream, but to understand how it works, you first need a rough idea of how PDFTextStream looks at a PDF document.
Each document is made up of Pages, which are made up of Blocks (of which there can be many, and which can be nested). Blocks ultimately contain Lines, which contain the individual units of text.
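To make the nesting concrete, here is a minimal sketch of that hierarchy in Python. It is not PDFTextStream's API: the class names, the field names (`x`, `y`, `width`, `height`, `children`, `blocks`) and the helper `iter_lines` are all hypothetical and only mirror the description above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Region:
    # Position and size shared by every unit (hypothetical field names).
    x: float
    y: float
    width: float
    height: float


@dataclass
class Line(Region):
    text: str = ""


@dataclass
class Block(Region):
    # A block holds lines and/or further nested blocks.
    children: List[Union["Block", Line]] = field(default_factory=list)


@dataclass
class Page(Region):
    blocks: List[Block] = field(default_factory=list)


def iter_lines(block: Block) -> Iterator[Line]:
    """Yield every Line under a block, descending into nested blocks."""
    for child in block.children:
        if isinstance(child, Block):
            yield from iter_lines(child)
        else:
            yield child


# A page with one block that contains a line and a nested block.
page = Page(0, 0, 612, 792, blocks=[
    Block(50, 600, 500, 100, children=[
        Line(50, 680, 500, 12, "Heading"),
        Block(50, 600, 500, 60, children=[
            Line(50, 640, 500, 12, "Nested body text"),
        ]),
    ]),
])

texts = [line.text for block in page.blocks for line in iter_lines(block)]
```

Because blocks can nest to any depth, the walk is recursive; a flat loop over `page.blocks` alone would miss the nested line.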
Each of these units has coordinates and dimensions, including a height property. A PDF is nothing more than these basic units laid out according to their coordinates. When you ask PDFTextStream to "read" a page, or a region, it looks at the objects and how they are laid out on the X, Y plane, and uses an approximation of how that would translate to text. This is why you get errors: there's no 100% foolproof way to turn this structure into machine-readable, structured data.
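As a rough illustration of why this is only an approximation, here is a small Python sketch that turns positioned fragments into text. It is not PDFTextStream's algorithm (which is not public). The `TextUnit` class, the `read_text` function and the `y_tolerance` and `gap` thresholds are assumptions made up for this example, and y is assumed to grow upward from the bottom of the page.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TextUnit:
    # Hypothetical positioned fragment of text.
    text: str
    x: float
    y: float
    width: float
    height: float


def read_text(
    units: List[TextUnit],
    region: Optional[Tuple[float, float, float, float]] = None,
    y_tolerance: float = 3.0,
    gap: float = 2.0,
) -> str:
    """Approximate reading order from coordinates alone."""
    if region is not None:
        # Keep only units that lie fully inside (x, y, width, height).
        rx, ry, rw, rh = region
        units = [
            u for u in units
            if u.x >= rx and u.x + u.width <= rx + rw
            and u.y >= ry and u.y + u.height <= ry + rh
        ]

    # Top of the page first (largest y), then left to right.
    ordered = sorted(units, key=lambda u: (-u.y, u.x))

    # Units whose baselines are within y_tolerance share a row.
    rows: List[List[TextUnit]] = []
    for unit in ordered:
        if rows and abs(rows[-1][0].y - unit.y) <= y_tolerance:
            rows[-1].append(unit)
        else:
            rows.append([unit])

    lines = []
    for row in rows:
        row.sort(key=lambda u: u.x)
        parts = [row[0].text]
        for prev, cur in zip(row, row[1:]):
            # A horizontal gap wider than `gap` is treated as a space.
            if cur.x - (prev.x + prev.width) > gap:
                parts.append(" ")
            parts.append(cur.text)
        lines.append("".join(parts))
    return "\n".join(lines)
```

Both thresholds are guesses. Set `y_tolerance` too low and a slightly raised word splits into its own line; set it too high and two tightly spaced lines merge. Multi-column pages break the simple top-to-bottom sort entirely, which is the kind of error described above.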
With PDFTextStream, you should look at the getRegionText function and example. PDFTextStream is proprietary (which is why I'm moving to PDFBox), so I can't give you details about the algorithms it uses to fetch the text, but they're based on the oversimplified model above.