Master Dev - 12 days ago
PHP Question

PHP Extract data from PDF in array format

I have the following PDF file, [Marsheet PDF][1], and I am trying to extract the data shown in the example below. I have tried PDFParse, PDFtoText, etc., but they do not work properly. Is there any solution or example?

<?php
// Desired output: something like this (or suggest a better option if you have one)
$data_array = array(
    array(
        "name"          => "Mr Andrew",
        "medicine_name" => "FLUOXETINE 20MG CAPS",
        "description"   => "TAKE ONE ONCE DAILY FOR LOW MOOD. CAUTION:YOUR DRIVING REACTIONS MAY BE IMPAIRED",
        "Dose"          => '0900',
        "StartDate"     => '28/09/15',
        "period"        => '28',
        "Quantity"      => '28'
    ),

    array(
        "name"          => "Mr Andrew",
        "medicine_name" => "SINEMET PLUS 125MG TAB",
        "description"   => "TAKE ONE TABLET FIVE TIMES A DAY FOR PD
(8am,11am,2pm,5pm,8pm)
THIS MEDICINE MAY COLOUR THE URINE. THIS IS
HARMLESS. CAUTION:REACTIONS MAY BE IMPAIRED
WHILST DRIVING OR USING TOOLS OR MACHINES.",
        "Dose"          => '0800,1100,1400,1700,2000',
        "StartDate"     => '28/09/15',
        "period"        => '28',
        "Quantity"      => '140'
    ),
    // etc...
);
?>

Answer

TL;DR You are almost certainly not going to do this with a library alone.

Why it is not easy

The reason is that PDF files contain typesetting primitives, not extractable text. Sometimes the difference is slight enough that you can get by, but output that extracts cleanly usually looks slightly wrong on the page, and therefore the PDF generators that are "best" for text extraction are also the least used.

Some generators exist that embed both the typesetting layer and an invisible text layer, allowing you to see the beautiful text and still have the good text.

Here, you only have the beautiful text, and because of the grid it has to be properly typeset.

So, inside, what there actually is to be read is this:

/R8 12 Tf
0.99941 0 0 1 66 765.2 Tm
[(M)2.51003(r)2.805( )-2.16558(A)-3.39556(n)-4.33056(d)-4.33056(r)2.805(e)-4.33056(w)11.5803( )-2.16558(S)-3.39556(m)-7.49588(e)-4.33117(e)556]TJ
ET

and if you assemble the (s)(i)(n)(g)(l)(e) letters inside, you do get "Mr Andrew Smee", but then you need to know where these letters are positioned relative to the page and to the data grid. You also need to beware of spaces. Above, there is one space between Mr and Andrew, but if you removed the spaces and adjusted the offsets of the following letters, you would still read "Mr Andrew Smee" on the page and save two characters. Some PDF "optimizers" will try to do that.
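As a rough illustration of the point about spaces, here is a minimal sketch that joins the strings of a single TJ array and guesses a space wherever the offset between two strings is a large negative number (in a TJ array the number is subtracted from the horizontal position, so a large negative value opens a gap). The function name `tjToText`, the threshold of 200 and the "optimized" sample string are assumptions made for this example, and escaped parentheses and multi-byte encodings are not handled.

```php
<?php
// Sketch only: rebuild the text of ONE TJ array, guessing spaces from gaps.
// Assumes no escaped parentheses inside the strings and a single-byte encoding.
function tjToText(string $tj, float $gapThreshold = 200.0): string
{
    // Tokens are either (string) operands or numeric offsets between them.
    preg_match_all('/\(([^()\\\\]*)\)|(-?\d*\.?\d+)/', $tj, $tokens, PREG_SET_ORDER);

    $text = '';
    foreach ($tokens as $token) {
        if (isset($token[2]) && $token[2] !== '') {
            // Numeric offset, in thousandths of a text space unit.
            // A large negative value pushes the next glyph to the right.
            if ((float) $token[2] <= -$gapThreshold) {
                $text .= ' ';
            }
        } else {
            $text .= $token[1];
        }
    }
    return $text;
}

// The array from the document: spaces are real ( ) strings.
$original = '[(M)2.51003(r)2.805( )-2.16558(A)-3.39556(n)-4.33056(d)-4.33056(r)2.805(e)-4.33056(w)11.5803( )-2.16558(S)-3.39556(m)-7.49588(e)-4.33117(e)556]TJ';

// Hypothetical "optimized" variant: spaces replaced by offsets.
$optimized = '[(Mr)-278(Andrew)-278(Smee)]TJ';

echo tjToText($original), "\n";
echo tjToText($optimized), "\n";
echo tjToText($optimized, INF), "\n"; // threshold disabled: naive concatenation
```

With the threshold disabled, the optimized variant collapses into one word, which is the behaviour described for libraries that only concatenate the strings.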

And that is why most text extraction libraries, which don't keep track of positions (they work line by line, and by and large they don't care about grids), will give you something like

Mr Andrew Smee 505738 12/04/54 (61

or, in the case of "optimized" texts,

MrAndrewSmee50573812/04/54(61

(which still gives the dangerous illusion of being parsable with a regex -- sometimes it is, sometimes it isn't, and most often it works 95% of the time, so that the remaining 5% turns into a maintenance nightmare from Hell). More importantly, they will not be able to give you the content of the medication details timetable divided by cell.

Any information whose meaning depends on its position (e.g. a name means something different if it's written in the left "From" box or in the right "To" box) will be either lost or variably difficult to reconstruct.
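To make the brittleness concrete, here is a small sketch that runs one regex against the two extracted lines shown above. The pattern is a hypothetical one written for this example; it is not taken from any library.

```php
<?php
// Hypothetical pattern: title, first name, surname, patient number, date.
$pattern = '~^(\w+) (\w+) (\w+) (\d+) (\d{2}/\d{2}/\d{2})~';

$spaced = 'Mr Andrew Smee 505738 12/04/54 (61';
$packed = 'MrAndrewSmee50573812/04/54(61';

// preg_match() returns 1 on a match and 0 when nothing matches.
$spacedOk = preg_match($pattern, $spaced, $fields);
$packedOk = preg_match($pattern, $packed, $ignored);

if ($spacedOk === 1) {
    echo "Patient number: {$fields[4]}, date: {$fields[5]}\n";
}
echo $packedOk === 1 ? "packed line matched\n" : "packed line did not match\n";
```

The same document, run through an "optimizer", no longer matches, and nothing in the extracted text tells you why.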

Trying with most libraries, and why it might work (but probably not)

Libraries such as XPDF (and its wrappers phpxpdf, pdf2html, etc.) will give you a simple call such as this

// open PDF ($pdfToText is assumed to be an already created instance of the wrapper)
$pdfToText->open('PDF-book.pdf');

// PDF text is now in the $text variable
$text = $pdfToText->getText();
$pdfToText->close();

and your "text" will contain everything, and be something like:

...
START DATE START DAY
WEEK 1 WEEK 2 WEEK 3 WEEK 4
DATE 28 29 30 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
19/10/15
Medication Details
Commencing
D.O.B
Doctor
Hour:Dose 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7
Patient
Number
Period
MEDICATION ADMINISTRATION RECORD SHEETS Pharmacy No.
Document No.
02392 731680
28
0900 1
TAKE ONE ONCE DAILY FOR LOW MOOD.
CAUTION:YOUR DRIVING REACTIONS MAY BE IMPAIRED.
28
FLUOXETINE 20MG CAPS
Received Quantity returned quant. by destroyed quant. by

So, reading the above, ask yourself: what is that second 28? Can you tell whether it is the received quantity, the returned quantity, or the destroyed quantity without looking at the PDF? Sure, if there's only one number, chances are that it will be the received quantity. It becomes a bet.

And is 02392 731680 the document number? It looks like it is (it is not).

Notice also that in the PDF, the medicine name is before the notes. In the extracted text, it is after. By looking at the offsets inside the PDF, you understand why, and it's even a good decision -- but looking at the extracted text, it's not so easy.

So, automatic analysis looks enticingly like it can be done, but as I said, it is a very risky business. It is brittle: someone entering the wrong (for you) text somewhere in the document, sometimes even filling the fields out of sequential order, will result in a PDF which is visually correct and, at the same time, inexplicably unparseable. What are you going to tell your users?

Sometimes, a subset of the available information is stable enough for you to get the work done. In that case, XPDF or PDF2HTML, a bunch of regex, and you're home free in half a day. Yay you! Just keep in mind that any "little" addition to the project might then be impossible. Two numbers are added that are well separated in the PDF; are they 128 and 361, or 12 and 8361, or 1283 and 61? All you get in $text is 128361.
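The ambiguity in that last example is easy to enumerate. The sketch below (the function name `possibleSplits` is made up for this illustration) lists every way a run of digits could have been two adjacent numbers; without coordinates, all of them are equally plausible.

```php
<?php
// List every way a digit run could be split into two adjacent numbers.
function possibleSplits(string $digits): array
{
    $splits = [];
    for ($i = 1; $i < strlen($digits); $i++) {
        $splits[] = [substr($digits, 0, $i), substr($digits, $i)];
    }
    return $splits;
}

foreach (possibleSplits('128361') as [$first, $second]) {
    echo "$first and $second\n";
}
```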

One difficult way to do it, which worked for me

But can you do the same thing "by hand"? After all, by looking at the PDF, you know what you are seeing. And, sure, you can. It just won't be quick and easy. You need:

  • a very low level library (or reading the PDF yourself; you just need to uncompress it first, and there are tools for that, e.g. pdftk). You need to recover the text with coordinates. "C" for "hospitalized" is nothing. "C, 495.2, 882.7" tells you of a hospitalization on October 13th, 2015, and that is the information you are after.
  • patience (or a tool) to input the coordinates of the text zones. You need to tell the system which area is October 13th, 2015... as well as all the other days. For example:

    // Cell name       X1   Y1    X2   Y2    Text
    [ 'PatientName',   60,  760,  300, 790,  '' ],
    [ 'PatientNumber', 310, 760,  470, 790,  '' ],
    ...
    [ 'Grid01Y01X01',  90,  1020, 110, 1040, '' ],
    ...
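For the first requirement, text with coordinates, here is a minimal sketch of reading an already uncompressed content stream yourself. It assumes the simplest possible case, the one in the excerpt quoted earlier: each piece of text is positioned by a Tm operator immediately followed by a TJ array, the last two Tm operands are taken as x and y, and no other transformation applies. The function name `extractPositionedText` is made up; real files also use operators such as Td and Tj, which this sketch ignores.

```php
<?php
// Sketch only: find "a b c d x y Tm [ ... ] TJ" pairs in an uncompressed
// content stream and return each text run with its x/y position.
function extractPositionedText(string $stream): array
{
    $number  = '(-?[\d.]+)';
    $pattern = '/' . implode('\s+', array_fill(0, 6, $number))
             . '\s+Tm\s*\[(.*?)\]\s*TJ/s';

    preg_match_all($pattern, $stream, $matches, PREG_SET_ORDER);

    $fragments = [];
    foreach ($matches as $match) {
        // Join the (string) operands; offsets between them are ignored here.
        preg_match_all('/\(([^()\\\\]*)\)/', $match[7], $strings);
        $fragments[] = [
            'text' => implode('', $strings[1]),
            'x'    => (float) $match[5],
            'y'    => (float) $match[6],
        ];
    }
    return $fragments;
}

$stream = <<<'PDF'
/R8 12 Tf
0.99941 0 0 1 66 765.2 Tm
[(M)2.51003(r)2.805( )-2.16558(A)-3.39556(n)-4.33056(d)-4.33056(r)2.805(e)-4.33056(w)11.5803( )-2.16558(S)-3.39556(m)-7.49588(e)-4.33117(e)556]TJ
ET
PDF;

$fragments = extractPositionedText($stream);
print_r($fragments);
```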

Note that a great many of those cell coordinates can be calculated programmatically: once you have the top left corner and know one cell's size, the others are more or less calculable with a very slight error. You needn't input by hand six grids of four weeks with six rows each, seven days per week.
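A sketch of that calculation, using the same cell layout as the hand-written entries. The function name `buildGridCells` is hypothetical, and the 20 by 20 cell size, the 28 columns and the direction in which Y grows from row to row are assumptions for this example; measure them on your own document.

```php
<?php
// Sketch: generate [name, x1, y1, x2, y2, text] entries for one grid
// from its top left corner and the size of a single cell.
function buildGridCells(
    int $grid,
    float $left,
    float $top,
    float $cellWidth,
    float $cellHeight,
    int $rows = 6,
    int $cols = 28   // four weeks of seven days
): array {
    $cells = [];
    for ($row = 1; $row <= $rows; $row++) {
        for ($col = 1; $col <= $cols; $col++) {
            $x1 = $left + ($col - 1) * $cellWidth;
            $y1 = $top + ($row - 1) * $cellHeight; // assumed: Y grows with the row
            $cells[] = [
                sprintf('Grid%02dY%02dX%02d', $grid, $row, $col),
                $x1, $y1, $x1 + $cellWidth, $y1 + $cellHeight,
                '',
            ];
        }
    }
    return $cells;
}

$cells = buildGridCells(1, 90, 1020, 20, 20);
echo count($cells), " cells, first one is {$cells[0][0]}\n";
```

Only the corner and the cell size need to be measured by hand; a slight rounding error per cell is usually tolerable because the text rarely sits on a cell border.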

You can use the same structure to create a PNG with red areas to indicate which cells you've got covered. That will be useful to visually check you did not forget anything.
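One possible way to produce that check image is PHP's GD extension. The sketch below assumes GD is installed, that the cell coordinates can be drawn as they are (image coordinates start at the top left, so you may need to flip Y to match your PDF coordinates), and uses a made-up function name and file name.

```php
<?php
// Sketch: paint every known cell in red on a white page-sized canvas.
// Requires the GD extension. Cells are [name, x1, y1, x2, y2, text].
function renderCellOverlay(array $cells, int $width, int $height, string $pngPath): bool
{
    $image = imagecreatetruecolor($width, $height);
    $white = imagecolorallocate($image, 255, 255, 255);
    $red   = imagecolorallocate($image, 255, 0, 0);

    imagefilledrectangle($image, 0, 0, $width - 1, $height - 1, $white);
    foreach ($cells as [$name, $x1, $y1, $x2, $y2]) {
        imagefilledrectangle($image, (int) $x1, (int) $y1, (int) $x2, (int) $y2, $red);
    }
    return imagepng($image, $pngPath);
}

$sampleCells = [
    ['PatientName',   60, 760, 300, 790, ''],
    ['PatientNumber', 310, 760, 470, 790, ''],
];

if (extension_loaded('gd')) {
    $pngPath = sys_get_temp_dir() . '/cells-check.png'; // hypothetical file name
    renderCellOverlay($sampleCells, 600, 1100, $pngPath);
    echo "Wrote $pngPath\n";
}
```

Laying the resulting PNG over a rendering of the page at the same scale makes missing or misplaced cells obvious.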

At that point you parse the PDF, and every time you find a text at coordinates (x1,y1) you scan all of your cells and determine where the text should be (there are faster ways to do that using XY binary search trees). If you find 'Mr Andrew S' at 66, 765.2 you add it to PatientName. Then you find 'mee' at 109.2, 765.2 and you also add it to PatientName. Which now reads 'Mr Andrew Smee'.
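The lookup step can be sketched as a plain linear scan, which is usually fast enough for a few hundred cells. The function name `addFragment` is made up for this example, and the cells use the same [name, x1, y1, x2, y2, text] layout as above.

```php
<?php
// Sketch: append a text fragment to whichever cell contains its coordinates.
// Returns the cell name, or null when the fragment falls outside every cell.
function addFragment(array &$cells, string $text, float $x, float $y): ?string
{
    foreach ($cells as $index => [$name, $x1, $y1, $x2, $y2]) {
        if ($x >= $x1 && $x <= $x2 && $y >= $y1 && $y <= $y2) {
            $cells[$index][5] .= $text;
            return $name;
        }
    }
    return null;
}

$cells = [
    ['PatientName',   60, 760, 300, 790, ''],
    ['PatientNumber', 310, 760, 470, 790, ''],
];

// The two fragments from the example in the text.
addFragment($cells, 'Mr Andrew S', 66, 765.2);
addFragment($cells, 'mee', 109.2, 765.2);

echo $cells[0][5], "\n";
```

Fragments that return null are worth logging: they usually point to a cell you forgot to define.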

(For very small text there's a slight risk of the letters being output out of order by the PDF driver and corrected through kerning, but usually that's not a problem).

At the end of the whole cycle you will be left with

    [ 'PatientName',    60, 760, 300, 790, 'Mr Andrew Smee' ],
    [ 'PatientNumber',  310, 760, 470, 790, '505738' ],

and so on.

I did this kind of work for a large PDF import project some years back and it worked like a charm. Nowadays, I think most of the heavy lifting could be done with TcLibPDF.

The painful part is recording the grid information by hand the first time. There might be tools for that, or one could whip up an HTML5/AJAX editor using canvases.
