Jonas Johansson - 3 months ago
ASP.NET (C#) Question

Extracting table-structured text from a PDF file

I'm extracting information from a PDF file into a string. When the text is laid out as a table in the PDF, the extracted text follows the order in which the reader encounters each line, rather than coming out cell by cell within each table row.

After hours of reading and searching, I would like some tips on how to approach this problem so that the string is structured as shown below.

PDF table structure

Current string:

Difenylmetandiisocyanat 9016-87-9 Acute Tox. 4; H332 >= 10 - < 20
Skin Irrit. 2; H315
Eye Irrit. 2; H319
Resp. Sens. 1; H334
Skin Sens. 1; H317
Carc. 2; H351
STOT SE 3; H335
STOT RE 2; H373
4,4'-metylendifenyldiisocyanat 101-68-8 Acute Tox. 4; H332 >= 10 - < 20
202-966-0 Skin Irrit. 2; H315
Eye Irrit. 2; H319
Resp. Sens. 1; H334
Skin Sens. 1; H317
Carc. 2; H351
STOT SE 3; H335
STOT RE 2; H373


Desired structure:

Difenylmetandiisocyanat

9016-87-9

Acute Tox. 4; H332
Skin Irrit. 2; H315
Eye Irrit. 2; H319
Resp. Sens. 1; H334
Skin Sens. 1; H317
Carc. 2; H351
STOT SE 3; H335
STOT RE 2; H373

>= 10 - < 20

4,4'-metylendifenyldiisocyanat

101-68-8
202-966-0

Acute Tox. 4; H332
Skin Irrit. 2; H315
Eye Irrit. 2; H319
Resp. Sens. 1; H334
Skin Sens. 1; H317
Carc. 2; H351
STOT SE 3; H335
STOT RE 2; H373

>= 10 - < 20

Answer

In your comment you say "There are no tags in the file". However, when I check the file, I clearly see a structure tree.

When a PDF is Tagged, you can easily convert it to XML:

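// Walk the structure tree of the tagged PDF and write it out as XML (iText 5).
import java.io.FileOutputStream;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.TaggedPdfReaderTool;
TaggedPdfReaderTool convertor = new TaggedPdfReaderTool();
convertor.convertToXml(
    new PdfReader("resources/pdfs/sds_w_sv_3.pdf"),
    new FileOutputStream("results/sds_w_sv_3.xml"));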

This is a snippet of the resulting XML file:

<Table>
<TR>
<TH>
<Span></Span>
<P>
Best&#229;ndsdelar
 </P>
</TH>
<TH>
<Span></Span>
<P>
CAS
-
nr.
 </P>
</TH>
<TH>
<Span></Span>
<P>
Kontrollparametrar
 </P>
</TH>
<TH>
<Span></Span>
<P>
Grundval
 </P>
</TH>

This XML is an HTML-like structure that lets you extract the table as a table. However, something must be wrong with the way the PDF was tagged, because not all of the information that is visible in the PDF ends up in the XML.

You can see this when you click on one of the first tags:

The content of the first <P> (paragraph) in the structure tree is AVSNITT 1 on page 40. What happened to the tags of the first 39 pages? This is a bad PDF file. It says that it's tagged, but at first sight it isn't properly tagged. You should ask the person who produced this file to properly tag it. Without proper tags, you will have a hard time finding a table-like structure programmatically.
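
Once you have a properly tagged file, reading the XML back as rows and cells could look roughly like the sketch below. It uses only the JDK's DOM parser. The class name TaggedTableReader and the method extractRows are made up for this illustration. It also assumes that the XML is well-formed with a single root element, and that data cells are tagged <TD> (the snippet above only shows <TH>). If the tool's output has several top-level elements, you would need to wrap it in a root element first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of iText: turns the structure-tree XML into rows of cell text.
public class TaggedTableReader {

    /** Returns one list of cell texts per <TR> element found in the XML. */
    public static List<List<String>> extractRows(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<List<String>> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagName("TR");
        for (int i = 0; i < trs.getLength(); i++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child instanceof Element) {
                    String tag = ((Element) child).getTagName();
                    // Assumption: header cells are <TH>, data cells are <TD>.
                    if (tag.equals("TH") || tag.equals("TD")) {
                        // Collapse line breaks and runs of spaces inside a cell.
                        cells.add(child.getTextContent().trim().replaceAll("\\s+", " "));
                    }
                }
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Small well-formed sample modelled on the snippet above.
        String xml = "<Table><TR>"
                + "<TH><Span></Span><P>\nBest&#229;ndsdelar\n </P></TH>"
                + "<TH><Span></Span><P>\nCAS\n-\nnr.\n </P></TH>"
                + "</TR></Table>";
        for (List<String> row : extractRows(xml)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Each inner list is one table row, so the substance name, CAS number, classification and concentration would end up in separate cells instead of being interleaved line by line. This only helps if the tags actually cover the table you are after, which is not the case for most of this particular file.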