Milan Kocic - 12 days ago
Python Question

PDFMiner gives strange letters

I am using Python 2.7 and PDFMiner to extract text from PDFs. I noticed that PDFMiner sometimes gives me words with strange letters where PDF viewers do not. For some PDF documents, PDFMiner and the PDF viewers return the same (strange) result, but there are documents where PDF viewers can recognize the text (via copy-paste) and PDFMiner cannot. Here is an example of the returned values:

from pdf viewer: ‫فتــح بـــاب ا�ستيــراد البيــ�ض والدجــــاج المجمـــد‬
from PDFMiner: ó

ªéªdG êÉ````LódGh ¢†``«ÑdG OGô``«à°SG ÜÉ
H í``àa

So my question is: can I get the same result as the PDF viewer, and what is wrong with PDFMiner? Is it missing encodings? I don't know.

Answer

Yes.

This happens when custom font encodings have been used (e.g. Identity-H, Identity-V) but the fonts have not been embedded properly.

PDFMiner gives garbage output in such cases because the encoding is required to map the character codes stored in the PDF back to readable text.
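To make the point about encodings concrete, here is a toy sketch in plain Python. It does not use PDFMiner. It only models the idea that a PDF stores character codes and that a per-font table is needed to turn them into text. The table `TO_UNICODE`, the function `decode_codes`, the sample codes, and the Latin-1 fallback are all hypothetical and chosen for illustration; they are not taken from any real font or from PDFMiner's internals.

```python
# -*- coding: utf-8 -*-
# Toy model: a text run is a list of character codes, and a per-font
# table says which Unicode character each code stands for.

# Hypothetical code-to-Unicode table for one font (illustrative values only).
TO_UNICODE = {
    0x61: u"\u0641",  # ARABIC LETTER FEH
    0xE0: u"\u062A",  # ARABIC LETTER TEH
    0x60: u"\u0640",  # ARABIC TATWEEL
    0xED: u"\u062D",  # ARABIC LETTER HAH
}


def decode_codes(codes, to_unicode=None):
    """Map character codes to text, guessing Latin-1 when no table entry exists."""
    table = to_unicode or {}
    chars = []
    for code in codes:
        # Assumption of this sketch: with no table entry, treat the code
        # as a single Latin-1 byte. A real extractor may guess differently.
        fallback = bytearray([code]).decode("latin-1")
        chars.append(table.get(code, fallback))
    return u"".join(chars)


codes = [0x61, 0xE0, 0x60, 0x60, 0xED]

with_table = decode_codes(codes, TO_UNICODE)  # readable Arabic letters
without_table = decode_codes(codes)           # accented Latin-1 characters
```

The same five codes give Arabic letters when the table is available and accented Latin characters when it is not. The second result is the kind of output shown in the question.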
