Milan Kocic - 1 year ago
Python Question

PDFminer gives strange letters

I am using Python 2.7 and PDFMiner to extract text from PDFs. Sometimes PDFMiner returns words with strange letters, while PDF viewers show the text correctly. For some PDF documents, PDFMiner and the PDF viewers return the same (strange) result, but for others the viewers can recognize the text (copy-paste works) and PDFMiner cannot. Here is an example of the returned values:

from pdf viewer: ‫فتــح بـــاب ا�ستيــراد البيــ�ض والدجــــاج المجمـــد‬
from PDFMiner: ó

ªéªdG êÉ````LódGh ¢†``«ÑdG OGô``«à°SG ÜÉ
H í``àa

So my question is: can I get the same result as the PDF viewer, and what is wrong with PDFMiner? Is it missing encodings? I don't know.

Answer


This happens when custom font encodings have been used (e.g. Identity-H, Identity-V) but the fonts have not been embedded properly.

PDFMiner gives garbage output in such cases because the encoding is required to interpret the text: without it, the character codes stored in the PDF cannot be mapped back to the letters they represent.
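A rough way to catch this kind of output automatically is to check whether the extracted text contains letters from the script you expect. The sketch below is only a heuristic and is not part of PDFMiner; the function names `script_ratio` and `looks_garbled`, the 0.5 threshold, and the short sample strings (abbreviated from the example in the question) are assumptions made for illustration.

```python
import unicodedata


def script_ratio(text, script_prefix):
    """Return the share of letters in `text` whose Unicode name starts with `script_prefix`."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return hits / float(len(letters))


def looks_garbled(text, script_prefix="ARABIC", threshold=0.5):
    """Hypothetical check: flag text with too few letters from the expected script."""
    return script_ratio(text, script_prefix) < threshold


# Two Arabic words as a PDF viewer would copy them (written as escapes).
viewer_sample = u"\u0641\u062a\u062d \u0628\u0627\u0628"
# A fragment of the kind of output PDFMiner returned in the question.
pdfminer_sample = u"\u00aa\u00e9\u00aadG \u00ea\u00c9"

print(looks_garbled(viewer_sample))
print(looks_garbled(pdfminer_sample))
```

This only detects the problem. It does not recover the text, since the mapping from character codes to letters is still missing from the PDF.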
