DuckPuncher DuckPuncher - 10 months ago 126
Python Question

Extracting text from a PDF file using PDFMiner in python?

Python Version 2.7

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python.

It looks like PDFMiner updated their API and all the relevant examples I have found contain outdated code(classes and methods have changed). The libraries I have found that make the task of extracting text from a PDF file easier are using the old PDFMiner syntax so I'm not sure how to do this.

As it is, I'm just looking at source-code to see if I can figure it out.

Answer Source

Here is a working example of extracting text from a PDF file using the current version of PDFMiner(September 2016)

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):

    text = retstr.getvalue()

    return text

PDFMiner's structure changed recently, so this should work for extracting text from the PDF files.