Olson Olson - 1 year ago 489
Python Question

How to extract PDF fields from a filled out form in Python?

I'm trying to use Python to processes some PDF forms that were filled out and signed using Adobe Acrobat Reader.

I've tried:

  • The pdfminer demo: it didn't dump any of the filled out data.

  • pyPdf: it maxed a core for 2 minutes when I tried to load the file with PdfFileReader(f) and I just gave up and killed it.

  • Jython and PDFBox: got that working great but the startup time is excessive, I'll just write an external utility in straight Java if that's my only option.

I can keep hunting for libraries and trying them but I'm hoping someone already has an efficient solution for this.

Update: Based on Steven's answer I looked into pdfminer and it did the trick nicely.

from argparse import ArgumentParser
import pickle
import pprint
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdftypes import resolve1, PDFObjRef

def load_form(filename):
"""Load pdf form contents into a nested list of name/value tuples"""
with open(filename, 'rb') as file:
parser = PDFParser(file)
doc = PDFDocument()
return [load_fields(resolve1(f)) for f in

def load_fields(field):
"""Recursively load form fields"""
form = field.get('Kids', None)
if form:
return [load_fields(resolve1(f)) for f in form]
# Some field types, like signatures, need extra resolving
return (field.get('T').decode('utf-16'), resolve1(field.get('V')))

def parse_cli():
"""Load command line arguments"""
parser = ArgumentParser(description='Dump the form contents of a PDF.')
parser.add_argument('file', metavar='pdf_form',
help='PDF Form to dump the contents of')
parser.add_argument('-o', '--out', help='Write output to file',
default=None, metavar='FILE')
parser.add_argument('-p', '--pickle', action='store_true', default=False,
help='Format output for python consumption')
return parser.parse_args()

def main():
args = parse_cli()
form = load_form(args.file)
if args.out:
with open(args.out, 'w') as outfile:
if args.pickle:
pickle.dump(form, outfile)
pp = pprint.PrettyPrinter(indent=2)
if args.pickle:
print pickle.dumps(form)
pp = pprint.PrettyPrinter(indent=2)

if __name__ == '__main__':

Answer Source

You should be able to do it with pdfminer, but it will require some delving into the internals of pdfminer and some knowledge about the pdf format (wrt forms of course, but also about pdf's internal structures like "dictionaries" and "indirect objects").

This example might help you on your way (I think it will work only on simple cases, with no nested fields etc...)

import sys
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

filename = sys.argv[1]
fp = open(filename, 'rb')

parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog['AcroForm'])['Fields']
for i in fields:
    field = resolve1(i)
    name, value = field.get('T'), field.get('V')
    print '{0}: {1}'.format(name, value)

EDIT: forgot to mention: if you need to provide a password, pass it to doc.initialize()