PDFMiner has evolved into a terrific tool. It gives direct, low-level access to PDF files and fine control over the extraction of data. Combined with document writers, recognition and image manipulation tools, and a little math, it offers much of the power of commercial tools for all but the most complex tasks. I plan to write later about OCR, Harris corner detection, and contour analysis using OpenCV, homebrew code, and Tesseract.
However, there is little documentation beyond basic extraction, and no listing of the Python package. The methods are discoverable but not listed in full. Existing documentation consists mainly of examples, despite the many modules and classes designed to complete a multitude of tasks. The aim of this article, hopefully the first in a two-part series, is to help with the extraction of information. The next step is the creation of a PDF document using a tool such as pisa or ReportLab, since PDFMiner itself only performs extraction.
There are several imports that will nearly always be used for document extraction. All live under the main pdfminer package. The list of imports can get quite long.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    import pdfminer.layout
    from pdfminer.layout import (LAParams, LTTextBox, LTTextLine, LTFigure,
                                 LTTextLineHorizontal, LTTextBoxHorizontal,
                                 LTImage, LTRect)
    from pdfminer.pdfpage import PDFPage
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
Some imports are meant to perform extraction and others are meant to check for and support the extraction of different types.
Visual Representation of Outputs
Imports for Extraction
The following table goes over the imports that perform actions on the document.
|Class|Purpose|
|---|---|
|PDFParser|The parser class normally passed to the PDFDocument; it helps obtain elements from the document.|
|PDFResourceManager|Helps with aggregation. Performs some tasks between the interpreter and the device.|
|PDFPageInterpreter|Obtains PDF objects for extraction. With the resource manager, it turns page objects into instructions for the device.|
|PDFPageAggregator|Takes in LAParams and a PDFResourceManager for getting text from individual pages.|
|PDFDocument|Holds the parser and allows for direct actions to be taken.|
|PDFDevice|Receives the instructions produced by the interpreter. PDFPageAggregator is one such device.|
|LAParams|Holds the layout analysis parameters that control how text is grouped during extraction.|
Imports that act as Types
Other imports serve as types to check against and give access to the properties of the other classes. The PDFDocument contains a variety of PDF objects, each of which holds its own information. That information includes the type, the coordinates, and the text displayed in the document. Images can also be handled.
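The objects used in this article include LTTextBox, LTTextBoxHorizontal, LTTextLine, LTTextLineHorizontal, LTFigure, LTRect, and LTImage.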
These types are useful for pulling information from tables, as explained by Julian Todd.
Creating the Document
Creating the document requires instantiating each of the parts present in the diagram above. First create a parser, then create the document from the parser. The PDFResourceManager and LAParams are passed as arguments to the PDFPageAggregator device used for this task. The PDFPageInterpreter accepts the resource manager and the aggregator. This code is typical of all parsers and also appears in the pure text extraction shown later.
Reading the file into an in-memory buffer can make extraction run more quickly. The buffer class is implemented in C. On Python 3, a file opened in binary mode needs io.BytesIO rather than StringIO.

    from io import BytesIO

    # Copy the file into memory and rewind the buffer.
    cstr = BytesIO()
    with open(fpath, 'rb') as fp:
        cstr.write(fp.read())
    cstr.seek(0)

    doc = PDFDocument(PDFParser(cstr))
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()  # layout of the page just processed
        self.parsePage(layout)
The get_result() method returns the layout of the page that was just processed. The results are passed to the parsePage method. For pure text extraction, a different device can write into a StringIO, and the text is then read back with .getvalue().
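The setup steps above can be wrapped in a small generator so that each page's layout is produced on demand. This is a minimal sketch rather than anything pdfminer provides: the name iter_layouts is hypothetical, and the pdfminer imports are deferred into the function body only so that the definition loads even where the library is not installed.

```python
from io import BytesIO


def iter_layouts(fpath):
    """Yield one layout object per page of the PDF at fpath.

    iter_layouts is a hypothetical helper name, not a pdfminer API.
    """
    # Deferred imports: an assumption made for this sketch so that the
    # module can be imported without pdfminer present.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    # Read the whole file into memory so the parser seeks in RAM.
    with open(fpath, 'rb') as fp:
        buf = BytesIO(fp.read())

    doc = PDFDocument(PDFParser(buf))
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        # get_result() returns the layout of the page just processed.
        yield device.get_result()
```

A caller would then loop with something like `for layout in iter_layouts(fpath): ...` and hand each layout to its own page-parsing method.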
The PDF Text Based Objects
The layout received from get_result() breaks the page into separate objects. Each object has several key components: its type, its coordinates (starting x, starting y, ending x, ending y), and its content.
The type of an object can be obtained with type(object) and compared directly to a layout class (e.g. type(object) == LTRect). For a rectangle object, that comparison returns True.
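One detail worth keeping in mind with exact type comparisons: in pdfminer, LTTextBoxHorizontal is a subclass of LTTextBox, so type(obj) == LTTextBox is False for a horizontal box while isinstance(obj, LTTextBox) is True. The sketch below illustrates the difference with stand-in classes. TextBox and TextBoxHorizontal are placeholders defined here, not pdfminer classes.

```python
class TextBox(object):
    """Stand-in for a base layout type (placeholder, not pdfminer)."""


class TextBoxHorizontal(TextBox):
    """Stand-in for a more specific subclass of TextBox."""


obj = TextBoxHorizontal()

exact_match = type(obj) == TextBox            # matches the exact type only
subclass_match = isinstance(obj, TextBox)     # accepts subclasses too
exact_self = type(obj) == TextBoxHorizontal   # exact type of the object

print(exact_match, subclass_match, exact_self)
```

With exact comparisons, every concrete type of interest has to be listed explicitly, which is why code written this way names both the base class and its horizontal variant.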
Getting the Output
Output is obtained and parsed through a series of method calls. The following example shows how to extract content.
    tcols = []
    # Reverse so that pop() returns the objects in their original order.
    objstack = list(reversed(layout._objs))
    while objstack:
        b = objstack.pop()
        print(b)
        if type(b) in [LTFigure, LTTextBox, LTTextLine, LTTextBoxHorizontal]:
            # Aggregate object: push its children onto the stack.
            objstack.extend(reversed(b._objs))
        elif type(b) == LTTextLineHorizontal:
            tcols.append(b)
This code uses a plain Python list as the object stack. A list already provides pop(), so although Python has a collections package (import collections) with additional data structures, nothing more specialized is needed here.
This example is a modification of Julian Todd's code, since I could not find solid documentation for pdfminer. It takes the objects from the layout and reverses them, since the list is used as a stack and pop() takes from the end. It then works down the stack, expanding anything that contains text and appending each text line to the list that stores them.
The resulting list (tcols) looks much like the pure extractions that can be performed with a variety of tools, including Java's PDFBox, pypdf, and even pdfminer itself. However, each object also carries its bbox (a bounding box coordinate list), and its text is accessible through .get_text().
Images are handled using the LTImage type, which has a few attributes in addition to coordinates and data. The image contains bits, colorspace, height, imagemask, name, srcsize, stream, and width.
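The effect of the two reversals is easier to see on a small nested structure. The following is a sketch of the same stack technique using stand-in classes. Container and Line are placeholders defined here for illustration and are not pdfminer types.

```python
class Container(object):
    """Stand-in for an aggregate layout object that holds children."""

    def __init__(self, *children):
        self._objs = list(children)


class Line(object):
    """Stand-in for a text line (a leaf in the layout tree)."""

    def __init__(self, text):
        self.text = text


def collect_lines(layout):
    """Return every Line under layout in document order."""
    found = []
    # Reverse so that pop(), which takes from the end, yields the first child.
    stack = list(reversed(layout._objs))
    while stack:
        item = stack.pop()
        if isinstance(item, Container):
            # Expand the container in place, again keeping child order.
            stack.extend(reversed(item._objs))
        elif isinstance(item, Line):
            found.append(item)
    return found


page = Container(
    Line("a"),
    Container(Line("b"), Container(Line("c")), Line("d")),
    Line("e"),
)
texts = [line.text for line in collect_lines(page)]
print(texts)  # ['a', 'b', 'c', 'd', 'e']
```

Without the reversals, the lines would come out last to first at every level of nesting.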
Extracting an image works as follows:
    if type(b) == LTImage:
        imgbits = b.bits  # bits per component; the encoded image data is in b.stream
PDFMiner only seems to extract JPEG objects. However, xpdf extracts all images.
A more automated, open source solution would be to use subprocess.Popen() to call a Java program that extracts images to a specific or provided folder, using code such as this (read the full article).
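    import shlex
    import subprocess

    cmd = ("java -jar myPDFBoxExtractorCompiledJar "
           "/home/username/Documents/readable.pdf "
           "/home/username/Documents/output.png")
    # Capture the program's stdout through a pipe.
    pipe = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE)
    out, _ = pipe.communicate()  # reads the output and waits for the process to exit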
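When calling an external extractor, it helps to capture its output and check the exit status rather than only waiting for it to finish. The following is a minimal sketch, assuming Python 3.7 or later. The command runs the Python interpreter as a stand-in for the Java program, and run_tool is a hypothetical helper name.

```python
import subprocess
import sys


def run_tool(args):
    """Run an external command; return (exit code, stdout, stderr) as text."""
    done = subprocess.run(args, capture_output=True, text=True)
    return done.returncode, done.stdout, done.stderr


# Stand-in command: a real call would pass the java -jar arguments instead.
code, out, err = run_tool([sys.executable, "-c", "print('extracted')"])
print(code, out.strip())
```

A non-zero exit code, or anything in the captured stderr, is a signal that the extractor failed and that no image was written.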
Handling the Output
Handling the output is fairly simple and forms the crux of this article's benefits, besides combining a variety of resources in a single place.
Just iterate down the stack and pull out the objects as needed. It is possible to rebuild the entire structure using the coordinates. The bounding box allows objects to be placed in a new data structure as they appear in any PDF document. With some analysis, generic algorithms are possible. Given the lack of documentation, it may be a good idea to write some exploratory code first.
The following extracts specific columns of an existing PDF. The bounding box list is set up as follows: bbox[0] is the starting x coordinate, bbox[1] is the starting y coordinate, bbox[2] is the ending x coordinate, and bbox[3] is the ending y coordinate.
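As one illustration of rebuilding structure from coordinates, text lines can be grouped into rows by their y position. This is a sketch under stated assumptions: Item is a stand-in for a pdfminer text line, group_rows is a hypothetical helper, the tolerance of 2.0 is arbitrary, and the sample values are made up. It relies on PDF coordinates starting at the bottom-left of the page, so larger y values are higher up.

```python
from collections import namedtuple

# Stand-in for a pdfminer text line: bbox is (x0, y0, x1, y1).
Item = namedtuple("Item", ["bbox", "text"])


def group_rows(items, tol=2.0):
    """Group items whose starting y lies within tol of the row's first item."""
    rows = []
    # PDF y coordinates grow upward, so sort descending for top-to-bottom order.
    for item in sorted(items, key=lambda it: -it.bbox[1]):
        if rows and abs(rows[-1][0].bbox[1] - item.bbox[1]) <= tol:
            rows[-1].append(item)
        else:
            rows.append([item])
    # Within each row, order the cells left to right by starting x.
    return [sorted(row, key=lambda it: it.bbox[0]) for row in rows]


# Made-up sample data, for illustration only.
items = [
    Item((176.0, 700.2, 240.0, 710.0), "A-100"),
    Item((20.5, 700.0, 120.0, 710.0), "Smith"),
    Item((20.7, 680.0, 120.0, 690.0), "Jones"),
    Item((176.3, 680.4, 240.0, 690.0), "B-200"),
]
table = [[cell.text for cell in row] for row in group_rows(items)]
print(table)  # [['Smith', 'A-100'], ['Jones', 'B-200']]
```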
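    # Each inner list selects one column by the starting x of its text lines
    # and skips that column's header cell.
    # NOTE: only the first comparison survives intact in the source; the range
    # bounds for the other columns are assumed from the adjacent positions.
    records, cases, dates, times, types, locations, attorneys = self.convertToDict([
        [x for x in tcols if float(x.bbox[0]) <= 21.0 and "Name\n" not in x.get_text()],
        [x for x in tcols if 176.0 <= x.bbox[0] < 257.0 and "Case\n" not in x.get_text()],
        [x for x in tcols if 257.0 <= x.bbox[0] < 307.0 and "Date\n" not in x.get_text()],
        [x for x in tcols if 307.0 <= x.bbox[0] < 354.0 and "Time\n" not in x.get_text()],
        [x for x in tcols if 354.0 <= x.bbox[0] < 607.0 and "Type\n" not in x.get_text()],
        [x for x in tcols if 607.0 <= x.bbox[0] < 645.0 and "Location\n" not in x.get_text()],
        [x for x in tcols if x.bbox[0] >= 645.0 and "Attorney\n" not in x.get_text()],
    ])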
This code uses Python's list comprehensions. The reason for the inequalities is that slight variations exist in the placement of objects. The newline escape character stands in for an underline in this case.
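The same idea can be written as a tolerance check against each column's expected position, which keeps the positions in one place. The following is a sketch only: Item is a stand-in for a pdfminer text line, assign_columns is a hypothetical helper, and the positions, tolerance, and rows are made up for illustration.

```python
from collections import namedtuple

# Stand-in for a pdfminer text line: bbox is (x0, y0, x1, y1).
Item = namedtuple("Item", ["bbox", "text"])


def assign_columns(items, columns, tol=1.0):
    """Map each column label to the text of items starting within tol of it.

    columns maps a header label to the expected starting x of that column.
    Header cells (text equal to the label plus a newline) are skipped.
    """
    result = {label: [] for label in columns}
    for item in items:
        for label, x_pos in columns.items():
            if abs(item.bbox[0] - x_pos) <= tol:
                if item.text != label + "\n":
                    result[label].append(item.text)
                break
    return result


# Made-up positions and rows, for illustration only.
columns = {"Name": 20.5, "Case": 176.0}
items = [
    Item((20.0, 720.0, 60.0, 730.0), "Name\n"),
    Item((176.0, 720.0, 210.0, 730.0), "Case\n"),
    Item((20.8, 700.0, 90.0, 710.0), "Smith\n"),
    Item((175.6, 700.0, 230.0, 710.0), "A-100\n"),
]
by_column = assign_columns(items, columns)
print(by_column)
```

Items that start near none of the listed positions are simply left out, so the tolerance needs to be chosen by inspecting the bbox values of a sample document.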
Pure Text Extraction
To see how pure text extraction works, and to better understand the code above, analyze the following code.
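    from io import StringIO  # on Python 2, use cStringIO.StringIO instead
    from pdfminer.converter import TextConverter

    # Body of a method that returns the extracted text for fpath.
    with open(fpath, 'rb') as fp:
        doc = PDFDocument(PDFParser(fp))
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        laparams = LAParams()
        codec = 'utf-8'
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        lines = ""
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
        # Read the buffer once, after every page has been written to it.
        rstr = retstr.getvalue()
        if len(rstr.strip()) > 0:
            lines += rstr
    return lines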
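Because getvalue() returns everything written to the buffer so far, calling it after each page yields cumulative text rather than that page alone. If per-page text is wanted, one option is to clear the buffer between pages. The sketch below shows that pattern with a stand-in writer in place of the interpreter. The names collect_pages and fake_write are hypothetical.

```python
from io import StringIO


def collect_pages(pages, write_page):
    """Return one string per page by clearing the buffer after each page."""
    buf = StringIO()
    texts = []
    for page in pages:
        write_page(page, buf)  # stand-in for interpreter.process_page(page)
        texts.append(buf.getvalue())
        # Reset the buffer; without this, getvalue() would return the text
        # of every page written so far.
        buf.seek(0)
        buf.truncate(0)
    return texts


# Stand-in writer: real code would have the TextConverter write into buf.
def fake_write(page, buf):
    buf.write("text of " + page)


per_page = collect_pages(["page 1", "page 2"], fake_write)
print(per_page)  # ['text of page 1', 'text of page 2']
```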
PDFMiner is a useful tool that can read PDFs along with their actual formatting. The tool is flexible and makes it easy to work with the extracted strings. Extracting data this way is much easier than with some full-text analysis, which can produce garbled and misplaced lines. Not all PDFs are made equal.