Canny Edge Detection in Java

Maybe you don’t have a decent graphics card, CUDA in C and PyCUDA are not options because you don’t have an NVIDIA card, or you just want something completely cross-platform without a large amount of research. Canny edge detection in straight Java does not need to be slow.

A faster and even memory-efficient Canny edge detection algorithm only requires loops and the proxy design pattern. Basically, simple code applied to the theory will do the trick. All of the code is available at my GitHub repository.

Pre-Processing
The most important initial task is to pre-process the image to bring out the edges that are wanted and obscure those that might otherwise get caught up in the detection. This is likely standard practice, since unwanted edges are still edges. It is possible to apply a series of simple corrections such as a Gaussian or box blur, a denoise, or an unsharp mask. The links in the previous sentence were not used in the following blur or my code libraries but the concept remains the same.



Gaussian Blur Example
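The Gaussian blur is an equation-based way of getting rid of unwanted edges. It replaces each pixel with a weighted average of the pixels within a certain window, with the weights taken from Gauss’ statistical normal distribution equation:

G(x, y) = 1 / (2πσ²) · exp(−(x² + y²) / (2σ²))

Here x and y are the offsets from the center of the window and σ sets the strength of the blur (the radius argument in the code that follows).

This equation is implemented simply in Java by applying a window to an image.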

private void blurGauss(double radius) {
		// create a kernel and implement a gaussian blur
		// this is a somewhat lazy implementation setting a definite center and
		// not wrapping to avoid black space

		// the radius of the kernel should be at least 3*the blur factor
		BufferedImage proxyimage=image.getImage();
		
		int rows_and_columns = (int) Math.round(radius);

		// the kernel needs an odd size so that it has a single center cell
		if (rows_and_columns % 2 == 0 && rows_and_columns != 0) {
			rows_and_columns++;
		}

		// fall back to a 3x3 kernel when the kernel is wider than the image
		if (rows_and_columns > proxyimage.getWidth()) {
			rows_and_columns = 3;
		}
		int centerx = ((rows_and_columns + 1) / 2) - 1;
		int centery = ((rows_and_columns + 1) / 2) - 1;

		// the kernel sum
		float sum_multiplier = 0;

		/* get the kernel */
		// the base for gaussian filtering
		float base_start = (float) (1 / (2 * Math.PI * Math.pow(radius, 2)));

		// the multiplier matrix to be applied to every pixel, normalized
		// further down so that its weights sum to one
		float[][] arr = new float[rows_and_columns][rows_and_columns];

		for (int i = 0; i < rows_and_columns; i++) {
			for (int j = 0; j < rows_and_columns; j++) {
				// the exponent -(dx^2 + dy^2) / (2 * radius^2), measured
				// from the center of the kernel
				float exp = (float) (-1.0
						* (Math.pow(i - centerx, 2) + Math.pow(j - centery, 2))
						/ (2 * Math.pow(radius, 2)));
				float base = (float) (base_start * Math.exp(exp));
				arr[i][j] = base;

				sum_multiplier += base;
			}

		}

		/* replace the values by multiplying by the sum_multiplier */
		// get the multiplier
		sum_multiplier = (float) 1 / sum_multiplier;

		// multiply by the sum multiplier for each number
		for (int i = 0; i < rows_and_columns; i++) {
			for (int j = 0; j < rows_and_columns; j++) {
				arr[i][j] = arr[i][j] * sum_multiplier;

			}

		}
		// blur the image using the matrix
		complete_gauss(arr, rows_and_columns, centerx, centery);
	}

	private void complete_gauss(float[][] arr, int rows_and_columns,int centerx, int centery) {
		// TODO complete the gaussian blur by applying the kernel for each pixel
		
		BufferedImage proxyimage=image.getImage();
		
		// the blurred image
		BufferedImage image2 = new BufferedImage(proxyimage.getWidth(),proxyimage.getHeight(), BufferedImage.TYPE_INT_RGB);

		// the r,g,b, values
		int r = 0;
		int g = 0;
		int b = 0;

		int i = 0;
		int j = 0;

		// the image height and width
		int width = image2.getWidth();
		int height = image2.getHeight();

		int tempi = 0;
		int tempj = 0;
		int thisx = 0;
		int thisy = 0;
		if (arr.length != 1) {

			for (int x = 0; x < width; x++) {
				for (int y = 0; y < height; y++) {

					// the values surrounding the pixel and the resulting blur
					// multiply pixel and its neighbors by the appropriate
					// amount

					i = (int) -Math.ceil((double) rows_and_columns / 2);
					j = (int) -Math.ceil((double) rows_and_columns / 2);

					while (i < Math.ceil((double) rows_and_columns / 2)
							&& j < Math.ceil((double) rows_and_columns / 2)) {

						// sets the pixel coordinates, falling back to 0 when
						// the window reaches past the image border
						thisx = x + i;

						if (thisx < 0 || thisx >= proxyimage.getWidth()) {
							thisx = 0;
						}

						thisy = y + j;

						if (thisy < 0 || thisy >= proxyimage.getHeight()) {
							thisy = 0;
						}

						// the implementation
						tempi = (int) (Math
								.round(((double) rows_and_columns / 2)) + i);
						tempj = (int) (Math
								.round(((double) rows_and_columns / 2)) + j);

						if (tempi >= arr[0].length) {
							tempi = 0;
						}

						if (tempj >= arr[0].length) {
							tempj = 0;
						}

						r += (new Color(proxyimage.getRGB((thisx), (thisy)))
								.getRed() * arr[tempi][tempj]);
						g += (new Color(proxyimage.getRGB((thisx), (thisy)))
								.getGreen() * arr[tempi][tempj]);
						b += (new Color(proxyimage.getRGB((thisx), (thisy)))
								.getBlue() * arr[tempi][tempj]);

						j++;

						if (j == Math.round((double) rows_and_columns / 2)) {
							j = 0;
							i++;
						}

					}

					// set the new rgb values with a brightening factor
					int brighten = arr[0].length * arr[0].length;
					r = Math.min(255, Math.max(0, r + brighten));
					g = Math.min(255, Math.max(0, g + brighten));
					b = Math.min(255, Math.max(0, b + brighten));

					Color rgb = new Color(r, g, b);
					image2.setRGB(x, y, rgb.getRGB());
					r = 0;
					g = 0;
					b = 0;

					i = 0;
					j = 0;
				}
			}
			image.setImage(image2);
		}
	}

A matrix is generated and then used to blur the image in several loops. Methods were created to make the code more understandable.
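The kernel-generation step can also be written compactly. The following is a minimal sketch in Python (the language of the OpenCV example later in this post) rather than a port of the Java method: the function name gaussian_kernel and its parameters are made up for illustration. The constant 1 / (2πσ²) is left out because it cancels when the weights are normalized to sum to one.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a size x size list of Gaussian weights that sums to 1."""
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd integer")
    center = size // 2
    kernel = []
    for i in range(size):
        row = []
        for j in range(size):
            # squared distance of this cell from the kernel center
            dist_sq = (i - center) ** 2 + (j - center) ** 2
            row.append(math.exp(-dist_sq / (2.0 * sigma ** 2)))
        kernel.append(row)
    # normalize so that blurring does not change overall brightness
    total = sum(sum(row) for row in kernel)
    return [[weight / total for weight in row] for row in kernel]


kernel = gaussian_kernel(3, 1.0)
```

The center cell always carries the largest weight, and cells at the same distance from the center carry equal weights.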

Although an “is-a” classification is often associated with interfaces or abstract classes, the proxy design pattern is better implemented with an interface that controls access to the “expensive object.”
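As a rough illustration of the proxy idea, here is a small Python sketch. The class names ExpensiveImage and ImageProxy are hypothetical and are not taken from my repository; the point is only that callers go through get_image(), and the costly object is created on first use and reused afterwards.

```python
class ExpensiveImage:
    """Stands in for a costly object such as a decoded bitmap."""

    loads = 0  # counts how many times the costly construction ran

    def __init__(self, path):
        ExpensiveImage.loads += 1
        self.path = path
        self.pixels = [[0, 0], [0, 0]]  # placeholder pixel data


class ImageProxy:
    """Controls access to the expensive image and creates it lazily."""

    def __init__(self, path):
        self._path = path
        self._real = None

    def get_image(self):
        # build the real object only when somebody actually needs it
        if self._real is None:
            self._real = ExpensiveImage(self._path)
        return self._real


proxy = ImageProxy("photo.png")  # nothing is loaded yet
first = proxy.get_image()        # the image is loaded here
second = proxy.get_image()       # the loaded image is reused
```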

Steps to Canny Edge Detection
Canny edge detection takes three steps. These steps prepare the image, mark potential edges, and keep only the best edges.

They are:

  1. Blur with or without denoise and convert to greyscale
  2. Perform an analysis based on threshold values using an intensity gradient
  3. Perform hysteresis

Sobel Based Intensity Gradient
One version of the intensity gradient (read here for more depth on the algorithm) is derived using the Sobel gradient. The gradient is applied in a similar way to the blur, using a window and a matrix.

The matrix responds to specific changes in intensity to discover which potential edges are the best candidates. The image is convolved with the matrix to obtain the result.
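To make the window-and-matrix idea concrete, the following Python sketch applies the two Sobel kernels to a tiny greyscale grid held in nested lists. It is an illustration under simplifying assumptions (plain lists instead of a BufferedImage, border pixels left at 0), and the name sobel_magnitude is made up for this example.

```python
def sobel_magnitude(grey):
    """Approximate |G| = |Gx| + |Gy| for each interior pixel."""
    height, width = len(grey), len(grey[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # horizontal change: right column minus left column
            gx = (grey[y - 1][x + 1] + 2 * grey[y][x + 1] + grey[y + 1][x + 1]
                  - grey[y - 1][x - 1] - 2 * grey[y][x - 1] - grey[y + 1][x - 1])
            # vertical change: bottom row minus top row
            gy = (grey[y + 1][x - 1] + 2 * grey[y + 1][x] + grey[y + 1][x + 1]
                  - grey[y - 1][x - 1] - 2 * grey[y - 1][x] - grey[y - 1][x + 1])
            out[y][x] = min(255, abs(gx) + abs(gy))
    return out


# a vertical step edge: dark on the left, bright on the right
grey = [[0, 0, 255, 255] for _ in range(4)]
edges = sobel_magnitude(grey)
```

A flat region gives a magnitude of 0, while the pixels on either side of the step saturate at 255.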

Perform Hysteresis
Hysteresis weeds out the remaining noise from the image, leaving the actual edges. This is necessary when using the Sobel gradient since it finds too many candidates. The trick is to separate edges from non-edges using two threshold values applied to the intensity gradient. Pixels above the upper threshold are kept as edges, pixels below the lower threshold are thrown out, and pixels in between are kept only when they connect to a pixel that is already part of an edge.

A faster way to perform this, if necessary, is to use a depth-first-search-like algorithm that follows each edge to its ends, taking connected edges and leaving the rest. This approach is fairly accurate.
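The depth-first idea can be sketched with an explicit stack, which avoids deep recursion. This Python example is a simplified illustration and not the Java code from this post: it works on a nested list of gradient magnitudes, and the function name and the threshold values are chosen only for the example.

```python
def hysteresis(mag, lower, upper):
    """Keep strong pixels (> upper) and weak pixels (> lower) joined to them."""
    height, width = len(mag), len(mag[0])
    keep = [[False] * width for _ in range(height)]
    # seed the stack with every strong pixel
    stack = [(y, x) for y in range(height) for x in range(width)
             if mag[y][x] > upper]
    for y, x in stack:
        keep[y][x] = True
    # depth-first walk outward from the seeds through weak neighbours
    while stack:
        y, x = stack.pop()
        for ny in range(max(0, y - 1), min(height, y + 2)):
            for nx in range(max(0, x - 1), min(width, x + 2)):
                if not keep[ny][nx] and mag[ny][nx] > lower:
                    keep[ny][nx] = True
                    stack.append((ny, nx))
    return keep


mag = [
    [0, 0, 0, 0, 0],
    [0, 200, 80, 80, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 80],
]
kept = hysteresis(mag, lower=50, upper=150)
```

Here the two weak pixels next to the strong one are kept, while the isolated weak pixel in the bottom-right corner is dropped.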

The Code
Sobel Based Intensity Gradient

private void getIntensity() {
		// calculate the gradient magnitude
		/*
		 * Kernels
		 * 
		 * G(x)        G(y)
		 * -1| 0| 1    -1|-2|-1
		 * -2| 0| 2     0| 0| 0
		 * -1| 0| 1     1| 2| 1
		 * 
		 * |G| (magnitude for each cell) approx. = |G(x)| + |G(y)| =
		 * |(p1+2p2+p3)-(p7+2p8+p9)| + |(p3+2p6+p9)-(p1+2p4+p7)|
		 * blank rows or columns are left out of the calc.
		 */

		// the buffered image
		BufferedImage image2 = new BufferedImage(image.getWidth(),
				image.getHeight(), BufferedImage.TYPE_BYTE_GRAY);

		// gives ultimate control can also use image libraries
		// the current position properties
		int x = 0;
		int y = 0;

		// the image width and height properties
		int width = image.getWidth();
		int height = image.getHeight();

		// iterate through the image
		for (y = 1; y < height - 1; y++) {
			for (x = 1; x < width - 1; x++) {
				// convert to greyscale by masking (32 bit color representing
				// intensity --> reduce to greyscale by taking only set bits)
				// gets the pixels surrounding the center (the center is always
				// weighted at 0 in the convolution matrix)
				int c1 = (image.getRGB(x - 1, y - 1) & 0xFF);
				int c2 = (image.getRGB(x - 1, y) & 0xFF);
				int c3 = (image.getRGB(x - 1, y + 1) & 0xFF);
				int c4 = (image.getRGB(x, y - 1) & 0xFF);
				int c6 = (image.getRGB(x, y + 1) & 0xFF);
				int c7 = (image.getRGB(x + 1, y - 1) & 0xFF);
				int c8 = (image.getRGB(x + 1, y) & 0xFF);
				int c9 = (image.getRGB(x + 1, y + 1) & 0xFF);

				// apply the magnitude of the convolution kernel (blank
				// column/row not applied)
				// differential x and y gradients are as follows
				// (non-maximum suppression is not applied in this step)
				/*
				 * Lxx = |1,-2,1|*L Lyy= {1,-2,1}*L ({} because its vertical and
				 * not horizontal)
				 */
				int color = Math.abs((c1 + (2 * c2) + c3)
						- (c7 + (2 * c8) + c9))
						+ Math.abs((c3 + (2 * c6) + c9) - (c1 + (2 * c4) + c7));

				// trim to fit the appropriate color pattern
				color = Math.min(255, Math.max(0, color));

				// set new pixel of the edge, writing the intensity to all
				// three channels so that it is stored as a grey value
				image2.setRGB(x, y, (color << 16) | (color << 8) | color);
			}
		}

		// reset the image
		image = image2;
	}

Hysteresis

private void hysterisis() {
		// TODO perform a non-greedy hysterisis using upper and lower threshold
		// values
		int width = image.getWidth();
		int height = image.getHeight();

		Color curcol = null;
		int r = 0;
		int g = 0;
		int b = 0;

		ve = new String[width][height];

		for (int i = 0; i < width; i++) {
			for (int j = 0; j < height; j++) {
				ve[i][j] = "n";
			}
		}

		for (int i = 0; i < height; i++) {

			for (int j = 0; j < width; j++) {
				curcol = new Color(image.getRGB(j, i));
				if (ve[j][i].compareTo("n") == 0
						&& (((curcol.getRed() + curcol.getBlue() + curcol
								.getGreen()) / 3) > upperthreshold)) {
					ve[j][i] = "c";
					image.setRGB(j, i, new Color(255, 255, 255).getRGB());

					follow_edge(j, i, width, height);
				} else if (ve[j][i].compareTo("n") == 0) {
					ve[j][i] = "v";
					image.setRGB(j, i, new Color(0, 0, 0).getRGB());
				}

			}
		}

	}

Depth First Like Noise Reduction

private void follow_edge(int j, int i, int width, int height) {
		// recursively search edges (every pixel is marked at most once, so
		// there are fewer steps than the number of pixels, although a very
		// long edge can still produce a deep call stack)

		// search the eight side boxes for a proper edge, marking non-edges as
		// visited and following any edge that is found
		Color curcol = null;

		for (int y = i - 1; y <= i + 1; y++) {
			for (int x = j - 1; x <= j + 1; x++) {
				// skip the center pixel and anything outside of the image
				if ((x == j && y == i) || x < 0 || x >= width || y < 0
						|| y >= height) {
					continue;
				}

				// skip pixels that were already classified
				if (ve[x][y].compareTo("n") != 0) {
					continue;
				}

				// check color
				curcol = new Color(image.getRGB(x, y));
				if (((curcol.getRed() + curcol.getBlue() + curcol
						.getGreen()) / 3) > lowerthreshold) {
					ve[x][y] = "c";
					image.setRGB(x, y, new Color(255, 255, 255).getRGB());

					follow_edge(x, y, width, height);
				} else {
					ve[x][y] = "v";
					image.setRGB(x, y, new Color(0, 0, 0).getRGB());
				}
			}
		}

	}

Laplace Based Intensity Gradient as Opposed to Sobel

The Sobel gradient is not the only method for performing intensity analysis. A Laplacian operator can be used to obtain a different matrix. The Sobel detector is less sensitive to light differences and yields both magnitude and direction but is slightly more complicated. The Laplace gradient may also reduce the need for post-processing as the Sobel gradient normally accepts too many values.

The Laplace kernel used here has weights that sum to 0, so areas of constant color produce no response. The matrix is as follows.

-1 -1 -1
-1 8 -1
-1 -1 -1

The matrix is used to transform each pixel’s RGB value based on whether or not it is part of a candidate edge.

private void find_all_edges() {
		// TODO find all edges using laplace rather than sobel and hysterisis
		// (noise can interfere with the result)
		// the new buffered image containing the edges
		BufferedImage image2 = new BufferedImage(image.getWidth(),
				image.getHeight(), BufferedImage.TYPE_INT_RGB);

		// gives ultimate control can also use image libraries
		// the current position properties
		int x = 0;
		int y = 0;

		// the image width and height properties
		int width = image.getWidth();
		int height = image.getHeight();

		/*
		 * Denoise Using Rewritten Code found at
		 * http://introcs.cs.princeton.edu/java/31datatype/LaplaceFilter.java.html
		 * 
		 * Using laplace is better than averaging the neighbors from each part
		 * of an image as it does a better job of getting rid of gaussian noise
		 * without overdoing it
		 * 
		 * Applies a default filter:
		 * 
		 * -1|-1|-1
		 * -1| 8|-1
		 * -1|-1|-1
		 */

		// perform the laplace for each number
		for (y = 1; y < height - 1; y++) {
			for (x = 1; x < width - 1; x++) {

				// get the neighbor pixels for the transform
				Color c00 = new Color(image.getRGB(x - 1, y - 1));
				Color c01 = new Color(image.getRGB(x - 1, y));
				Color c02 = new Color(image.getRGB(x - 1, y + 1));
				Color c10 = new Color(image.getRGB(x, y - 1));
				Color c11 = new Color(image.getRGB(x, y));
				Color c12 = new Color(image.getRGB(x, y + 1));
				Color c20 = new Color(image.getRGB(x + 1, y - 1));
				Color c21 = new Color(image.getRGB(x + 1, y));
				Color c22 = new Color(image.getRGB(x + 1, y + 1));

				/* apply the matrix */
				// to check, try using gauss jordan

				// apply the transformation for r
				int r = -c00.getRed() - c01.getRed() - c02.getRed()
						+ -c10.getRed() + 8 * c11.getRed() - c12.getRed()
						+ -c20.getRed() - c21.getRed() - c22.getRed();

				// apply the transformation for g
				int g = -c00.getGreen() - c01.getGreen() - c02.getGreen()
						+ -c10.getGreen() + 8 * c11.getGreen() - c12.getGreen()
						+ -c20.getGreen() - c21.getGreen() - c22.getGreen();

				// apply the transformation for b
				int b = -c00.getBlue() - c01.getBlue() - c02.getBlue()
						+ -c10.getBlue() + 8 * c11.getBlue() - c12.getBlue()
						+ -c20.getBlue() - c21.getBlue() - c22.getBlue();

				// set the new rgb values
				r = Math.min(255, Math.max(0, r));
				g = Math.min(255, Math.max(0, g));
				b = Math.min(255, Math.max(0, b));
				Color c = new Color(r, g, b);

				image2.setRGB(x, y, c.getRGB());
			}
		}
		image = image2;
	}
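The behaviour of this matrix is easy to check by hand on a 3x3 patch. The short Python sketch below does that for a single channel; the helper name laplace_response is made up for this example, and the clamping to 0..255 mirrors what the Java method does.

```python
LAPLACE = [[-1, -1, -1],
           [-1, 8, -1],
           [-1, -1, -1]]


def laplace_response(grey, x, y):
    """Apply the 3x3 kernel centred on (x, y) and clamp to 0..255."""
    total = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            total += LAPLACE[dy + 1][dx + 1] * grey[y + dy][x + dx]
    return min(255, max(0, total))


flat = [[90] * 3 for _ in range(3)]                   # constant patch
spot = [[10, 10, 10], [10, 40, 10], [10, 10, 10]]     # bright center
```

The constant patch gives 8 * 90 - 8 * 90 = 0, while the bright center gives 8 * 40 - 8 * 10 = 240.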

Output

The following before and after pictures show the results of applying the Sobel based matrix to the image and using the depth first search like approach to hysteresis.

Before
cross-country before

After

edge detected

Python Based Edge Detection

In Python, OpenCV performs this operation in a single line.

import cv2

# imagepath is the path to the input image
image=cv2.imread(imagepath)
edges=cv2.Canny(image,100,200)

Python PDF 2: Writing and Manipulating a PDF with PyPDF2 and ReportLab

Note: PdfMiner3K is out and uses a nearly identical API to this one. Fully working code examples are available from my Github account with Python 3 examples at CrawlerAids3 and Python 2 at CrawlerAids (both currently developed)

In my previous post on pdfMiner, I wrote on how to extract information from a pdf. For completeness, I will discuss how PyPDF2 and reportlab can be used to write a pdf and manipulate an existing pdf. I am learning as I go here. This is some low hanging fruit meant to provide a fuller picture. Also, I am quite busy.

PyPDF and reportlab do not offer the completeness in extraction that pdfMiner offers. However, they offer a way of writing to existing pdfs and reportlab allows for document creation. For Java, try PDFBox.

However, PyPdf is becoming extinct and PyPDF2 has broken pages on its website. The packages are still available from pip, easy_install, and GitHub. The mixture of reportlab and PyPDF is a bit bizarre.


PyPDF2 Documentation

PyPdf, unlike pdfMiner, is well documented. The author of the original PyPdf also wrote an in depth review with code samples. If you are looking for an in depth manual for use of the tool, it is best to start there.

Report Lab Documentation

Report lab documentation is available to build from the bitbucket repositories.

Installing PyPdf and ReportLab

Pypdf2 and reportlab are easy to install. Additionally, PyPDF2 can be installed from the python package site and reportlab can be cloned.

   easy_install pypdf2
   pip install pypdf2
   
   easy_install reportlab
   pip install reportlab

ReportLab Initialization

The necessary part of ReportLab is the canvas object. ReportLab ships several page sizes in reportlab.lib.pagesizes, such as letter and legal, along with the portrait and landscape helper functions. The canvas object is instantiated with a file name (or a file-like object) and a page size.

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter, portrait

PyPdf2 Initialization

PyPDF2 has a relatively simple setup. A PdfFileWriter is initialized to add to the document, as opposed to the PdfFileReader, which reads from the document. The reader takes a file object as its parameter. The writer takes an output file at write time.

   from PyPDF2 import PdfFileWriter, PdfFileReader
   # a reader
   reader=PdfFileReader(open("fpath",'rb'))
   
   # a writer
   writer=PdfFileWriter()
   outfp=open("outpath",'wb')
   writer.write(outfp)

All of this can be found under the review by the writer of the original pyPdf.

Working with PyPdf2 Pages and Objects

Before writing to a pdf, it is useful to know how to create the structure and add objects. With the PdfFileWriter, it is possible to use the following methods (an IDE or the documentation will give more depth).

  • addBlankPage-create a new page
  • addBookmark-add a bookmark to the pdf
  • addLink- add a link in a specific rectangular area
  • addMetadata-add metadata to the pdf
  • insertPage-adds a page at a specific index
  • insertBlankPage-insert a blank page at a specific index
  • addNamedDestination-add a named destination object to the page
  • addNamedDestinationObject-add a created named destination to the page
  • encrypt-encrypt the pdf (setting use_128bit to True creates 128 bit encryption and False creates 40 bit encryption with a default of 128 bits)
  • removeLinks-removes links by object
  • removeText-removes text by text object
  • setPageMode-set the page mode (e.g. /FullScreen,/UseOutlines,/UseThumbs,/UseNone)
  • setPageLayout-set the layout(e.g. /NoLayout,/SinglePage,/OneColumn,/TwoColumnLeft)
  • getPage-get a page by index
  • getPageLayout-get the layout
  • getPageMode-get the page mode
  • getOutlineRoot-get the root outline

ReportLab Objects

Report lab also contains a set of objects. Documentation can be found here. It appears that postscript or something similar is used for writing documents to a page in report lab. Using ghostscript, it is possible to learn postscript. Postscript is like assembler and involves manipulating a stack to create a page. It was developed at least in part by Adobe Systems, Inc. back in the 1980s and before my time on earth began.

Some canvas methods are:

  • addFont-add a font object
  • addOutlineEntry-add an outline type to the pdf
  • addPostScriptCommand-add postscript to the document
  • addPageLabel-add a page label to the document canvas
  • arc-draw an arc in a postscript like manner
  • beginText-create a text element
  • bezier-create a postscript like bezier curve
  • drawString-draw a string
  • drawText-draw a text object
  • drawPath-draw a postscript like path
  • drawAlignedString-draw a string on a pivot character
  • drawImage-draw an image
  • ellipse-draw an ellipse on a bounding box
  • circle-draw a circle
  • rect-draw a rectangle

Write a String to a PDF

Two tasks dominate the writing of pdf files: writing images and writing strings to the document. Both are handled through the canvas object.

Here, I have added some text and a circle to a pdf.

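from io import BytesIO

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def writeString():
    fpath="C:/Users/andy/Documents/temp.pdf"
    # an in-memory buffer that receives the pdf bytes
    packet=BytesIO()
    cv=canvas.Canvas(packet, pagesize=letter)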
    
    #create a string
    cv.drawString(0, 500, "Hello World!")
    #a circle. Do not add another string. This draws on a new page.
    cv.circle(50, 250, 20, stroke=1, fill=0)
    
    #save to string
    cv.save()
    
    #get back to 0
    packet.seek(0)
    
    #write to a file
    with open(fpath,'wb') as fp:
        fp.write(packet.getvalue())

The output of the above code:
Page 1
 photo page1.png

Page 2
 photo page2.png

Unfortunately, adding a new element occurs on a new page after calling the canvas’ save method. Luckily the “closure” of the pdf just creates a new page object. A much larger slice of documentation by reportlab goes over writing a document in more detail. The documentation includes alignment and other factors. Alignments are provided when adding an object to a page.

Manipulating a PDF
Manipulation is handled by PyPDF2. PyPDF2 allows for deletion of pages, insertion of pages, and creation of blank pages. The author of pyPDF goes over this in depth in his review.

This code repeats the previous pages twice in a new pdf. It is also possible to merge (overlay) pdf pages.

    from PyPDF2 import PdfFileWriter,PdfFileReader

    # pdf files must be opened in binary mode
    pdf1=PdfFileReader(open("C:/Users/andy/Documents/temp.pdf",'rb'))
    pdf2=PdfFileReader(open("C:/Users/andy/Documents/temp.pdf",'rb'))
    writer = PdfFileWriter()
    
    # add the page to itself
    for i in range(0,pdf1.getNumPages()):
         writer.addPage(pdf1.getPage(i))
    
    for i in range(0,pdf2.getNumPages()):
         writer.addPage(pdf2.getPage(i))
    
    # write to file
    with open("destination.pdf", "wb") as outfp:
        writer.write(outfp)

Overall Feedback
Overall, PyPDF is useful for merging and changing existing documents in terms of the way they look, and reportlab is useful in creating documents from scratch. PyPDF deals mainly with the objects quickly and effectively and reportlab allows for in-depth pdf creation. In combination, these tools rival others such as Java’s PdfBox and even exceed it in ways. However, pdfMiner is a better extraction tool.



Getting Started Extracting Tables With PDFMiner

PDFMiner has evolved into a terrific tool. It allows direct control of pdf files at the lowest level, allowing for direct control of the creation of documents and extraction of data. Combined with document writers, recognition and image manipulation tools, and a little math magic, it offers the power of commercial tools for all but the most complex tasks. I plan on writing on the use of OCR, Harris corner detection, and contour analysis in OpenCV, homebrew code, and tesseract later.

However, there is little in the way of documentation beyond basic extraction and no Python package listing. Basically, the methods are discoverable but not listed in full. In fact, existing documentation consists mainly of examples despite the many different modules and classes designed to complete a multitude of tasks. The aim of this article, one in a hopefully two-part series, is to help with extraction of information. The next step is the creation of a pdf document using a tool such as pisa or reportlab, since PdfMiner performs extraction.

The Imports
There are several imports that will nearly always be used for document extraction. All are under the main pdfminer import. The imports can get quite large.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
import pdfminer.layout
from pdfminer.layout import LAParams,LTTextBox,LTTextLine,LTFigure,LTImage,LTRect,LTTextLineHorizontal,LTTextBoxHorizontal
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import PDFPageAggregator,TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

Some imports are meant to perform extraction and others are meant to check for and support the extraction of different types.



Visual Representation of Outputs

The following image is taken from pdfminer’s limited documentation.

Source: Carleton University

Imports for Extraction
The following table goes over the imports that perform actions on the document.

Import | Description
PDFParser | The parser class normally passed to the PDFDocument that helps obtain elements from the PDFDocument.
PDFResourceManager | Helps with aggregation. Performs some tasks between the interpreter and device.
PDFPageInterpreter | Obtains pdf objects for extraction. With the ResourceManager, it changes page objects to instructions for the device.
PDFPageAggregator | Takes in LAParams and PDFResourceManager for getting text from individual pages.
PDFDocument | Holds the parser and allows for direct actions to be taken.
PDFDevice | Writes instructions to the document.
LAParams | Helps with document extraction.

Imports that act as Types
Other imports act as a means of checking against types and utilizing the other classes’ properties. The PDFDocument contains a variety of pdf objects which hold their own information. That information includes the type, the coordinates, and the text displayed in the document. Images can also be handled.

The objects include the layout types imported above, such as LTTextBox, LTTextLine, LTTextBoxHorizontal, LTTextLineHorizontal, and LTFigure.

These types are useful for pulling information from tables as explained by Julian Todd.

Creating the Document
Creating the document requires instantiating each of the parts present in the diagram above. The order for setting up the document is to create a parser and then the document with the parser. The resource manager and LAParams are accepted as arguments by the PDFPageAggregator device used for this task. The PDFPageInterpreter accepts the resource manager and the aggregator. This code is typical of all parsers and also appears as part of the pdf writers.

StringIO will make extraction run more quickly. The resulting object’s code is written in C.

        cstr=StringIO()
        with open(fpath,'rb') as fp:
            cstr.write(fp.read())
        cstr.seek(0)
        doc=PDFDocument(PDFParser(cstr))
        rsrcmgr=PDFResourceManager()
        laparams=LAParams()
        device=PDFPageAggregator(rsrcmgr,laparams=laparams)
        interpreter=PDFPageInterpreter(rsrcmgr,device)
        
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
        
            layout=device.get_result()
            self.parsePage(layout)

The get_result() method returns the layout of the page that was just processed. The results are passed to the parsePage method. For pure text extraction, the contents of a StringIO can be read instead with .getvalue().

The PDF Text Based Objects
The layout received from get_result() parses the strings into separate objects. These objects have several key components. They are the type, the coordinates (startingx, startingy, endingx, endingy), and the content.

The type can be found using type(object) and compared to the direct type (e.g. type(object)==LTRect). When the object is a rectangle, the comparison to LTRect returns True.

Getting the Output
Output is obtained and parsed through a series of method calls. The following example shows how to extract content.

        # reverse the page's objects so that pop() returns them in order
        objstack=list(reversed(layout._objs))
    
        tcols=[]
        
        while objstack:
            b=objstack.pop()
            print(b)
            if type(b) in [LTFigure, LTTextBox, LTTextLine, LTTextBoxHorizontal]:
                objstack.extend(reversed(b._objs)) 
            elif type(b) == LTTextLineHorizontal:
                tcols.append(b)

This code uses a plain list as the stack. Python lists already provide pop(), so no dedicated data structure from the collections package (import collections) is needed.

This example is a modification of Julian Todd’s code since I could not find solid documentation for pdfminer. It takes the objects from the layout, reverses them since they are placed in the layout as if it were a stack, and then iterates down the stack, finding anything with text and expanding it or taking text lines and adding them to the list that stores them.

The resulting list (tcols) looks much like other pure extractions that can be performed in a variety of tools including Java’s pdfbox, pypdf, and even pdfminer. However, each object also carries its position in bbox (the bounding box coordinate list), and its text is accessible from .get_text().

Images

Images are handled using the LTImage type which has a few additional attributes in addition to coordinates and data. The image contains bits, colorspace, height,imagemask,name,srcsize,stream, and width.

Extracting an image works as follows:

if type(b) == LTImage:
     imgbits=b.bits

PDFMiner only seems to extract jpeg objects. However, xpdf extracts all images.

A more automated and open source solution would be to use subprocess.Popen() to call a java program that extracts images to a specific or provided folder using code such as this (read the full article).

import shlex
import subprocess
# capture the extractor's output and wait for it to finish
pipe=subprocess.Popen(shlex.split("java -jar myPDFBoxExtractorCompiledJar /home/username/Documents/readable.pdf  /home/username/Documents/output.png"),stdout=subprocess.PIPE) 
stdout,_=pipe.communicate()

Handling the Output
Handling the output is fairly simple and forms the crux of this article’s benefits, besides combining a variety of resources in a single place.

Just iterate down the stack and pull out the objects as needed. It is possible to form the entire structure using the coordinates. The bounding box method allows for objects to be input in a new data structure as they appear in any pdf document. With some analysis, generic algorithms are possible. Given the lack of documentation, it may be a good idea to write some exploratory code first.
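As one way of forming structure from the coordinates, the following sketch groups text lines into table rows by their y coordinate. It does not use pdfminer itself: Line is a hypothetical stand-in that only mimics the bbox and text of a text line object, the tolerance value is arbitrary, and the sketch assumes that y grows upward on the page as in PDF coordinates.

```python
from collections import namedtuple

# hypothetical stand-in for a pdfminer text line: only bbox and text
Line = namedtuple("Line", ["bbox", "text"])


def group_rows(lines, tolerance=2.0):
    """Group lines whose starting y (bbox[1]) is within tolerance."""
    rows = []
    # larger y is higher on the page, so sort descending to read downward
    for line in sorted(lines, key=lambda item: -item.bbox[1]):
        if rows and abs(rows[-1][0].bbox[1] - line.bbox[1]) <= tolerance:
            rows[-1].append(line)
        else:
            rows.append([line])
    # order each row left to right by starting x (bbox[0])
    return [sorted(row, key=lambda item: item.bbox[0]) for row in rows]


lines = [
    Line((176.0, 500.2, 240.0, 510.2), "14-123"),
    Line((21.0, 500.0, 150.0, 510.0), "Smith"),
    Line((21.0, 480.0, 150.0, 490.0), "Jones"),
    Line((176.0, 479.5, 240.0, 489.5), "14-456"),
]
table = [[cell.text for cell in row] for row in group_rows(lines)]
```

With this sample data, table holds one list per row, each ordered from the leftmost cell to the rightmost.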

The following extracts specific columns of an existing pdf. The bounding box list/array is set up as follows. bbox[0] is the starting x coordinate, bbox[1] is the starting y coordinate, bbox[2] is the ending x coordinate, and bbox[3] is the ending y coordinate.

# Assumption: each column spans from its own starting x up to the next
# column's starting x; adjust the bounds to fit the document at hand.
records,cases,dates,times,types,locations,attorneys=self.convertToDict([
    [x for x in tcols if float(x.bbox[0]) <= 21.0 and "Name\n" not in x.get_text()],
    [x for x in tcols if 176.0 <= x.bbox[0] < 257.0 and "Case\n" not in x.get_text()],
    [x for x in tcols if 257.0 <= x.bbox[0] < 307.0 and "Date\n" not in x.get_text()],
    [x for x in tcols if 307.0 <= x.bbox[0] < 354.0 and "Time\n" not in x.get_text()],
    [x for x in tcols if 354.0 <= x.bbox[0] < 607.0 and "Type\n" not in x.get_text()],
    [x for x in tcols if 607.0 <= x.bbox[0] < 645.0 and "Location\n" not in x.get_text()],
    [x for x in tcols if x.bbox[0] >= 645.0 and "Attorney\n" not in x.get_text()],
])

This code uses Python's list comprehensions. The reason for the inequalities is that slight differences exist in the placement of objects. The newline escape character represents an underline in this case.
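Another way to absorb those slight differences is to match each starting x against a known column start with a tolerance. The sketch below is only an illustration: Line is a hypothetical stand-in for a text line object, split_columns is a made-up helper, and the tolerance of 1.5 is an arbitrary choice.

```python
from collections import namedtuple

# hypothetical stand-in for a pdfminer text line: only bbox and text
Line = namedtuple("Line", ["bbox", "text"])


def split_columns(lines, starts, tolerance=1.5):
    """Sort line text into columns by starting x, skipping header cells."""
    columns = {name: [] for name in starts}
    for line in lines:
        for name, start in starts.items():
            if abs(line.bbox[0] - start) <= tolerance:
                text = line.text.strip()
                # the header cell repeats the column name, so leave it out
                if text != name:
                    columns[name].append(text)
                break
    return columns


lines = [
    Line((20.6, 520.0, 150.0, 530.0), "Name\n"),
    Line((21.0, 500.0, 150.0, 510.0), "Smith\n"),
    Line((176.4, 520.0, 240.0, 530.0), "Case\n"),
    Line((175.8, 500.0, 240.0, 510.0), "14-123\n"),
    Line((300.0, 500.0, 340.0, 510.0), "stray\n"),
]
columns = split_columns(lines, {"Name": 21.0, "Case": 176.0})
```

Lines that fall near no known column start, such as the stray one above, are left out.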

Pure Text Extraction

In order to see how to perform pure text extraction and move to a better understanding of the code, analyze the following code.

       with open(fpath,'rb') as fp:
            doc=PDFDocument(PDFParser(fp))
            rsrcmgr=PDFResourceManager()
            retstr=StringIO()
            laparams=LAParams()
            codec='utf-8'
            device=TextConverter(rsrcmgr,retstr,codec=codec,laparams=laparams)
            interpreter=PDFPageInterpreter(rsrcmgr,device)
            for page in PDFPage.create_pages(doc):
                interpreter.process_page(page)
            # retstr accumulates the text of every processed page, so it
            # is read once after the loop
            lines=retstr.getvalue()
            return lines

Conclusion
PdfMiner is a useful tool for reading pdfs along with their actual formatting. The tool is flexible and can easily control strings. Extracting data is made much easier compared to some full text analysis, which can produce garbled and misplaced lines. Not all pdfs are made equal.