JavaCV Basics: Basic Image Processing

Here, we analyze some of the basic image processing tools in OpenCV and their use in GoatImage.

All code is available on GitHub under GoatImage. To fully understand this article, read the related articles and look at this code.

Select functions are exemplified here. In GoatImage, JavaDocs can be generated further explaining the functions. The functions explained are:

  • Sharpen
  • Contrast
  • Blur

Dilate, rotate, erode, min thresholding, and max thresholding are left to the code. Thresholding in OpenCV is described in depth with graphs and charts via the documentation.

Basic Processing in Computer Vision

Basic processing is key to successful recognition. Training sets come in a specific form, so pre-processing is usually required to ensure the accuracy and quality of a program. JavaCV and OpenCV are fast enough to apply such pre-processing in a variety of circumstances, improving algorithmic performance at a modest cost in speed. Each transform applied to an image takes time and memory but pays off handsomely if done correctly.

Kernel Processing

Most of these functions are linear transformations. A linear transformation uses a function to map one matrix to another (Ax = b). In image processing, a matrix kernel performs this mapping: the kernel is a small weighted matrix, and each output pixel is the weighted sum of the corresponding input pixel and its neighbors.
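
To make the idea concrete, here is a minimal pure-Python sketch of applying a kernel to a grayscale image stored as nested lists. It is an illustration only, not GoatImage or OpenCV code: the function name apply_kernel and the choice to leave border pixels unchanged are assumptions made for brevity.

```python
def apply_kernel(image, kernel):
    """Apply a square, odd-sized kernel to a 2D list of pixel values.

    Border pixels are copied unchanged (an assumption made for brevity).
    Like OpenCV's filter2D, this computes a correlation: the kernel is not flipped.
    """
    k = len(kernel) // 2
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(k, rows - k):
        for c in range(k, cols - k):
            total = 0
            for i in range(-k, k + 1):
                for j in range(-k, k + 1):
                    total += kernel[i + k][j + k] * image[r + i][c + j]
            # clamp to the valid 8-bit range
            out[r][c] = max(0, min(255, total))
    return out


sharpen_kernel = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
flat = [[10] * 3 for _ in range(3)]
bump = [[10, 10, 10], [10, 20, 10], [10, 10, 10]]

flat_result = apply_kernel(flat, sharpen_kernel)
bump_result = apply_kernel(bump, sharpen_kernel)
```

A flat region is left alone because the kernel weights sum to 1, while the brighter center pixel of the second patch is pushed further away from its neighbors.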

For an overview of image processing kernels, see Wikipedia.

Kernels may be generated in JavaCV.

  /**
    * Create a kernel from a two-dimensional array (write large kernels more understandably)
    * @param kernelArray      The square two-dimensional array with the kernel values as signed ints
    * @return                 The kernel mat
    */
  def generateKernel(kernelArray: Array[Array[Int]]):Mat={
    val m = if(kernelArray != null) kernelArray.length else 0
    if(m == 0 ){
      throw new IllegalStateException("Your Kernel Array Must be Initialized with values")
    }

    if(kernelArray(0).length != m){
      throw new IllegalStateException("Your Kernel Array Must be Square and not sparse.")
    }

    val kernel = new Mat(m,m,CV_32F,new Scalar(0))
    val ki = kernel.createIndexer().asInstanceOf[FloatIndexer]

    //copy each value into the float kernel
    for(i <- 0 until m){
      for(j <- 0 until m){
        ki.put(i.toLong,j.toLong,kernelArray(i)(j).toFloat)
      }
    }
    kernel
  }

More reliably, there is a function for generating a Gaussian Kernel.

    * Generate the square gaussian kernel. I think the pattern is a(0,0)=1 a(1,0) = n a(2,0) = n+2i with rows as a(2,1) = a(2,0) * n and adding two to the middle then subtracting.
    * However, there were only two examples on the page I found so do not use that without verification.
    * @param kernelMN    The m and n for our kernel matrix
    * @param sigma       The sigma to multiply by (kernel standard deviation)
    * @return            The resulting kernel matrix
  def generateGaussianKernel(kernelMN : Int, sigma : Double):Mat={

Sharpen with a Custom Kernel

Applying a kernel in OpenCV can be done with the filter2D method.


Here a sharpening kernel using the function above is applied.

  /**
    * Sharpen an image with a standard sharpening kernel.
    * @param image    The image to sharpen
    * @return         A new and sharper image
    */
  def sharpen(image : Image):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())

    val karr : Array[Array[Int]] = Array[Array[Int]](Array(0,-1,0),Array(-1,5,-1),Array(0,-1,0))
    val kernel : Mat = this.generateKernel(karr)
    //apply the kernel; a depth of -1 keeps the depth of the source
    filter2D(srcMat,outMat,-1,kernel)
    new Image(new IplImage(outMat),,image.itype)
  }


Contrast kicks up the color intensity in images by equation, equalization, or based on neighboring pixels.

One form of Contrast applies a direct function to an image:

    * Use an equation applied to the pixels to increase contrast. It appears that
    * the majority of the effect occurs from converting back and forth with a very
    * minor impact for the values. However, the impact is softer than with equalizing
    * histograms. Try sharpen as well. The kernel kicks up contrast around edges.
    * (maxIntensity/phi)*(x/(maxIntensity/theta))**0.5
    * @param image                The image to use
    * @param maxIntensity         The maximum intensity (numerator)
    * @param phi                  Phi value to use
    * @param theta                Theta value to use
    * @return
  def contrast(image : Image, maxIntensity : Double, phi : Double = 0.5, theta : Double = 0.5):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())

    val usrcMat = new Mat()
    val dest = new Mat(srcMat.rows(),srcMat.cols(),usrcMat.`type`())

    multiply(dest,(maxIntensity / phi))
    val fm = 1 / Math.pow(maxIntensity / theta,0.5)
    multiply(dest, fm)

    new Image(new IplImage(outMat),,image.itype)

Here the image is manipulated using matrix equations to form a new image where pixel intensities are improved for clarity.
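
The equation from the comment above can be checked on individual pixel values. The following pure-Python sketch evaluates (maxIntensity/phi)*(x/(maxIntensity/theta))**0.5 for a few gray levels using the default phi and theta from the signature. The function name and the clamping to maxIntensity are assumptions for illustration, not part of GoatImage.

```python
def contrast_pixel(x, max_intensity=255.0, phi=0.5, theta=0.5):
    """Evaluate (maxIntensity/phi) * (x / (maxIntensity/theta)) ** 0.5 for one pixel."""
    value = (max_intensity / phi) * (x / (max_intensity / theta)) ** 0.5
    # clamping to the maximum intensity is an assumption made for this sketch
    return min(max_intensity, value)


row = [0, 16, 64, 255]
adjusted = [round(contrast_pixel(x)) for x in row]
```

Because of the square root, dark values are lifted much more than bright ones, which is why the overall effect reads as a brighter, softer image rather than a hard stretch.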

Another form of contrast equalizes the image histogram:

  /**
    * A form of contrast based around equalizing image histograms.
    * @param image    The image to equalize
    * @return         A new Image
    */
  def equalizeHistogram(image : Image):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())
    //equalizeHist expects a single-channel 8-bit image
    equalizeHist(srcMat,outMat)
    new Image(new IplImage(outMat),,image.itype)
  }

The JavaCV method equalizeHist is used here.
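
For readers who want to see what equalization does to the values, here is a small pure-Python sketch of the textbook method: build a histogram, accumulate it into a cumulative distribution, and remap each gray level through it. This is an illustration of the technique under that assumption; OpenCV's own implementation may differ in details such as rounding.

```python
def equalize(pixels, levels=256):
    """Equalize a flat list of integer gray values in [0, levels)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # cumulative distribution of gray values
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:  # a single gray level: nothing to spread
        return list(pixels)
    scale = (levels - 1) / (n - cdf_min)
    return [round((cdf[p] - cdf_min) * scale) for p in pixels]


dull = [100, 100, 101, 102, 102, 103]
stretched = equalize(dull)
```

The six nearly identical input values end up spread over the full 0 to 255 range, which is the stronger effect the contrast comment above alludes to.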


Blurring uses averaging to dull images.

Gaussian blurring uses a kernel derived from the Gaussian function. The kernel weights each neighboring pixel by its distance from the center rather than weighting all neighbors equally.
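
A Gaussian kernel can be sketched directly from the Gaussian function. The pure-Python example below samples exp(-(x^2 + y^2) / (2 * sigma^2)) on a square grid and normalizes the result so the weights sum to 1. The function name is hypothetical and this is not how GoatImage or OpenCV builds its kernels internally; it only illustrates the shape of the weights.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a size x size Gaussian kernel normalized to sum to 1 (size should be odd)."""
    half = size // 2
    raw = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            for x in range(-half, half + 1)]
           for y in range(-half, half + 1)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


kernel = gaussian_kernel(3, 1.0)
```

The center weight is the largest, edge neighbors get less, and corner neighbors the least. A larger sigma flattens the kernel toward the equal weighting of a box blur.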

    * Perform a Gaussian blur. The larger the kernel the more blurred the image will be.
    * @param image              The image to use
    * @param degree             Strength of the blur
    * @param kernelMN           The kernel height and width should match (for instance 5x5)
    * @param sigma              The sigma to use in generating the matrix
    * @param depth              The depth to use
    * @param brightenFactor     A factor to brighten the result by; applied only when greater than 0

    if(brightenFactor > 0){
      outImage = this.brighten(outImage,brightenFactor)
    }

A box blur uses a uniform kernel to blur, weighting every pixel in the neighborhood equally.

    * Perform a box blur and return a new Image. Increasing the factor has a significant impact.
    * This algorithm tends to be overly powerful. It wiped the lines out of my test image.
    * @param image   The Image object
    * @param factor  The value used for every entry of the 3x3 kernel
    * @param depth   The depth to use with -1 as default corresponding to image.depth
    * @return        A new Image
  def boxBlur(image : Image,factor: Int = 1,depth : Int = -1):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())

    //build kernel
    val kernel : Mat = this.generateKernel(Array(Array(factor,factor,factor),Array(factor,factor,factor),Array(factor,factor,factor)))

    //apply kernel
    filter2D(srcMat,outMat, depth, kernel)

    new Image(new IplImage(outMat),,image.itype)
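
One plausible reason the result looks overly strong is arithmetic: a 3x3 kernel filled with factor sums to 9 * factor, so each output pixel is scaled by that sum instead of being averaged. The pure-Python sketch below compares an unnormalized kernel of ones with a normalized one on a flat patch. It is an observation about the numbers under that assumption, with hypothetical names, not a statement about how GoatImage is meant to be used.

```python
def kernel_response(patch, kernel):
    """Weighted sum of a patch under a kernel of the same shape."""
    return sum(k * p
               for krow, prow in zip(kernel, patch)
               for k, p in zip(krow, prow))


patch = [[100] * 3 for _ in range(3)]
ones = [[1] * 3 for _ in range(3)]        # weights sum to 9
mean = [[1.0 / 9] * 3 for _ in range(3)]  # weights sum to 1

raw = kernel_response(patch, ones)
averaged = kernel_response(patch, mean)
```

The unnormalized response is nine times the pixel value and would saturate an 8-bit image, while the normalized kernel returns the neighborhood average. OpenCV's blur function performs the normalized version.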

Unsharp Masking

Once a blurred Mat is obtained, it is possible to perform an unsharp mask. The unsharp mask brings out certain features by subtracting a weighted copy of the blurred image from the original while taking into account an additional factor.

def unsharpMask(image : Image, kernelMN : Int = 3, sigma : Double = 60,alpha : Double = 1.5, beta : Double= -0.5,gamma : Double = 2.0,brightenFactor : Int = 0):Image={
    val srcMat : Mat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())
    val retMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())

    //using these methods allows the matrix kernel size to grow
    GaussianBlur(srcMat,outMat,new Size(kernelMN,kernelMN),sigma)

    var outImage : Image = new Image(new IplImage(outMat),,image.itype)

    if(brightenFactor > 0){
      outImage = this.brighten(outImage,brightenFactor)
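
The role of alpha, beta, and gamma can be illustrated on single pixel values. The sketch below assumes the three parameters are combined as a weighted sum, alpha * original + beta * blurred + gamma, which is the form OpenCV's addWeighted computes. That this is how the function above combines them is an assumption, since the combining step is not shown.

```python
def unsharp_pixel(original, blurred, alpha=1.5, beta=-0.5, gamma=2.0):
    """Weighted sum alpha*original + beta*blurred + gamma, clamped to 0..255."""
    value = alpha * original + beta * blurred + gamma
    return max(0, min(255, round(value)))


# An edge: the original jumps from 50 to 200, the blurred copy ramps smoothly.
original = [50, 50, 200, 200]
blurred = [50, 100, 150, 200]
sharpened = [unsharp_pixel(o, b) for o, b in zip(original, blurred)]
```

Away from the edge the values barely change, while on either side of the jump the dark pixel gets darker and the bright pixel brighter, which is what makes the edge look crisper.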



This article examined various image processing techniques.

JavaCV Basics: Splitting Objects

Here we put together functions from previous articles to describe a use case where objects are discovered in an image and rotated.

All code is available on GitHub under the GoatImage project.

Why Split Objects

At times, objects need to be tracked reliably, OCR needs to be broken down to more manageable tasks, or there is another task requiring splitting and rotation. Particularly, recognition and other forms of statistical computing benefit from such standardization.

Splitting allows object by object recognition, which may or may not improve accuracy depending on the data used to train an algorithm and even the type of algorithm used. Bayesian networks and neural networks such as RNNs can benefit from this task significantly.

Splitting and Rotating

The following function in GoatImage performs contouring to find objects, creates a minimum-area rectangle for each, and finally rotates objects based on their skew angle.

  /**
    * Split an image using an existing contouring function. Take each ROI, rotate it, and return the new Images with the original.
    * @param image              The image to split objects from
    * @param contourType        The contour type to use defaulting to CV_RETR_LIST
    * @param minBoxArea         Minimum box area to accept (-1 means everything and is default)
    * @param maxBoxArea         Maximum box area to accept (-1 means everything and is default)
    * @param show               Whether or not to show the image. Default is false.
    * @param xPosSort           Whether or not to sort the objects by their x position. Default is true. This is faster than a full sort
    * @return                   A tuple with the original Image and a List of split out Image objects, each paired with its BoundingBox and named by the original_itemNumber
    */
  def splitObjects(image : Image, contourType : Int=  CV_RETR_LIST,minBoxArea : Int = -1, maxBoxArea : Int = -1, show : Boolean= false,xPosSort : Boolean = true):(Image,List[(Image,BoundingBox)])={
    val imTup : (Image, List[BoundingBox]) = this.contour(image,contourType)

    var imObjs : List[(Image,BoundingBox)] = List[(Image,BoundingBox)]()

    var boxes : List[BoundingBox] = imTup._2

    //ensure that the boxes are sorted by x position
    if(xPosSort){
      boxes = boxes.sortBy(_.x1)
    }

    if(minBoxArea > 0){
      boxes = boxes.filter({x => (x.width * x.height) > minBoxArea})
    }

    if(maxBoxArea > 0){
      boxes = boxes.filter({x => (x.width * x.height) < maxBoxArea})
    }

    //get and rotate objects
    var idx : Int = 0
    for(box <-  boxes){
      val im = this.rotateImage(box.image,box.skewAngle)
      if(show){
        im.showImage(s"My Box ${idx}")
      }
      imObjs = imObjs :+ (im,box)
      idx += 1
    }

    (image,imObjs)
  }


Contours are filtered after sorting if desired. For each box, rotation is performed and the resulting image returned as a new Image.
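
The sort-then-filter step is independent of OpenCV and can be sketched in a few lines of pure Python. The Box tuple below is a hypothetical stand-in for GoatImage's BoundingBox, carrying only the fields the filter needs; as in the function above, a limit of -1 disables the corresponding filter.

```python
from collections import namedtuple

# Hypothetical stand-in for BoundingBox with only the fields used here.
Box = namedtuple("Box", ["x1", "width", "height"])


def select_boxes(boxes, min_area=-1, max_area=-1, x_sort=True):
    """Sort boxes by x position, then keep those within the area limits."""
    if x_sort:
        boxes = sorted(boxes, key=lambda b: b.x1)
    if min_area > 0:
        boxes = [b for b in boxes if b.width * b.height > min_area]
    if max_area > 0:
        boxes = [b for b in boxes if b.width * b.height < max_area]
    return boxes


found = [Box(40, 10, 10), Box(5, 2, 2), Box(20, 30, 30)]
kept = select_boxes(found, min_area=50, max_area=500)
ordered = select_boxes(found)
```

Area limits like these are a cheap way to discard specks of noise and page-sized contours before spending time on rotation.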


Here the splitObjects function of GoatImage is reviewed, revealing how the library and OpenCV split and rotate objects as part of standardization for object recognition and OCR.

JavaCV Basics: Cropping

The ROI code on the JavaCV example site is broken. Here we will look at cropping an image by defining a region of interest. The remaining JavaCV example code should work.

All code is available on GitHub under the GoatImage project.

Defining an ROI

Setting a Region of Interest (ROI) requires using the cvSetImageROI function, which takes an IplImage and a CvRect representing the region of interest.

cvSetImageROI(image, rect)

Putting it all Together By Cropping

Cropping takes our ROI and generates a new image fairly directly.

    * Crop an existing image.
    * @param image      The image to crop
    * @param x          The starting x coordinate
    * @param y          The starting y coordinate
    * @param width      The width
    * @param height     The height
    * @return           A new Image
  def crop(image : Image, x : Int, y : Int, width : Int, height : Int): Image={
    val rect = new CvRect(x,y,width,height)
    val uImage : IplImage = image.image.clone()
    cvSetImageROI(uImage, rect)
    //with the ROI set, cvGetSize returns the ROI size and cvCopy copies only that region
    val cropped : IplImage = cvCreateImage(cvGetSize(uImage),image.image.depth(),image.image.nChannels())
    cvCopy(uImage,cropped)
    new Image(cropped,,image.itype)
  }
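
Stripped of the OpenCV types, cropping is just copying a rectangular window of pixels. The pure-Python sketch below shows the same operation on an image held as nested lists; the bounds check is an addition for illustration and is not taken from GoatImage.

```python
def crop(image, x, y, width, height):
    """Return the width x height region whose top-left corner is (x, y)."""
    if x < 0 or y < 0 or y + height > len(image) or x + width > len(image[0]):
        raise ValueError("region of interest lies outside the image")
    return [row[x:x + width] for row in image[y:y + height]]


# A 4 x 5 test image whose pixel value encodes its row and column.
image = [[r * 10 + c for c in range(5)] for r in range(4)]
region = crop(image, 1, 2, 3, 2)
```

The result is a new list of rows, so later changes to the crop do not touch the original image, which mirrors the clone-then-copy approach of the function above.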


Simple cropping was introduced to rectify an issue with the ROI example from JavaCV.

JavaCV Basics: Rotating

Rotating an image is a common task. This article reviews how to rotate a matrix using JavaCV.

These tutorials utilize GoatImage. The Image object used in the code examples comes from this library.

Rotation Matrix

The rotation matrix is used to map from one pixel position to another. The matrix, shown below, uses trigonometric functions.
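
In its standard two-dimensional form, assuming a rotation by an angle \theta about the origin, the matrix maps a point (x, y) to (x', y'):

```latex
\begin{bmatrix} x' \\ y' \end{bmatrix}
=
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
```

With the y axis pointing up this rotates counter-clockwise. In image coordinates, where y grows downward, the same matrix appears to rotate in the opposite direction on screen.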


Rotation is a linear transformation. A linear transformation uses a function to map from one matrix to another. In image processing, a matrix (here the rotation matrix, elsewhere a kernel) is used to perform this mapping.

Rotation in JavaCV 3

Rotation in JavaCV 3 utilizes a generated rotation matrix and the warpAffine function. The function getRotationMatrix2D generates a two dimensional matrix using a center Point2f, an angle in degrees, and a scale.

    * Rotate an image by a specified angle using an affine transformation.
    * @param image      The image to rotate
    * @param angle      The angle to rotate by
    * @return           A rotated Image
  def rotateImage(image : Image,angle : Double):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())
    val cosv = Math.cos(angle)
    val sinv = Math.sin(angle)
    val width = image.image.width
    val height = image.image.height
    val cx = width/2
    val cy = height/2

    //(image.image.width*cosv + image.image.height*sinv, image.image.width*sinv + image.image.height*cosv);
    val rotMat : Mat = getRotationMatrix2D(new Point2f(cx.toInt,cy.toInt),angle,1)
    //map the source onto an output of the same size using the rotation matrix
    warpAffine(srcMat,outMat,rotMat,srcMat.size())

    new Image(new IplImage(outMat),,image.itype)
  }
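
To see what kind of matrix is handed to warpAffine, the following pure-Python sketch builds a 2x3 affine matrix using the formula given in the OpenCV documentation for getRotationMatrix2D, with the angle in degrees. The helper names are hypothetical and the code is a sketch for experimenting with the numbers, not a replacement for the library call.

```python
import math


def rotation_matrix_2d(cx, cy, angle_degrees, scale=1.0):
    """Build the 2x3 affine matrix for rotating about the point (cx, cy)."""
    a = scale * math.cos(math.radians(angle_degrees))
    b = scale * math.sin(math.radians(angle_degrees))
    return [[a, b, (1 - a) * cx - b * cy],
            [-b, a, b * cx + (1 - a) * cy]]


def transform(matrix, x, y):
    """Apply a 2x3 affine matrix to the point (x, y)."""
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])


m = rotation_matrix_2d(50, 40, 30)
center = transform(m, 50, 40)
quarter_turn = transform(rotation_matrix_2d(0, 0, 90), 1, 0)
```

The third column is what keeps the chosen center fixed: whatever the angle, the center point maps back onto itself, and every other pixel swings around it.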

The Angle

The angle reported for a minimum area rectangle in OpenCV, and thus JavaCV, is centered at -45 degrees due to the use of vertical contouring. If the reported angle is less than -45 degrees, adding 90 degrees will correct this offset.

val angle = if(minAreaRect.angle < -45.0) minAreaRect.angle + 90 else minAreaRect.angle


In this tutorial we reviewed the function in GoatImage for rotating images using OpenCV. The functions getRotationMatrix2D and warpAffine were introduced. Basic kernel processing was introduced as well.

JavaCV Basics: Contouring

Here, we look at contouring in JavaCV 3. Contouring discovers the boundaries in an image that stand out from the background. It is a key part of object tracking, rotation, and many other tasks. JavaCV and OpenCV allow for the creation of bounding boxes around objects discovered through contouring.

These tutorials utilize GoatImage. The Image object is from this library.

All code is available on GitHub under GoatImage.


Remember contouring in Calc3. The same principle is used in image processing. Contour lines have a constant value. In imaging, the value can be a certain intensity or color. This differs from contouring a shape, which uses values such as those obtained from the derivative of an equation.



JavaCV includes functions for finding contours. The MatVector is used to store the discovered contour lines. MatVector in JavaCV is used in place of the CvSeq from OpenCV.

 val chainType : Int = CHAIN_APPROX_SIMPLE //chaining explained below
 val contourType : Int = CV_RETR_LIST //return values described below 
 val contours = new MatVector()
 val hierarchy = new Mat()
 val contImage : IplImage = image.image.clone()
 findContours(new Mat(image.image.clone()), contours,contourType, chainType, new Point(0, 0))

Contour lines can be stored in the MatVector in several forms. These options are:

  • CV_RETR_EXTERNAL : Returns only external contours
  • CV_RETR_LIST : Returns a list of every contour including nested contours
  • CV_RETR_CCOMP : Returns contours organized by inner and outer contours
  • CV_RETR_TREE : Returns a hierarchical ordering of contours nested by tree level

Similarly, several forms of chaining are available with varying effects. Chaining defines the level of approximation used in estimating the points forming the contour line. Approximation types include:

  • CHAIN_APPROX_NONE : Store every point
  • CHAIN_APPROX_SIMPLE : Compress horizontal, vertical, and diagonal segments, storing only their endpoints
  • CHAIN_APPROX_TC89_L1 : Use a variant of the Teh-Chin chain approximation algorithm
  • CHAIN_APPROX_TC89_KCOS : Another variant of Teh-Chin

A paper is available describing the Teh-Chin algorithm.
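
The difference between storing every point and storing only segment endpoints can be shown with a toy contour. The pure-Python sketch below drops every point at which the direction of travel does not change. It illustrates the idea behind CHAIN_APPROX_SIMPLE under the assumption of a closed contour traced one pixel at a time; it is not OpenCV's implementation.

```python
def compress_contour(points):
    """Drop points that lie inside a straight run of a closed contour.

    Points are (x, y) tuples, each one pixel step from the previous one.
    """
    n = len(points)
    kept = []
    for i, (x, y) in enumerate(points):
        px, py = points[i - 1]        # wraps to the last point when i == 0
        nx, ny = points[(i + 1) % n]
        step_in = (x - px, y - py)
        step_out = (nx - x, ny - y)
        if step_in != step_out:       # direction changes: keep this corner
            kept.append((x, y))
    return kept


# Outline of a small square with every boundary point stored.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
corners = compress_contour(square)
```

Eight stored points shrink to the four corners. For large rectangular objects the saving in memory is substantial, which is why the simple approximation is a common default.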

Draw Image Contours

JavaCV contains a method for drawing your contours. Specifically, drawContours may be used from opencv_imgproc.

    * A test function that draws the contours found in the Image.
    * @param image
    * @param contourType
    * @return
  def drawImageContours(image : Image, contourType : Int = CV_RETR_LIST):(Image,Image,Long)={
    val dstMat : Mat = new Mat(image.image.clone())
    val srcMat = new Mat(image.image)
    val storage : CvMemStorage =cvCreateMemStorage(0)
    val contours = new MatVector()
    val hierarchy = new Mat()
    findContours(new Mat(image.image.clone()), contours,contourType, CHAIN_APPROX_SIMPLE, new Point(0, 0))
    val colorDst = new Mat(srcMat.size(), CV_8UC3, new Scalar(0))
    drawContours(colorDst, contours, -1  , new Scalar(255,0,0,0))
    (image,new Image(new IplImage(colorDst),,image.itype),contours.size())

Bounding Boxes and Min Area Rectangles

In this tutorial, bounding boxes and minimum area rectangles are obtained through JavaCV specific functions.

The bounding box is an upright Rect. Rect is the JavaCV counterpart of the cvRect. The top-left coordinate (x, y), width, and height are stored in the box. The center is calculable as x + width/2, y + height/2.

val idx : Int  = 0
val rect = boundingRect(contours.get(idx))

The minimum bounding rectangle is a RotatedRect, which may be tilted relative to the upright bounding rectangle. This object stores a center point, a size (width and height), and an angle.

val idx : Int = 0
val rotatedRect = minAreaRect(contours.get(idx))

To get a bounding circle, use the minEnclosingCircle function ported from OpenCV.
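
For intuition, the upright bounding box of a contour is simply the span of its x and y coordinates. The pure-Python sketch below computes it for a list of points, along with the center formula given above. The function names are hypothetical, and counting the width and height inclusively is an assumption of this sketch.

```python
def bounding_box(points):
    """Return (x, y, width, height) of the upright box around (x, y) points.

    (x, y) is the top-left corner; width and height count pixels inclusively.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x, y = min(xs), min(ys)
    return x, y, max(xs) - x + 1, max(ys) - y + 1


def center(box):
    """Center of a box: x + width/2, y + height/2."""
    x, y, w, h = box
    return x + w / 2, y + h / 2


contour = [(4, 2), (9, 3), (7, 8), (3, 6)]
box = bounding_box(contour)
middle = center(box)
```

A tilted object wastes a lot of space in such a box, which is where the minimum area rectangle and its angle become useful.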

Putting it all Together

The following code obtains contours and a list of bounding boxes.

    * Contour and return the bounding boxes and the slope of the found box. Boxes that overlap
    * will be combined to make a much bigger box.
    * @param image            The image to use in detection.
    * @param contourType      The type of contour to use, defaulting to CV_RETR_LIST, which gives everything. CV_RETR_EXTERNAL grabs only the external contours; CV_RETR_CCOMP returns and organizes inner and outer contours
    * @return           A tuple of the image and a list of tuples with bounding boxes and their bottom line slopes
  def contour(image : Image, contourType : Int =  CV_RETR_LIST):(Image,List[BoundingBox])={
    var minRects : List[BoundingBox] = List[BoundingBox]()
    val total : Int = 0
    val storage : CvMemStorage =cvCreateMemStorage(0)
    val contours = new MatVector()
    val hierarchy = new Mat()
    val contImage : IplImage = image.image.clone()
    findContours(new Mat(image.image.clone()), contours,contourType, CHAIN_APPROX_SIMPLE, new Point(0, 0))

    for(idx <- 0 until contours.size().toInt) {
      val clMat = new Mat()
      new Mat(image.image).copyTo(clMat)
      val clImage : IplImage = new IplImage(clMat)
      val boundRect = minAreaRect(contours.get(idx))
      val bbx = boundRect.boundingRect()
      val cropped : IplImage = cvCreateImage(cvGetSize(clImage),clImage.depth(),clImage.nChannels())
      val angle : Double = if(boundRect.angle() < -45.0) boundRect.angle + 90 else boundRect.angle()
      minRects = minRects :+ new BoundingBox(boundRect.boundingRect().x,boundRect.boundingRect().y,boundRect.boundingRect().width,boundRect.boundingRect().height,angle,new Image(cropped,s"${}_${idx}",image.itype))

    //setup bounding boxes


Here, we looked at contouring in JavaCV. JavaCV functions were used to find contours, draw them, and obtain bounding boxes.

JavaCV Basics: IplImage

JavaCV documentation is not always useful or is already out of date. To rectify this, I am creating some basic tutorials with links to the JavaDocs. This article explores loading and saving images and the basics behind an IplImage. These tutorials are written in Java and Scala.

These tutorials utilize GoatImage. The Image object is from this library.

Why Java and Scala

Java and Scala are popular, and a large share of the data science libraries that exist today are written in them. Tools such as Akka allow for the stream processing of images as well.

JavaCV and OpenCV

JavaCV is a set of bindings for the C and C++ APIs in OpenCV. Method calls are similar to their OpenCV equivalents.


The JNI is used to allow access to the OpenCV API in Java.

The Ipl Image

The IplImage stores image bytes, channel numbers, sizes and other bits of useful information. An IPL image exists at org.bytedeco.javacpp.opencv_core.IplImage.

Load an Ipl Image

Preferably, an IplImage is loaded from a file.

import org.bytedeco.javacpp.opencv_imgcodecs.cvLoadImage
val impl : IplImage = cvLoadImage(fpath.getAbsolutePath)

The IplImage may be loaded from a matrix or mat, from a byte pointer, or may be specified later by using the no option constructor or the size of the raster as a Long.

import java.nio.ByteBuffer
import org.bytedeco.javacpp.BytePointer
import org.bytedeco.javacpp.opencv_core.IplImage

val bp : BytePointer = new BytePointer(ByteBuffer.wrap(bytes))
this.image = new IplImage(bp)

The BytePointer is the native pointer to an underlying C based char array. A BytePointer is loaded using a ByteBuffer. This type of Pointer can be converted to a ByteBuffer using asBuffer().

Unfortunately, the BytePointer method is not always reliable. An alternate way to load an IplImage would be to use a BufferedImage.

val m : Mat = new Mat(image.getHeight,image.getWidth,image.getType,new BytePointer(ByteBuffer.wrap(image.getRaster.getDataBuffer.asInstanceOf[DataBufferByte].getData)))
new Image(new IplImage(m),name,itype)

Here, the Mat class is used to create a new image matrix.

Image Matrix

The image matrix, Mat, located under the opencv_core is a data structure wrapped around a pointer to an array of bytes with related methods including a set of mathematical operations.

Cross products, dot products, inverse, and the ability to divide values, a non-linear operation, are all available.

import org.bytedeco.javacpp.opencv_core.cvLoad
import org.bytedeco.javacpp.opencv_core.{Mat,IplImage}

new IplImage(new Mat(cvLoad(fpath.getAbsolutePath)))

The conversion process is shown above in addition to how to load the Mat.

Load an Image

Loading an image or image matrix is possible using the methods from org.bytedeco.javacpp.opencv_imgcodecs. The methods take a BytePointer object or file name.

import org.bytedeco.javacpp.opencv_imgcodecs._

val path : File = new File("out/image.jpg")

val mpath : File =new File("out/imageMat.mat")

Save an IplImage

The save functions exist in the opencv_core and opencv_imgcodecs libraries. Most are static, as direct wrappers around the OpenCV library. They can be imported directly.

import org.bytedeco.javacpp.opencv_core._

val directory : File = new File("out/")
cvSave(new File(directory.getAbsolutePath,,myIplImage)

Overloads of cvSave include (filename : String, image : IplImage), and (filename : String, struct_ptr : Pointer).

Useful Functions

Some useful functions exist for the base IplImage. These include:

  • imageData() – Get the BytePointer from the IplImage. The resulting structure can be viewed as a buffer with asBuffer()
  • width() – Get the image width
  • height() – Get the image height
  • nChannels() – Get the number of channels, with 3 corresponding to a color image (BGR order by default in OpenCV) and 4 to color with an alpha channel

Create an Image

At times, it is necessary to create an image from scratch. This is useful to avoid overwriting information. Doing so requires the cvCreateImage function. This function from opencv_core takes a CvSize (width and height), depth, and channels as arguments.

val src : IplImage = ....
val dst : IplImage = cvCreateImage(cvSize((src.width() * scale).toInt,(src.height * scale).toInt),src.depth,src.nChannels)


Here, we discussed the basic image classes and packages in OpenCV. Some useful functions were mentioned.

Headless Testing and Scraping with Java FX

There is a lot of JavaScript in the world today and there is a need to get things moving quickly. Whether testing multiple websites or acquiring data for ETL and/or analysis, a tool needs to exist that does not leak memory as much as Selenium. Until recently, Selenium was really the only option for WebKit, although JCEF and writing native bindings for Chromium have been options for a while. Java 7 and Java 8 have stepped into the void with the JavaFX tools. These tools can be used to automate scraping and testing where network calls for HTML, JSON, CSVs, PDFs, or what not are more tedious and difficult.

The FX Package

FX is much better than the television channel, with some exceptions. Java created a sleeker, WebKit-based alternative to Chromium. While WebKit suffers from some serious setbacks, JavaFX also incorporates nearly any part of the framework. Setting SSL handlers, proxies, and the like works the same way. Therefore, FX can be used to intercept traffic (e.g. directly stream incoming images to a file named by URL without making more network calls) and to present a nifty front end controlled by JavaScript and querying for components.


Ui4j is equally as nifty as the FX package. While FX is not capable of going headless without a lot of work, Ui4j takes the work out of such a project using Monocle or Xvfb. Unfortunately, there are some issues getting Monocle to run by setting -Dui4j.headless=true on the command line or using system properties after jdk1.8.0_20. Oracle removed Monocle from the JDK after this release and forced programs using it to move to OpenMonocle. However, xvfb-run -a works equally well. The -a option automatically chooses a server number. The GitHub site does claim compatibility with Monocle though.

On top of headless mode, the authors have made working with FX simple. Run JavaScript as needed, incorporate interceptors with ease, and avoid nasty waitFor calls and Selenese (an entire language within your existing language).


There is an alternative to Ui4j in TestFX. It is geared towards testing. Rather than using an Assert after calling or with ((String) page.executeScript("document.documentElement.innerHTML")), methods such as verifyThat exist. Combine with Scala and have a wonderfully compact day. The authors have also managed to get a workaround for the Monocle problem.

Multiple Proxies

The only negative side effect of FX is that multiple instances must be run to use multiple proxies. Java, and Scala for that matter, set one proxy per JVM. Luckily, both Java and Scala can spawn subprocesses. The lovely data friendly language that is Scala makes this task as simple as Process("java -jar myjar.jar -p my:proxy").!. Simply run the command, which blocks until complete and returns the exit status (see Futures for a non-blocking version), and use tools like Scopt to get the proxy and set it in a new Browser session. Better yet, take a look at my Scala macros article for some tips on loading code from a file (please don't pass it on the command line). RMI would probably be a bit better for large code, but it may be possible to better secure a file than compiled code using checksums.
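
The one-process-per-proxy pattern is not tied to Scala. As a rough sketch in Python, the code below builds one command line per proxy, starts all of them, and collects the exit codes. The jar name and the -p flag come from the text above; the proxy addresses are hypothetical, and the demonstration at the end runs a harmless stand-in command rather than a real jar.

```python
import subprocess
import sys


def build_command(proxy, jar="myjar.jar"):
    """Command line for one JVM bound to one proxy (-p flag as in the text)."""
    return ["java", "-jar", jar, "-p", proxy]


def run_all(commands):
    """Start every command, wait for each to finish, and return the exit codes."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in processes]


proxies = ["host1:8080", "host2:8080"]  # hypothetical proxy addresses
commands = [build_command(p) for p in proxies]

# Demonstration with a harmless stand-in command instead of a real jar.
demo_codes = run_all([[sys.executable, "-c", "pass"] for _ in proxies])
```

Starting all processes before waiting on any of them lets the JVMs run side by side, which is the point of splitting the work across proxies.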


Throw out Selenium, get rid of the extra Selenese parsing, and get Ui4j or TestFX for WebKit testing. Sadly, it does not work with Gecko, so Chromium is needed to replace these tests and obtain such terrific options as --ignore-certificate-errors. There are cases where fonts in the SSL will wreak havoc before you can even handle the incoming text no matter how low level you write your connections. For simple page pulls, stick to Apache HTTP Components, which contains a fairly fast asynchronous thread pool with mid-tier RAM usage, usable in Java or Scala. Sorry for the brevity folks but I tried to answer a question or two that was not in tutorials or documentation. Busy!

Python PDF 3: Writing With HTML and XML

At last, I have discovered the potent mixture of Jinja, WeasyPrint, and Pandas. Mixing these tools with matplotlib and Python image modules yields a way to write PDF documents with relative ease and with the styling help of HTML. It is also possible to use a tool like xmltopdf for generating PDF files from XML. Previous posts dealt with this using a more complicated tool, PyPDF2.

A Basic HTML Template

In this tutorial, I am using Jinja to create tables. My tables will not have much in the way of styling, but it is also possible to add styles with Jinja or by using a tool such as Django-Tables2. Both tools use a template syntax very similar to Django's.

A template is needed in order to generate HTML pages for conversion to pdf format. Jinja follows a basic format with double curly braces used to mark where items are entered encapsulating the title of the property.

<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="UTF-8">
<title>{{ title }}</title>
</head>
<body>
<h1>Weekly Summary Report</h1>
{{ summary_pivot_table }}

<h1>Frequency Report</h1>
{{ frequency_table }}

<h1>Weekly Source Reports</h1>
{{ source_pivot_table }}
</body>
</html>

In this case, there is a title and three reports. It would be easy to add CSS classes and generate different styles using div tags. These will be converted by WeasyPrint later.

Writing to the Template

Writing to a template with Jinja requires using the dictionary data structure.

adminVars={"title":"Weekly Statistics","frequency_table":freqFrame.to_html(),"summary_pivot_table":sframe.to_html(),"source_pivot_table":tframe.to_html()}

Generating Data

Generating data is simple with Pandas. This is especially true with databases. One only needs to connect to a database using a SQLAlchemy engine and perform any necessary query. It is also possible to concatenate as many queries as necessary to generate a table.

import sqlalchemy
import pandas

#create alchemy engine      

dsn='postgresql+psycopg2://'+cfp.getVar("db","user","string")+":"+cfp.getVar("db", "passw","string")+"@"+cfp.getVar("db","host","string")+":"+cfp.getVar("db","port","string")+"/"+cfp.getVar("db","dbname","string")
engine=sqlalchemy.create_engine(dsn)

#get totals table
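
Concatenating the connection string by hand breaks as soon as the password contains a character such as @ or /. The standard-library sketch below escapes the user name and password before assembling the URL. The function name and the credential values are hypothetical; in the code above the values would come from the configuration object instead.

```python
from urllib.parse import quote_plus


def build_dsn(user, password, host, port, dbname):
    """Assemble a postgresql+psycopg2 URL, escaping special characters."""
    return "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, dbname)


# Hypothetical values for illustration only.
dsn = build_dsn("report", "p@ss/word", "localhost", 5432, "stats")
```

The escaped string can be passed to sqlalchemy.create_engine in place of the hand-built one.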

Concatenation is not difficult either using the concat function.

pandas.concat([pandas.read_sql_query(query,engine) for query in tables])

Columns that are missing from a given frame are filled with NaN values.

Performing Basic Operations on Dataframes

Performing operations on dataframes is easy with numpy or scipy.

import numpy

#operate on tframe from above

Dataframes themselves have operations that can be performed on them, and these use numpy.

#tframe from above

A list of operations is provided in the Pandas documentation.

More complicated operations may require unpacking the values or using generator functions.

Using WeasyPrint

Once the resources and template are prepared, simply call on WeasyPrint to convert the HTML resulting from the template to a PDF.

An extra import is needed to fetch resources such as images from links embedded within the HTML.

Otherwise, generate a pandas data frame, convert the frame to HTML, place it as the value attached to the appropriate template key in your dictionary, and then convert. The example code uses SQLAlchemy to fetch resources from a PostgreSQL database.

from crawleraids.ConfigVars import Config
from jinja2 import Environment,FileSystemLoader
import pandas
from weasyprint import HTML,default_url_fetcher
import sqlalchemy

def fetchURL(url):
    """Provide a resource obtainer for getting urls to WeasyPrint"""
    return default_url_fetcher(url)

def generatePDF(fpath):
    """Generate the pdf."""
    #create alchemy engine (cfp is assumed to be a Config instance created elsewhere)
    dsn='postgresql+psycopg2://'+cfp.getVar("db","user","string")+":"+cfp.getVar("db","passw","string")+"@"+cfp.getVar("db","host","string")+":"+cfp.getVar("db","port","string")+"/"+cfp.getVar("db","dbname","string")
    #get totals table
    #get summary stats table
    #get frequencies
    #get the resource loader
    #fill the template
    vars={"title":"Weekly Statistics","frequency_table":freqFrame.to_html(),"summary_pivot_table":sframe.to_html(),"source_pivot_table":tframe.to_html()}
    #use WeasyPrint to convert to pdf
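To make the fill-and-convert steps concrete, here is a minimal sketch under a few assumptions: the template is a hypothetical inline string rather than a file found by a FileSystemLoader, the frame is made-up data rather than a query result, and the helper names render_report and write_pdf are mine. The WeasyPrint calls used are HTML(string=..., base_url=...) and write_pdf.

```python
import pandas
from jinja2 import Template

# Hypothetical inline template; the article loads its template from disk instead.
TEMPLATE = Template(
    "<html><body><h1>{{ title }}</h1>{{ source_pivot_table }}</body></html>"
)

def render_report(frame, title="Weekly Statistics"):
    """Render the template with a dataframe converted to an HTML table."""
    return TEMPLATE.render(title=title, source_pivot_table=frame.to_html())

def write_pdf(html, fpath, base_url=None):
    """Convert rendered HTML to a PDF file with WeasyPrint."""
    # Imported here so the rendering step can be tried without WeasyPrint installed.
    from weasyprint import HTML
    HTML(string=html, base_url=base_url).write_pdf(fpath)

# Made-up data standing in for the totals query.
tframe = pandas.DataFrame({"source": ["a", "b"], "total": [10, 20]})
html = render_report(tframe)
```

Calling write_pdf(html, "report.pdf") would then produce the file. A bare Template does not escape html, which is why the table markup survives; with an autoescaping Environment the table would need the safe filter.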

All of the power of Pandas is now at the disposal of the programmer along with anything that can be embedded in a url.

Generating and Saving Graphs with Pandas and Matplotlib or PyPlot

It is possible to embed graphs into a pdf by saving them as images.

Obviously, the folks behind pdf allow most things made of bytes to be placed in objects in a PDF (a pdf is a series of pdf objects, much like xml, with encoded byte streams as the content). See my magic numbers post and try to parse or write your own image to a pdf if you really want to dive into the subject.

Generating graphs is simple with Pandas. Just make sure that the image url in the template matches the path of the saved graph.

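import matplotlib.pyplot as plt
from pandas import DataFrame

data=[[1,3,5,2],[1,3,4]] #perform operations on the data to transform the graph. Each array is a new plot line.
df = DataFrame(data,index=['PlotA','PlotB']).T #transpose so that each array becomes a column, and so a line
ax = df.plot() #draw one line per column
fig = ax.get_figure()
fig.savefig('graph.png') #also can save as own pdf to be merged as described in an earlier post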

It is possible to do this directly with pyplot as well.
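A sketch of the pyplot-only route, with the same two made-up series. The Agg backend and the temporary output path are my assumptions, chosen so the script runs on a server without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Each list becomes one plot line.
series = {"PlotA": [1, 3, 5, 2], "PlotB": [1, 3, 4]}

fig, ax = plt.subplots()
for name, values in series.items():
    ax.plot(range(len(values)), values, label=name)
ax.legend()

# The output path is an assumption; the image url in the template must point at it.
graph_path = os.path.join(tempfile.gettempdir(), "graph.png")
fig.savefig(graph_path)
plt.close(fig)
```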

Using Flask to Create a PDF Web Server

Judging from comments and questions, pdf servers are a frequent request. The clunkiness of Spring can now be replaced easily with the combination of the mentioned tools and the Flask web framework. These tools allow for the quick and easy creation of a pdf web server. However, asyncore with socket, Spring with Java-based tools, or other tools will still be needed if the plan is to use something akin to the proxy pattern, a sad state of affairs.

To create the server, simply create a function with a decorator specifying the path, much as an annotated method would in Spring.

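from io import BytesIO

from flask import Flask
from flask import send_file
app = Flask(__name__)

def otherFunc():
    pass #other helpers or routes go here

@app.route("/pdf") #example path; use whatever path the server should expose
def generatePDF():
    #code to generate PDF........
    #pdf is expected to be a file path or a file-like object such as BytesIO
    return send_file(pdf, download_name='file.pdf') #attachment_filename before Flask 2.0

if __name__ == "__main__":
    app.run()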

Weasyprint also includes a way to incorporate pre-generated pdfs from within the same application.

Morning Joe/Python PDF Part 3: Straight Optical Character Recognition

*Due to time constraints, I will be publishing large articles on the weekends with a daily small article for the time being.

Now, we start to delve into PDF images, since the pdf text processing articles are quite popular. Not everything in a PDF can be stripped using straight text conversion, and the biggest headache is the PDF image. Luckily, our “do no evil” (heavy emphasis here) friends came up with Tesseract, which, with training, is also quite good at breaking their own paid Captcha products, to my great amusement and my company’s profit.

A plethora of image pre-processing libraries and a bit of post-processing are still necessary when completing this task. Images must be of high enough contrast and large enough to make sense of. Basically, the algorithm consists of pre-processing an image, saving an image, using optical character recognition, and then performing clean-up tasks.

Saving Images Using Software and by Finding Stream Objects

For Linux users, saving images from a pdf is best done with Poppler Utils, which comes with Fedora, CentOS, and Ubuntu distributions and saves images to a specified directory. The command format is pdfimages [options] [pdf file path] [image root]. Options are included for specifying a starting page [-f int], an ending page [-l int], and more. Just type pdfimages into a Linux terminal to see all of the options.

pdfimages -j /path/to/file.pdf /image/root/

To see if there are images, just type pdfimages -list followed by the path to the pdf file.

Windows users can use a similar command with the open source XPdf.

It is also possible to use the magic numbers I wrote about in a different article to find the images: iterate across the pdf stream objects, find the starting and ending bytes of an image, and write them to a file with open().write(). A stream object is the way Adobe embeds objects in a pdf and is represented below. The find method can be used to ensure the streams exist, and the regular expression command re.finditer("(?mis)(?<=stream).*?(?=endstream)",pdf) will find all of the streams.


stream
....our gibberish looking bytes....
endstream
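Here is a small sketch of the stream search, run on a hand-made byte string rather than a real pdf. It deliberately consumes the stream and endstream keywords instead of using lookarounds, since a lookbehind for stream is also satisfied at the tail of endstream and can return the junk between objects. The helper name find_streams is mine.

```python
import re

# Consume the keywords instead of using lookarounds: a lookbehind for "stream"
# is also satisfied at the tail of "endstream", which would yield spurious matches.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)

def find_streams(pdf_bytes):
    """Return the raw (possibly still compressed) body of every stream object."""
    return [match.group(1) for match in STREAM_RE.finditer(pdf_bytes)]

# Tiny hand-made stand-in for the bytes of a pdf read with open(path, "rb").read().
sample = (
    b"1 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
    b"2 0 obj\n<< /Length 5 >>\nstream\nworld\nendstream\nendobj\n"
)
streams = find_streams(sample)
```

Each returned body can then be checked against the image magic numbers before being written out with open(path, "wb").write(body).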



Python offers a variety of extremely good tools via Pillow that eliminate the need for hard-coded pre-processing, as can be found in my image tools for Java.

Some of the features that Pillow includes are:

  1. Blurring
  2. Contrast
  3. Edge Enhancement
  4. Smoothing

These classes should work for most pdfs. For more, I will be posting a decluttering algorithm in a Captcha Breaking post soon.
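The four features listed above map onto Pillow's ImageFilter and ImageEnhance classes. The sketch below chains them in one plausible order; the order, the contrast factor, and the helper name preprocess are guesses to be tuned per document, not a recipe.

```python
from PIL import Image, ImageEnhance, ImageFilter

def preprocess(image, contrast=2.0):
    """Apply the listed Pillow features in turn; the order and factor are guesses to tune."""
    image = image.convert("L")                              # grayscale simplifies OCR input
    image = image.filter(ImageFilter.BLUR)                  # 1. blurring
    image = ImageEnhance.Contrast(image).enhance(contrast)  # 2. contrast
    image = image.filter(ImageFilter.EDGE_ENHANCE)          # 3. edge enhancement
    image = image.filter(ImageFilter.SMOOTH)                # 4. smoothing
    return image

# A blank stand-in image; in practice use Image.open on a file saved by pdfimages.
page = Image.new("RGB", (64, 32), "white")
cleaned = preprocess(page)
```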

For resizing, OpenCV includes a module that avoids pixelation with a bit of math magic.

#! /usr/bin/python

import cv2



OCR with Tesseract

With a subprocess call or the use of pytesser (which includes faster support for Tesseract by implementing a subprocess call and ensuring reliability), it is possible to OCR the document.

#! /usr/bin/python

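from PIL import Image
import pytesser

image=Image.open("fpath") #fpath is the path to a saved image
string=pytesser.image_to_string(image) #the recognized text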
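For the plain subprocess route, a sketch along these lines should do, assuming the tesseract binary is installed and on the PATH. The helper names and the image file name are hypothetical; passing stdout as the output base tells tesseract to print the text rather than write a .txt file.

```python
import subprocess

def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list; 'stdout' asks tesseract to print the text instead of writing a file."""
    return ["tesseract", image_path, "stdout", "-l", lang]

def ocr_image(image_path, lang="eng"):
    """Run the tesseract binary (assumed to be on the PATH) and return the recognized text."""
    result = subprocess.run(build_tesseract_command(image_path, lang),
                            capture_output=True, text=True, check=True)
    return result.stdout

# Hypothetical file name of the kind pdfimages produces.
command = build_tesseract_command("page-000.jpg")
```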


If the string comes out as complete garbage, try using the pre-processing modules to fix it or look at my image tools for ways to write custom algorithms.


Unfortunately, Tesseract is not a commercial grade product for performing PDF OCR. It will require post processing. Fortunately, Python provides a terrific set of modules and capabilities for dealing with data quickly and effectively.

The regular expression module re, list comprehension, and substrings are useful in this case.

An example of post-processing would be (in continuance of the previous section):

import re

lines=string.split("\n") #string is the text returned by the OCR step

lines=[x for x in lines if "bad stuff" not in x]

results=[]

for line in lines:
    if re.search("pattern ",line):
        results.append(re.sub("bad pattern","replacement",line))
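To show the same idea end to end, here is a runnable sketch on made-up OCR output. The noise line, the label: value pattern, and the O-for-zero repair are illustrative choices, not rules that fit every document.

```python
import re

# Made-up OCR output; the noise line and the misread digits are illustrative.
ocr_text = "Invoice No: 1O23\n~~~ ::: ~~~\nTotal: 4O.5O\nThank you"

def clean_ocr(text):
    """Drop noise lines, keep lines of interest, and repair a common misread."""
    lines = [line.strip() for line in text.split("\n")]
    # List comprehension: discard lines made up only of punctuation debris.
    lines = [line for line in lines if re.search(r"[A-Za-z0-9]", line)]
    results = []
    for line in lines:
        # Keep only 'label: value' lines, then fix a letter O sitting next to a digit.
        if re.search(r"^\w[\w ]*:", line):
            results.append(re.sub(r"(?<=\d)O|O(?=\d)", "0", line))
    return results

cleaned_lines = clean_ocr(ocr_text)
```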


It is definitely possible to obtain text using Tesseract from a PDF. Post-processing is a requirement though. For documents that are proving difficult, commercial software is the best solution, with SimpleOCR and ABBYY FineReader offering quality solutions. ABBYY offers the best quality in my opinion but comes at a high price, with a quote for an API for just reading PDF documents coming in at $5000. I have managed to use the Tesseract approach successfully at work, but the process is time consuming and not guaranteed.

Ending and Starting Bytes for Images

So, I needed to find the starting and ending bytes for images and I would like to save them somewhere. Why not here? Please let me know if I need to or can make changes to the table. If we can get the end of file bytes, it will make extraction and manipulation in documents easier.

The starting bytes are well documented but the end bytes are not.

Here is what I found regarding the common formats.

Image Type    | Start Bytes                             | End Bytes           | Start Word       | End Word
JPEG          | 0xff 0xd8                               | 0xff 0xd9           |                  |
PNG           | 0x89 0x50 0x4e 0x47 0x0d 0x0a 0x1a 0x0a | 0x49 0x45 0x4e 0x44 | –Specs–          | IEND
GIF           | 0x47 0x49 0x46 0x38                     | 0x3b                | GIF87a or GIF89a | ; [9 bit ending is 101h]
TIFF-Motorola | 0x4d 0x4d 0x00 0x2a                     |                     | MM               |
TIFF-Intel    | 0x49 0x49 0x2a 0x00                     |                     | II               |
PGM2          | 0x50 0x35                               |                     | P5               |
Plain PGM     | 0x50 0x32                               |                     | P2               |
PBM           | 0x50 0x31                               |                     | P1               |
BMP           | 0x42 0x4d                               |                     | BM               |

*A GIF also marks itself by its format (7a or 9a, forming the word GIF8[format] in its magic number), and the end appears to be 101h with an EOF of 0x3B, but this is a bit weird.
*The TIFF formats are apparently Big Endian for Motorola and Little Endian for Intel.
*The best statement I could find for a TIFF is that “Each strip ends with the 24-bit end-of-facsimile block (EOFB)” from a document referring to TIFF 1.0.
*A JPEG may also have the JFIF structure and be discoverable this way, since JFIF is usually the first noticeable part of the image when converted to a string. (JFIF: 0x4a 0x46 0x49 0x46)
*PGM comes in two formats, as does TIFF. While the TIFF differences are listed, PGM differs in that plain PGM stores one image and PGM2 (both .pgm) can store more than one. Both are grayscale formats; PBM is the pure black and white one.
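As a sketch of how these bytes can be put to work, the snippet below identifies a format from its starting bytes and carves a JPEG out of surrounding data. The signatures are the commonly documented ones for the formats in the table (a JPEG starts with 0xff 0xd8, and the P1 of a PBM is 0x50 0x31); the function names and the sample bytes are made up.

```python
# Commonly documented starting bytes, written as byte strings.
START_BYTES = [
    (b"\xff\xd8", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF8", "GIF"),
    (b"MM\x00\x2a", "TIFF-Motorola"),
    (b"II\x2a\x00", "TIFF-Intel"),
    (b"P5", "PGM2"),
    (b"P2", "Plain PGM"),
    (b"P1", "PBM"),
    (b"BM", "BMP"),
]

def detect_image_type(data):
    """Return the image type whose starting bytes open the data, or None."""
    for magic, name in START_BYTES:
        if data.startswith(magic):
            return name
    return None

def carve_jpeg(data):
    """Cut out the bytes from the first JPEG start marker to the last end marker."""
    start = data.find(b"\xff\xd8")
    end = data.rfind(b"\xff\xd9")
    if start == -1 or end < start:
        return None
    return data[start:end + 2]

# Made-up bytes: a fake JPEG wrapped in unrelated data.
blob = b"junk" + b"\xff\xd8\xff\xe0JFIF-body\xff\xd9" + b"trailer"
jpeg = carve_jpeg(blob)
```

Searching for the last end marker rather than the first is a deliberate choice, since embedded thumbnails carry their own markers; it will over-read if several JPEGs sit in the same blob. The two-byte signatures such as P1 or BM are also weak on their own and are best combined with a file extension or further header checks.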

GIF: |