*Due to time constraints, I will be publishing large articles on the weekends with a daily small article for the time being.
Now we start to delve into PDF images, since the PDF text processing articles have been quite popular. Not everything in a PDF can be extracted with straight text conversion, and the biggest headache is the PDF image. Luckily, our "don't be evil" (heavy emphasis here) friends gave us Tesseract, which, with training, is also quite good at breaking their own paid Captcha products, to my great amusement and my company's profit.
A plethora of image pre-processing libraries and a bit of post-processing are still necessary to complete this task. Images must have high enough contrast and be large enough to make sense of. Basically, the algorithm consists of saving an image, pre-processing it, running optical character recognition, and then performing clean-up tasks.
Saving Images Using Software and by Finding Stream Objects
For Linux users, saving images from a PDF is best done with Poppler utils (the poppler-utils package), which comes with Fedora, CentOS, and Ubuntu distributions and writes the images under a specified path prefix, the image root. The command format is pdfimages [options] [pdf file path] [image root]. Options are included for specifying a starting page [-f int], an ending page [-l int], and more. Just type pdfimages with no arguments into a Linux terminal to see all of the options.
pdfimages -j /path/to/file.pdf /image/root/
To see whether a PDF contains any images, just type pdfimages -list /path/to/file.pdf.
Windows users can use a similar command with the open source Xpdf tools.
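For batch jobs it can be handy to drive pdfimages from Python. The following is only a sketch: the helper names build_pdfimages_cmd and extract_images are made up for illustration, and it assumes the pdfimages binary is on the PATH. The -j, -f, and -l flags are the ones described above.

```python
import subprocess


def build_pdfimages_cmd(pdf_path, image_root, first_page=None, last_page=None, jpeg=True):
    """Assemble the pdfimages argument list (hypothetical helper)."""
    cmd = ["pdfimages"]
    if jpeg:
        cmd.append("-j")  # write JPEG-encoded images out as .jpg files
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to scan
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to scan
    cmd += [pdf_path, image_root]
    return cmd


def extract_images(pdf_path, image_root, **kwargs):
    """Run pdfimages; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_pdfimages_cmd(pdf_path, image_root, **kwargs), check=True)
```

Keeping the command construction separate from the subprocess call makes it easy to print and check the command before running it across a whole directory of PDFs.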
It is also possible to use the magic numbers I wrote about in a different article to find the images: iterate across the PDF stream objects, find the starting and ending bytes of each image, and write those bytes to a file with open(path, "wb").write(data). A stream object is the way Adobe embeds objects in a PDF and is represented below. The find method can be used to check that streams exist, and the regular expression call re.finditer("(?mis)(?<=stream).*?(?=endstream)", pdf) will find all of the streams.
stream ....our gibberish looking bytes.... endstream
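As a rough sketch of the stream approach, the snippet below pulls JPEG data out of raw PDF bytes. It makes several assumptions: the images are stored as plain JPEG (DCTDecode) streams, so the JPEG magic numbers FF D8 and FF D9 are visible in the file; the function names are made up for illustration; and the regular expression is a slightly stricter variant of the one above that refuses to treat the tail of "endstream" as the start of a new stream. Images compressed with other filters will not be found this way.

```python
import re

# "stream" not preceded by "end", an end-of-line, then the payload up to "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
JPEG_START = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_END = b"\xff\xd9"    # JPEG end-of-image marker


def find_jpeg_streams(pdf_bytes):
    """Return the JPEG payload of every stream that contains one."""
    images = []
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        start = data.find(JPEG_START)
        end = data.rfind(JPEG_END)
        if start != -1 and end > start:
            images.append(data[start:end + len(JPEG_END)])
    return images


def save_jpeg_streams(pdf_path, image_root):
    """Write each JPEG found in pdf_path to image_root + index + '.jpg'."""
    with open(pdf_path, "rb") as handle:
        pdf_bytes = handle.read()
    paths = []
    for index, image in enumerate(find_jpeg_streams(pdf_bytes)):
        path = "%s%03d.jpg" % (image_root, index)
        with open(path, "wb") as out:
            out.write(image)
        paths.append(path)
    return paths
```

The file is read in binary mode and the pattern is a bytes pattern, since stream contents are not valid text.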
Python offers a variety of extremely good tools via Pillow that eliminate the need for hard-coded pre-processing of the kind found in my image tools for Java.
Some of the features that Pillow includes are:
- Edge Enhancement
These classes should work for most pdfs. For more, I will be posting a decluttering algorithm in a Captcha Breaking post soon.
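A minimal pre-processing pass with Pillow might look like the sketch below. The choice and order of steps (grayscale, auto-contrast, edge enhancement) is my assumption of a reasonable starting point rather than a fixed recipe, and preprocess_for_ocr is a made-up name.

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess_for_ocr(im):
    """Return a grayscale, contrast-stretched, edge-enhanced copy of im."""
    gray = im.convert("L")                    # drop colour information
    stretched = ImageOps.autocontrast(gray)   # spread pixel values over 0-255
    return stretched.filter(ImageFilter.EDGE_ENHANCE)


if __name__ == "__main__":
    # Stand-in image; in practice this would be Image.open() on an extracted file.
    page = Image.new("RGB", (200, 100), "white")
    print(preprocess_for_ocr(page).mode)
```

Each step returns a new image, so it is easy to save the intermediate results and see which one helps or hurts a given document.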
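For resizing, OpenCV includes a module that avoids pixelation with a bit of math magic.
    #!/usr/bin/python
    import cv2

    im = cv2.imread("impath")
    h, w = im.shape[:2]
    # resize expects (width, height); cubic interpolation smooths the upscale
    im = cv2.resize(im, (w * 2, h * 2), interpolation=cv2.INTER_CUBIC)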
OCR with Tesseract
With a subprocess call or the use of pytesser (which offers quicker access to Tesseract by wrapping the subprocess call and ensuring reliability), it is possible to OCR the document.
    #!/usr/bin/python
    from PIL import Image
    import pytesser

    im = Image.open("fpath")
    string = pytesser.image_to_string(im)
If the string comes out as complete garbage, try using the pre-processing modules to fix it or look at my image tools for ways to write custom algorithms.
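For the plain subprocess route, something like the following sketch should do. It assumes the tesseract binary is on the PATH and relies on the standard command form tesseract [image] [output base] -l [language], where Tesseract appends .txt to the output base. The function names are made up for illustration.

```python
import os
import subprocess
import tempfile


def build_tesseract_cmd(image_path, output_base, lang="eng"):
    """Assemble the command line; tesseract appends .txt to output_base."""
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_image(image_path, lang="eng"):
    """OCR one image file and return the recognised text."""
    with tempfile.TemporaryDirectory() as workdir:
        output_base = os.path.join(workdir, "ocr")
        subprocess.run(build_tesseract_cmd(image_path, output_base, lang), check=True)
        with open(output_base + ".txt", encoding="utf-8") as handle:
            return handle.read()
```

Writing to a temporary directory keeps the intermediate .txt files from piling up next to the images.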
Unfortunately, Tesseract is not a commercial grade product for performing PDF OCR. It will require post-processing. Fortunately, Python provides a terrific set of modules and capabilities for dealing with data quickly and effectively.
The regular expression module re, list comprehension, and substrings are useful in this case.
An example of post-processing would be (in continuance of the previous section):
    import re

    lines = string.split("\n")
    # drop lines that contain known junk
    lines = [x for x in lines if "bad stuff" not in x]
    results = []
    for line in lines:
        if re.search("pattern ", line):
            results.append(re.sub("bad pattern", "replacement", line))
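To make the idea concrete, here is a small self-contained sketch run against a made-up OCR string. The rules (collapse whitespace, drop lines with no letters or digits) are my own illustrative assumptions; real documents will need their own patterns.

```python
import re


def clean_ocr_text(text):
    """Drop noise lines and normalise whitespace in OCR output."""
    cleaned = []
    for line in text.split("\n"):
        line = re.sub(r"\s+", " ", line).strip()  # collapse runs of whitespace
        # Skip lines with no letters or digits at all (specks, rules, borders).
        if not re.search(r"[A-Za-z0-9]", line):
            continue
        cleaned.append(line)
    return cleaned


sample = "Invoice   No. 1234\n~~ .. __\n\nTotal:  $56.00 "
print(clean_ocr_text(sample))  # ['Invoice No. 1234', 'Total: $56.00']
```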
It is definitely possible to obtain text from a PDF using Tesseract. Post-processing is a requirement though. For documents that are proving difficult, commercial software is the best solution, with SimpleOCR and ABBYY FineReader offering quality solutions. ABBYY offers the best quality in my opinion but comes at a high price: a quote for an API for just reading PDF documents came in at $5000. I have managed to use the Tesseract approach successfully at work, but the process is time consuming and not guaranteed.