Since captchas are meant to be unreadable by computers, they are a great tool for learning the task of OCR. As even Google now admits, captchas are breakable. That is concerning from a security standpoint: even an open source OCR like Tesseract can defeat the system, and a little computer vision with some basic pre-processing in Python will break most captchas. That is why I champion mapping and analyzing click stream data with Latent Dirichlet Allocation to classify human from non-human, hacker from non-hacker (stay tuned, it's coming). Add the LDA approach to a captcha system, giving automated processes a higher probability of failure (a guess on my part), use click stream data to form vectors (literal mathematical vectors), and security becomes a lot better.
Let's do some captcha breaking, but beware: this is purely educational and not for breaking the law! Many captchas offer sound options to comply with accessibility laws, and the simpler audio puzzles can be broken with speech recognition tools such as Sphinx4. However, the dilution of the sound in modern captchas can make OCR useful for aiding the disabled. Basically, there are uses of this code that are likely to remain legal even as companies look to narrow the definition of authorized access.
Captcha images contain all sorts of clutter and manipulations with the goal of defeating machine readability. This makes pre-processing critical to the task at hand. Speed is also a crucial consideration, so any library or custom code needs to be extremely efficient.
Two Python modules help with pre-processing: OpenCV (cv2) and Pillow (PIL). My image toolset can also be used for these tasks.
OpenCV is a powerful open source library that makes a lot of calculus- and differential-equation-heavy computer vision code incredibly easy to deploy, and it runs extremely quickly. The modules are largely written in C, and there is also a C++ API. OpenCV is great if you want to write custom code too, as the tutorials dive deeply into the mathematics behind each algorithm. For this task, functions from cv2 including resize (which goes further than basic expansion), Canny edge detection, and blurring are highly effective. After writing the same routines in Java, and even using a graphics card library in Python for these tasks, I found that OpenCV matches or is only slightly slower than the custom code. The images are small, though.
OpenCV's other modules are incredibly good at contouring, corner detection, and key point finding. If you have a computer vision or artificial intelligence task, OpenCV is the go-to API.
For basic pre-processing, Pillow is also an incredibly fast library. Again, compared to my own Java code, its modules run at about the same speed. The idea behind them is the use of kernels: small matrices filled with weights that are swept across an image in a window to redistribute color.
from PIL import ImageEnhance
from PIL import ImageFilter
from PIL import Image
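A kernel in Pillow is exactly that small weighted matrix. The sketch below sweeps a 3x3 averaging (box blur) kernel over a made-up gradient image; `scale=9` normalizes the nine unit weights:

```python
import numpy as np
from PIL import Image, ImageFilter

# build a small grayscale test image: a horizontal gradient
arr = np.tile(np.arange(0, 250, 25, dtype=np.uint8), (10, 1))
im = Image.fromarray(arr, mode="L")

# a 3x3 box blur: every weight is 1, and scale divides the weighted sum by 9
kernel = ImageFilter.Kernel((3, 3), [1] * 9, scale=9)
blurred = im.filter(kernel)
```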
All of the necessary pre-processing, whether custom or module based, can be completed in less than one second, producing the result shown below. However, it is necessary to fiddle with the images until they look as close as possible to the way a normal input would.
Overall, the total time taken to break a captcha ranges from roughly one second or less to four seconds on a dual core machine with 4 GB of RAM. Completing tasks with custom code may improve speed when using faster matrix libraries, but numpy is fairly efficient these days.
One extremely useful trick is to resize the image and improve contrast.
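With Pillow, a minimal sketch of that trick might look like this (the factor and contrast values are arbitrary placeholders; tune them per captcha):

```python
from PIL import Image, ImageEnhance

def upscale_and_sharpen(im, factor=2, contrast=2.0):
    """Enlarge an image and boost its contrast before OCR."""
    # resize defaults to bicubic interpolation, smoothing the enlarged edges
    im = im.resize((im.width * factor, im.height * factor))
    # values > 1.0 increase contrast; 1.0 leaves the image unchanged
    return ImageEnhance.Contrast(im).enhance(contrast)

# demo on a flat gray image
demo = upscale_and_sharpen(Image.new("L", (10, 10), 128))
```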
If using numpy, there is also an extremely useful bit of Python magic for applying a function to every pixel of an image.
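The "magic" is that numpy broadcasts comparisons and arithmetic across the whole array at once, so no explicit per-pixel loop is needed. A binarization sketch:

```python
import numpy as np

def binarize(arr, threshold=128):
    """Map every pixel below the threshold to black and the rest to white."""
    # the comparison runs over the entire array; where() picks 0 or 255 per pixel
    return np.where(arr < threshold, 0, 255).astype(np.uint8)

# demo: two pixels below the threshold, two at or above it
out = binarize(np.array([[10, 200], [128, 127]], dtype=np.uint8))
```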
Decluttering with Statistics
Certain transforms and other techniques may leave unwanted clutter in the image. Removing small or abnormally sized objects from an image can be done with basic statistics. Remember that the standard deviation is sqrt(sum((x-mean)^2)/n), and that values roughly 1.5 standard deviations from the mean are mild outliers while 3 standard deviations marks extreme outliers. This can be used to eliminate elements that are longer than the others. The example below follows an object, eliminates it based on width, and has proven successful. If vertical objects are present, surface area coverage may be a better criterion. These methods work better than contouring here because the objects are not always connected properly; the image needs to be readable by a human, not a computer.
"""Declutter an Image: whiten dark runs whose width is an outlier."""
def declutter(arr, o):
    height, width = arr.shape
    # first pass: get the avg run width and count of removed objects
    wsarr, ws, total = [], None, 0
    for i in range(height):
        for j in range(width):
            if arr[i][j] < 128:
                ws = j if ws is None else ws
            elif ws is not None:
                wsarr.append(j - ws)
                ws = None
        ws = None
    avg = sum(wsarr) / float(len(wsarr))
    # second pass: whiten runs whose width lies outside avg +/- o
    for i in range(height):
        for j in range(width):
            if arr[i][j] < 128:
                ws = j if ws is None else ws
            elif ws is not None:
                if (j - ws) > (avg + o) or (j - ws) < (avg - o):
                    arr[i, ws:j] = 255
                    total += 1
                ws = None
        ws = None
    print(str(total) + " objects removed")
    return arr
Rotating with the Bounding Box and an Ode to OpenCV
In order to complete our work, it is a good idea to know how to find the minimum bounding box and rotate the image. This is made difficult in many instances by the fact that the letters in a Captcha are not always continuous black lines. Thankfully, OpenCV contains a multitude of feature detectors that can help outline or find key features for the different objects.
My initial attempts at this involved a gaggle of different methods. After trying a threshold with Euclidean distances to produce a set further reduced by basic statistics around centroids, plus a few other approaches, I arrived at contours. The Euclidean distance method would likely work on regular lines of text, but here characters are intentionally smashed together or mixed with unwanted lines, and I kept getting double letters on different captchas. The main result of those methods was frustration.
In contrast to that feeling of being lost, OpenCV's contour class can find angles, rotated bounding boxes, and many other useful features. Contours rock.
A contour basically traces defined edges. Algorithms include grouping objects found by edge detection (even with something like Latent Dirichlet Allocation) and marching squares, which follows defined edges along certain angles and patterns, a bit like a Haar cascade. Obviously, any AI-backed implementation is much stronger.
Get Contours from CV2
With the contours found, it is a good idea to discover any patterns in spacing and use those to combine the boxes with an expanded intersection condition and a while loop.
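One way to sketch that merge step: expand each box by a gap, union any boxes that then intersect, and loop until nothing combines. The `gap` value and helper names are my own placeholders; tune the expansion to the spacing you find:

```python
def expand(b, g):
    x, y, w, h = b
    return (x - g, y - g, w + 2 * g, h + 2 * g)

def intersects(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def union(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x, y = min(ax, bx), min(ay, by)
    return (x, y, max(ax + aw, bx + bw) - x, max(ay + ah, by + bh) - y)

def merge_boxes(boxes, gap=2):
    """Merge (x, y, w, h) boxes that intersect once expanded by `gap`."""
    boxes = list(boxes)
    merged = True
    while merged:  # keep passing over the list until no boxes combine
        merged = False
        out = []
        for box in boxes:
            for i, other in enumerate(out):
                if intersects(expand(box, gap), expand(other, gap)):
                    out[i] = union(box, other)
                    merged = True
                    break
            else:
                out.append(box)
        boxes = out
    return boxes
```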
Now, we can rotate our individual letters. The equation here is simple and the code can be written easily. The following kernel rotates a point by the angle theta about the center of an image: [[cos(theta), -sin(theta)], [sin(theta), cos(theta)]].
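Applying that kernel to coordinates directly looks like this (rotation about an arbitrary center, angles in radians; the function name is my own):

```python
import numpy as np

def rotate_coords(points, theta, center=(0.0, 0.0)):
    """Rotate (x, y) points about a center using the 2x2 rotation kernel."""
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    pts = np.asarray(points, dtype=float) - center
    # points are row vectors, so multiply by the transpose of the kernel
    return pts @ rot.T + center
```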
The code for the equation using scipy is as follows. It is also possible to apply the above matrix to each pixel to find its new location; I wrote the kernel into the sample.
import cv2
from scipy import ndimage

def rotate_on_contour(img, cnt):
    """
    Utilizes contouring via cv2 to recognize letters and rotate them.
    Takes in a numpy array to work with and returns a numpy array.
    This will rotate on a provided contour.

    :param img: image as a numpy array
    :param cnt: contour to rotate on
    """
    # get the basic points
    rect = cv2.minAreaRect(cnt)
    # get the rotational angle
    degree = rect[2]
    print("Rotational Degrees: " + str(degree))
    # rotate with interpolation in scipy, filling the background with white
    return ndimage.rotate(img, degree, reshape=False, cval=255)
Another, more powerful rotation uses the same ndimage (whose interpolation is likely much better than simply multiplying each coordinate by the rotation matrix) and crops the result. This version uses a best-fit line whose slope is found with one of a variety of least-squares calculations. The cv2 function used, fitLine, returns a unit direction vector (vx, vy) collinear with the line plus a point on it.
import math
import cv2
from scipy import ndimage

def rotate_on_best_fit(image, cnt, tolerance):
    """
    Takes in contour information and rotates based on
    the line of best fit through the contour points. This
    can be done after getting the original box, though
    the box may reduce the effect of large skews.

    The minimum area rectangle is also considered in
    case a letter is skewed but only stretched (skew not detectable
    from the box). These will not always be the same since a
    statistical best fit is not always the same as finding the
    absolute max and min points.

    :param image: the image to rotate
    :param cnt: the contour information
    :param tolerance: the allowable tolerance (deviation from 0 degrees in radians)
    """
    d90 = math.pi / 2
    boxpheta = (cv2.minAreaRect(cnt)[2] * math.pi) / 180
    print("BoxPheta: " + str(boxpheta))
    if abs(boxpheta - d90) > tolerance:
        # fit a line through the contour points; (vx, vy) is a unit direction
        vx, vy, x0, y0 = cv2.fitLine(cnt, cv2.DIST_L2, 0, 0.01, 0.01)
        vx, vy = float(vx), float(vy)
        print("Slope Points: " + str(vx) + "," + str(vy))
        if vx == 0 and vy == 1:
            return image  # already vertical; nothing to rotate
        slope = vy / vx
        pheta = math.atan(slope)
        print("Slope: " + str(slope))
        print("Pheta: " + str(pheta))
        print("Pheta (degrees): " + str((pheta * 180) / math.pi))
        degrees = (pheta * 180) / math.pi
        if vx > 0:
            degrees = -degrees
        return ndimage.rotate(image, degrees, reshape=False, cval=255)
    return image
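The same best-fit slope can also be computed without cv2 by running an ordinary least-squares fit over the contour points, for example with numpy's polyfit:

```python
import math
import numpy as np

def best_fit_angle(xs, ys):
    """Return the angle (radians) of the least-squares line through points."""
    # a degree-1 polynomial fit is a line; coefficients come highest-degree first
    slope, intercept = np.polyfit(xs, ys, 1)
    return math.atan(slope)
```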
If rotation is necessary, it may be possible to place the letters in a set and attempt to read them individually rather than stitch an image back together. Since the background is white, it is also possible to rotate the letters and place them back into the image. The alignment section below shows how this is done.
Aligning individual images is normally necessary, especially for systems considering shapes that use LDA or a similar AI task to learn different groupings. If a u sits too far above a preceding letter like an S, the response from the system may be ” instead of u.
Using cv2, it is not difficult to align letters that are nearly good enough to run through OCR. OpenCV includes powerful contouring tools based on machine learning techniques. Contouring allows individual letters to be discerned, opening the door to further techniques such as applying a rotational matrix to the image matrix, as described in the rotation section.
The following code exemplifies the process of alignment. It does not, however, consider whether a letter has a stem.
import cv2
import numpy as np
from PIL import Image

def align_letters(img, maxBoxArea=None, minBoxArea=None,
                  printBoxArea=False, imShow=False):
    """
    Align Letters in a Pre-Processed Image. Options are provided
    to limit the size of accepted bounding boxes, helping eliminate
    non-contours and the usual box covering the image as a whole.
    Returns a PIL Image.

    :param img: numpy array image to use
    :param maxBoxArea: maximum bounding box area to accept (None is a non check)
    :param minBoxArea: minimum bounding box area to accept (None is a non check)
    :param printBoxArea: whether to print box areas (list trimmed if maxBoxArea or minBoxArea set)
    :param imShow: show the individual contoured images and final image
    """
    # get the contours and bounding boxes and filter out bad bounding boxes
    contours, _ = cv2.findContours(img.copy(), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for cnt in contours:
        x, y, w, h = cv2.boundingRect(cnt)
        # obtain only bounding boxes meeting a certain criteria
        if (maxBoxArea is None or w * h < maxBoxArea) and \
                (minBoxArea is None or w * h > minBoxArea):
            if printBoxArea is True:
                print(str(w * h))
            boxes.append((x, y, w, h))
    # write each box to a new image aligned at the bottom
    height = max(h for (x, y, w, h) in boxes)
    out = np.full((height, sum(w for (x, y, w, h) in boxes)), 255, np.uint8)
    xpos = 0
    for (x, y, w, h) in sorted(boxes):
        if imShow is True:
            Image.fromarray(img[y:y + h, x:x + w]).show()
        out[height - h:height, xpos:xpos + w] = img[y:y + h, x:x + w]
        xpos += w
    if imShow is True:
        Image.fromarray(out).show()
    return Image.fromarray(out)
Some Final Processing
Your image should either be split up and stored in a set of characters or look fairly discernible by this point. If it is not, take the time to do some final pre-processing. Using Pillow (PIL) to expand edges and eliminate disconnects is one important task for final processing. Try to make the image look like newsprint.
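One way to fatten thin, broken strokes with Pillow is a MinFilter, which spreads dark pixels into their neighborhood (this assumes black text on a white background; use MaxFilter for the reverse):

```python
from PIL import Image, ImageFilter

def thicken(im, size=3):
    """Dilate dark strokes: each pixel becomes the minimum of its neighborhood."""
    return im.filter(ImageFilter.MinFilter(size))

# demo: a single black pixel grows into a 3x3 blob
demo = Image.new("L", (7, 7), 255)
demo.putpixel((3, 3), 0)
fat = thicken(demo)
```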
Tesseract: The Easy Part
Now, the image is ready for Tesseract. The command line tool can be run via the subprocess library on a temporary file, or a pipe can be established directly to Tesseract. The OCR splits letters into key features, clusters on them, and then either compares the offsets of the characters it finds to a letter set or runs a covariance comparison. I will personally test the covariance approach in a later article and am building a massive training set.
If you have a set of letters, be sure to use the -psm option and set it to 10. This tells Tesseract to perform single character analysis.
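A sketch of the subprocess route, assuming the tesseract binary is on the PATH (older releases spell the flag -psm, newer ones --psm):

```python
import subprocess

def tesseract_cmd(image_path, psm=10):
    """Build the Tesseract CLI call; psm 10 means single-character mode."""
    # 'stdout' tells Tesseract to print the text instead of writing a file
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]

def ocr(image_path, psm=10):
    # raises CalledProcessError if Tesseract fails on the image
    return subprocess.check_output(tesseract_cmd(image_path, psm)).decode().strip()
```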
Running Tesseract is simple with pytesser.
from PIL import Image
Please. Do not Spam. I am not responsible for the misuse of this Code. This is purely educational!!!
For shits and giggles, here is an LDA algorithm from scikit-learn that can be used with a covariance matrix and your own really, really big set of letters. Python really is excellent for AI, and much easier to use than Java if you want effectively unlimited matrix sizes, thanks to gensim, numpy, and scikit-learn. Dare I say numpy is even faster at matrix solving than almost all Java packages (when not using a tool like Mahout) due to its use of BLAS and LAPACK where possible. That is partly why I used Tesseract; the other reason is that Google runs or ran a captcha service while also supplying this really amazing OCR product.
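Since "LDA" here is paired with a covariance matrix and a labeled letter set, scikit-learn's LinearDiscriminantAnalysis (a classifier built on per-class means and a pooled covariance) seems the closest fit, though the acronym could also mean LatentDirichletAllocation; a sketch on made-up data standing in for flattened letter images:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# fake "letters": two well-separated clusters of flattened feature vectors
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 4) + 5, rng.randn(20, 4) - 5])
y = np.array([0] * 20 + [1] * 20)

# LDA fits class means and a shared covariance, then classifies by distance
clf = LinearDiscriminantAnalysis()
clf.fit(X, y)
```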