Secure Your Data Series: Why a Captcha Alone Fails

Since captchas are meant to be unreadable by a computer, they are a great tool for learning OCR. As even Google now admits, Captchas are breakable. This is concerning from a security standpoint, since even an open source OCR engine like Tesseract can defeat the system. A little computer vision and some basic pre-processing in Python will break most Captchas. That is why I champion mapping and analyzing click stream data with Latent Dirichlet Allocation to classify human from non-human or hacker from non-hacker (stay tuned, it's coming). Combine the LDA approach with a captcha system that has a higher probability of failure for automated processes (guessing here) and with click stream data used to form vectors (literal mathematical vectors), and security becomes a lot better.

Let’s do some Captcha breaking, but beware: this is purely educational and not for breaking the law! Many Captchas offer sound options to comply with disability laws, and the simpler audio puzzles can be broken with speech recognition such as Sphinx4. However, the distortion of the sound in modern Captchas can make OCR useful for aiding the disabled. Basically, there are uses of this code that are likely to remain legal even as companies look to narrow the definition of authorized access.

Image Preprocessing

Captcha images contain all sorts of clutter and manipulations with the goal of eliminating readability. This makes pre-processing critical to the goal at hand. Speed is the crucial consideration in this task so any library or custom code needs to be extremely efficient.

Two modules exist in Python that help with preprocessing. They are OpenCV (cv2) and pillow (PIL). My image toolset can also be used for these tasks.

OpenCV is a powerful open source library with the aim of making a lot of calculus and differential equation based code for computer vision incredibly easy to deploy. It runs extremely quickly. The library is largely written in C++ with older C interfaces, and the Python cv2 module wraps it. OpenCV is great if you want to write custom code too, as the tutorials dive deeply into the mathematics behind each program. For this case, functions from cv2 including resize (which goes further than basic expansion), Canny edge detection, and blurring are highly effective. After writing the same routines in Java and even using a graphics card library in Python to do the same tasks, it appears that OpenCV matches or is only slightly slower than the custom code. The images are small, though.

Other modules are incredibly good at performing contouring, corner detection, and key point finding. If you have a computer vision or artificial intelligence task, OpenCV is the go-to API.

import cv2

For basic pre-processing, pillow is also an incredibly fast library. Again, compared to my own Java code, the modules work at about the same speed. The idea behind them is the use of kernels, small matrices filled with weights that can be used to distribute color in an image via a window.

from PIL import ImageEnhance
from PIL import ImageFilter
from PIL import Image
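
To make the kernel idea concrete, here is a minimal sketch in plain numpy rather than pillow's own implementation. The function name apply_kernel and the box blur kernel are illustrative assumptions, not part of either library.

```python
import numpy as np

def apply_kernel(image, kernel):
    """Slide a small weight matrix over a 2-D grayscale array (edges replicated)."""
    kh, kw = kernel.shape
    padded = np.pad(image.astype(float),
                    ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros(image.shape, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            # Each weight scales a shifted copy of the image; the kernel is
            # not flipped, which makes no difference for symmetric kernels.
            out += kernel[dy, dx] * padded[dy:dy + image.shape[0],
                                           dx:dx + image.shape[1]]
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

# A 3x3 box blur: every pixel becomes the mean of its neighborhood.
box_blur = np.ones((3, 3)) / 9.0
```

Swapping in different weights gives sharpening or edge kernels with the same loop.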

All of the necessary pre-processing, whether custom or module based, can be completed in less than one second. However, it is necessary to fiddle with the images until they look as close as possible to the way a normal input would.

Overall, the total time taken to break a captcha ranges from roughly one second or less to four seconds on a dual core machine with 4 GB of RAM. Completing tasks with custom code may improve speed when using faster matrix libraries, but numpy is fairly efficient in today’s world.

One extremely useful trick is to resize the image and improve contrast.

img=Image.open('captcha.jpg').convert('L')
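
A minimal sketch of the resize and contrast step with pillow follows. The helper name and the default scale and contrast values are assumptions chosen for illustration and will need tuning per Captcha.

```python
from PIL import Image, ImageEnhance

def enlarge_and_sharpen(img, scale=3, contrast=2.0):
    """Return a grayscale copy scaled up by `scale` with boosted contrast."""
    gray = img.convert("L")
    bigger = gray.resize((gray.width * scale, gray.height * scale), Image.BICUBIC)
    # A factor above 1.0 pushes pixels away from the mean gray level.
    return ImageEnhance.Contrast(bigger).enhance(contrast)
```

Typical use would be enlarge_and_sharpen(Image.open('captcha.jpg')).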


If using numpy, there is an extremely useful way to apply a function to all pixels of an image as well using some Python magic.
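
Two common ways of doing this are sketched below. The function names and the cutoff of 128 are assumptions; np.where handles the whole array at once, while np.vectorize wraps any scalar function at the cost of speed.

```python
import numpy as np

def binarize(arr, cutoff=128):
    """Map every pixel below `cutoff` to 0 (ink) and the rest to 255 (paper)."""
    return np.where(arr < cutoff, 0, 255).astype(np.uint8)

def apply_per_pixel(arr, func):
    """Apply an arbitrary scalar function to every pixel (slower, but general)."""
    return np.vectorize(func)(arr)
```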


Decluttering with Statistics

Certain transforms and other techniques may leave unwanted clutter in the image. Removing small or abnormally sized objects from an image can be done with basic statistics. Remember that 1.5 standard deviations from the mean, where the standard deviation is sqrt(sum((x-mean)^2)/n), marks a normal outlier and 3 standard deviations marks an extreme outlier. This can be used to eliminate elements that are longer than others. The example below follows an object and eliminates it based on width and has proven successful. If vertical objects are present, surface area coverage may be a better consideration. These work better than contouring here because the objects in the images are not always connected properly. They need to be readable by a human and not a computer.

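    def declutter(self,inarr,devs=1.5):
        """Declutter an Image by whitening dark runs of abnormal width.
        Expects a 2-D numpy array. The run tracking is a reconstruction of
        the truncated listing and devs is an assumed parameter."""
        arr=inarr.copy()
        height=len(arr)
        #get the width of every horizontal run of dark pixels
        runs=[]
        for i in range(height):
            ws=None
            #the trailing 255 closes a run that touches the right edge
            for j,c in enumerate(list(arr[i])+[255]):
                if c < 128 and ws is None:
                    ws=j
                elif c >= 128 and ws is not None:
                    runs.append((i,ws,j-ws))
                    ws=None
        if not runs:
            return (arr,0)
        #calculate avg and sd
        avg=sum(w for i,ws,w in runs)/float(len(runs))
        sd=(sum((w-avg)**2 for i,ws,w in runs)/len(runs))**0.5
        o=devs*sd
        #perform declutter
        total=0
        for i,ws,w in runs:
            if w > (avg+o) or w < (avg-o):
                arr[i][ws:ws+w]=255
                total+=1
        print(str(total)+" objects removed")
        return (arr,total)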
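
The outlier rule itself can be tried in isolation on a plain list of widths. This is a small sketch with an assumed function name, using the population standard deviation described above.

```python
import math

def width_outliers(widths, devs=1.5):
    """Return indices of widths more than `devs` standard deviations from the mean."""
    n = len(widths)
    if n == 0:
        return []
    mean = sum(widths) / float(n)
    sd = math.sqrt(sum((w - mean) ** 2 for w in widths) / n)
    return [i for i, w in enumerate(widths) if abs(w - mean) > devs * sd]
```

For widths of 10, 11, 9, 10, 12, 10 and 40 pixels, only the last entry is flagged.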


Rotating with the Bounding Box and an Ode to OpenCV

In order to complete our work, it is a good idea to know how to find the minimum bounding box and rotate the image. This is made difficult in many instances by the fact that the letters in a Captcha are not always continuous black lines. Thankfully, OpenCV contains a multitude of feature detectors that can help outline or find key features for the different objects.

My initial attempts at this involved a gaggle of different methods. After attempts at using a threshold with Euclidean distances to produce a set further reduced by basic statistics revolving around centroids, and a few other methods, I arrived at contours. The Euclidean distance method would likely work on regular text lines but, here, characters are intentionally smashed together or unwanted lines are mixed in. I kept getting double letters with different Captchas. The result of using these methods was frustration.

In contrast to the feeling of being lost, OpenCV’s contour functions can find angles, bounding boxes that are rotated, and many other useful features. Contours rock.

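A contour is basically a curve traced along the defined edges of an object. Approaches to finding them include using Latent Dirichlet Allocation to group objects found from edge detection and marching squares, which follows defined edges along certain angles and patterns, much like a Haar cascade in a way. cv2.findContours itself follows borders in a binary image, so the image is thresholded first. Obviously any AI implementation is much stronger.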

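    def getContours(self,img):
        """Get Contours from CV2"""
        #findContours expects a binary image (the cutoff of 127 is an assumed value)
        ret,thresh=cv2.threshold(img,127,255,cv2.THRESH_BINARY)
        #the return value has two or three elements depending on the OpenCV version
        return cv2.findContours(thresh,cv2.RETR_TREE,cv2.CHAIN_APPROX_SIMPLE)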

With the contours found, it is a good idea to discover any patterns in spacing and use those to combine the boxes with an expanded intersection condition and a while loop.
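
One way to read "expanded intersection condition and a while loop" is sketched below in plain Python. The (x, y, w, h) box layout matches cv2.boundingRect, but the function names and the pad value are assumptions.

```python
def boxes_touch(a, b, pad=2):
    """True when box `a`, grown by `pad` pixels on every side, overlaps box `b`."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return (ax - pad < bx + bw and bx - pad < ax + aw and
            ay - pad < by + bh and by - pad < ay + ah)

def merge_boxes(boxes, pad=2):
    """Repeatedly merge touching boxes until no pair can be combined."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if boxes_touch(boxes[i], boxes[j], pad):
                    ax, ay, aw, ah = boxes[i]
                    bx, by, bw, bh = boxes[j]
                    x, y = min(ax, bx), min(ay, by)
                    x2, y2 = max(ax + aw, bx + bw), max(ay + ah, by + bh)
                    # Replace the pair with the box that covers both.
                    boxes[i] = (x, y, x2 - x, y2 - y)
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes
```

The pad would normally come from the spacing pattern observed between letters.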

Now, we can rotate our individual letters. The equation here is simple and the code can be written easily. The following kernel is useful for rotation about the center of an image.
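
For reference, the standard 2D rotation matrix is [[cos t, -sin t], [sin t, cos t]], applied after shifting the center to the origin. The sketch below applies it to point coordinates with numpy. The function name is an assumption, and in image coordinates (y pointing down) a positive angle appears clockwise on screen.

```python
import math
import numpy as np

def rotate_points(points, degrees, center):
    """Rotate (x, y) points about `center` by `degrees` using the rotation matrix."""
    t = math.radians(degrees)
    rot = np.array([[math.cos(t), -math.sin(t)],
                    [math.sin(t),  math.cos(t)]])
    # Shift so the center is the origin, rotate, then shift back.
    pts = np.asarray(points, dtype=float) - np.asarray(center, dtype=float)
    return pts.dot(rot.T) + np.asarray(center, dtype=float)
```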

The code for the equation in scipy is as follows. It is possible to use the above matrix on each pixel to discover the new location as well. I wrote the kernel into the sample.

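    def rotate(self,img,cnt):
        """
        Utilizes contouring via cv2 to recognize letters and rotate them.
        Takes in a numpy array to work with and returns a numpy array.
        This will rotate on a provided contour.

        *Required Parameters*
        :param img: image as a numpy array
        :param cnt: contour to rotate on
        """
        #requires: import cv2 and from scipy import ndimage
        im=img
        try:
            #get the basic points (assumed here to come from cv2.minAreaRect)
            center,size,degree=cv2.minAreaRect(cnt)
            #get the rotational angle, folded so an upright box gives 0
            if degree < -45:
                degree+=90
            elif degree > 45:
                degree-=90
            print("Rotational Degrees: "+str(degree))
        except cv2.error:
            return im
        #rotate with interpolation in scipy
        return ndimage.rotate(im,degree,mode='nearest',cval=100)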

Another, more powerful rotation uses the same ndimage function, whose algorithm likely interpolates much better than simply multiplying [(x,y),(x1,y1)] by [(x),(y)] and cropping. This uses a best fit line with slope found using one of a variety of least squares calculations. The cv2 function used, fitLine, returns a unit vector (vx, vy) collinear to the line along with a point (x0, y0) on the line.

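    def rotateFromPixels(self,image,cnt,tolerance=(10*math.pi)/180):
        """
        Takes in contour information and rotates based on
        the line of best fit through contour points. I discovered
        that this could be done after getting the original box though
        the box may reduce the effect of large skews.
        This program also considers the minimum area rectangle in
        case a letter is not actually skewed, only stretched (skew not
        detectable from box). I'm guessing that these will not always be
        the same since a statistical best fit is not always the same as
        finding the absolute max and min points.

        *Required Parameters*
        :param image: the image to rotate
        :param cnt: the contour information

        *Optional Parameters*
        :param tolerance: the allowable tolerance (deviation from 0 degrees in radians)
        """
        #NOTE: the assignments were lost from the original listing; the calls
        #to minAreaRect, boxPoints and fitLine are reconstructed assumptions
        d90=math.pi/2
        image2=image
        #angle of the longer side of the minimum area rectangle, in [0, pi)
        box=cv2.boxPoints(cv2.minAreaRect(cnt))
        dx,dy=max([box[1]-box[0],box[2]-box[1]],key=lambda s:s[0]**2+s[1]**2)
        boxpheta=math.atan2(dy,dx)%math.pi
        print("BoxPheta: "+str(boxpheta))
        if abs(boxpheta-d90) > tolerance:
            #fit a least squares line; (vx,vy) is a unit vector along it
            vx,vy,lx,ly=cv2.fitLine(cnt,cv2.DIST_L2,0,0.01,0.01)
            if vx[0] == 0 and abs(vy[0]) == 1:
                return image2
            #angle of the line away from vertical, folded into (-pi/2, pi/2]
            pheta=math.atan2(vx[0],vy[0])
            if pheta > d90:
                pheta-=math.pi
            elif pheta <= -d90:
                pheta+=math.pi
            print("Slope Points: "+str(vx[0])+","+str(vy[0]))
            print("Pheta: "+str(pheta))
            print("Pheta (degrees)"+str((pheta*180)/math.pi))
            if abs(pheta) > tolerance:
                #rotate with interpolation in scipy so the line stands upright
                image2=ndimage.rotate(image,-math.degrees(pheta),mode='nearest')
        return image2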
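
To see what a best fit line through contour points looks like without OpenCV, here is a sketch of an orthogonal least squares fit with numpy's SVD. It returns the same kind of tuple as cv2.fitLine (a unit direction vector and a point on the line), but it is a stand-in written for illustration and not the algorithm OpenCV uses internally.

```python
import numpy as np

def fit_direction(points):
    """Return (vx, vy, x0, y0): a unit vector along the best fit line and a point on it."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The first right singular vector of the centered points is the direction
    # that minimizes the summed squared perpendicular distances.
    _, _, vt = np.linalg.svd(pts - centroid)
    vx, vy = vt[0]
    return vx, vy, centroid[0], centroid[1]
```

The sign of the vector is arbitrary, so any angle taken from it should be folded into a half turn before use.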

If rotation is necessary, it may be possible to stick letters in a set and then attempt to read them individually rather than stitch an image together. Since a white background is present, it is also possible to rotate the letters and stick them back into the image. See the alignment section and its code for how this is done.


Alignment

Aligning individual images is normally necessary, especially for systems considering shapes that use LDA or a similar AI task to learn different groupings. If a u sits too far above a neighboring letter like an S, the response from the system may be ” instead of u.

Using cv2, it is not difficult to align letters so that they are nearly good enough to run through OCR. OpenCV includes powerful contouring tools. Contouring allows individual letters to be discerned, allowing for even more techniques to be applied, such as applying a rotational matrix to the image matrix as described in the rotation section.

The following code exemplifies the process of alignment. It does not, however, consider whether a letter has a stem.

    def alignLetters(self,img,maxBoxArea=None,minBoxArea=None,printBoxArea=False,imShow=False):
        """
        Align Letters in a Pre-Processed Image. Options are provided
        to limit the size of accepted bounding boxes, helping eliminate
        non-contours and the usual box covering the image as a whole.
        Returns a PIL Image.

        *Required Parameters*
        :param img: numpy array image to use (grayscale, dark letters on white)

        *Optional Parameters*
        :param maxBoxArea: maximum bounding box area to accept (None is a non check)
        :param minBoxArea: minimum bounding box area to accept (None is a non check)
        :param printBoxArea: boolean to state whether to print box areas (list trimmed if maxBoxArea or minBoxArea set)
        :param imShow: show the individual contoured images and final image
        """
        #NOTE: the elided lines are reconstructed; the cutoff of 127 and the
        #use of paste instead of a pixel loop are assumptions
        #setup image
        ret,thresh=cv2.threshold(img,127,255,cv2.THRESH_BINARY)
        #[-2] picks the contour list in every OpenCV version
        contours=cv2.findContours(thresh,cv2.RETR_TREE,cv2.CHAIN_APPROX_SIMPLE)[-2]
        #get the contours and bounding boxes and filter out bad bounding boxes
        boxes=[]
        maxheight=0
        for cnt in contours:
            x,y,w,h=cv2.boundingRect(cnt)
            #obtain only bounding boxes meeting a certain criteria
            if (maxBoxArea is None or w*h<maxBoxArea) and (minBoxArea is None or w*h>minBoxArea):
                if printBoxArea is True:
                    print(str(w*h))
                if h>maxheight:
                    maxheight=h
                boxes.append((x,y,w,h))
        #write each box to a new image aligned at the bottom
        endImg=Image.new('L',(img.shape[1],max(maxheight,1)),255)
        for x,y,w,h in sorted(boxes):
            letter=Image.fromarray(img[y:y+h,x:x+w])
            if imShow is True:
                letter.show()
            endImg.paste(letter,(x,maxheight-h))
        if imShow is True:
            endImg.show()
        return endImg

Some Final Processing

Your image should either be split up and stored in a set of characters or look fairly discernible by this point. If it is not, take the time to do some final pre-processing. Using pillow (PIL) to expand edges and eliminate disconnects is one important task for final processing. Try to make the image look like newsprint.
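
One way to expand edges with pillow is a minimum filter, sketched below for dark text on a light background. The helper name, window size and cutoff are assumptions to tune against real images.

```python
from PIL import Image, ImageFilter

def thicken_strokes(img, size=3, cutoff=128):
    """Grow dark strokes on a light background, then force pure black and white."""
    gray = img.convert("L")
    # MinFilter replaces each pixel with the darkest value in its window,
    # which widens dark lines and closes small gaps between them.
    grown = gray.filter(ImageFilter.MinFilter(size))
    return grown.point(lambda p: 0 if p < cutoff else 255)
```

For light text on a dark background, ImageFilter.MaxFilter plays the same role.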

Tesseract: The Easy Part

Now, the image is ready for Tesseract. The command line tool can be run via the subprocess library from a temporary file, or a pipe can be established directly to Tesseract. The OCR splits the letters into key features, clusters on them, and then either compares the offsets of characters it finds to a letter set or runs a covariance comparison. I personally will test the covariance approach in a later article and am building a massive training set.

If you have a set of letters, be sure to use the -psm option and set it to 10. This tells Tesseract to perform single character analysis. Tesseract 4 and later spell the option --psm.
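
A minimal sketch of the subprocess route follows. It assumes the tesseract binary is installed and on the PATH; the function names and the legacy_flag switch are illustrative. The command line form is tesseract, then the image, then an output base name, and the result is written to that base name with a .txt extension.

```python
import subprocess

def tesseract_command(image_path, out_base, single_char=False, legacy_flag=True):
    """Build the argument list for the tesseract command line tool."""
    cmd = ["tesseract", image_path, out_base]
    if single_char:
        # Tesseract 3.x spells the option -psm; 4.x and later use --psm.
        cmd += ["-psm" if legacy_flag else "--psm", "10"]
    return cmd

def read_text(image_path, out_base="ocr_result", **options):
    """Run tesseract and return the recognized text from the output file."""
    subprocess.check_call(tesseract_command(image_path, out_base, **options))
    with open(out_base + ".txt") as handle:
        return handle.read().strip()
```

For a set of single letters, read_text would be called once per letter image with single_char=True.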

Running Tesseract is simple with pytesser.

  import pytesser
  from PIL import Image
  img = Image.open('captcha.jpg')
  print(pytesser.image_to_string(img))

Please. Do not Spam. I am not responsible for the misuse of this Code. This is purely educational!!!

For shits and giggles, scikit-learn offers an LDA algorithm that can be used with a covariance matrix and your own really, really big set of letters. Python really is excellent for AI, much easier to use than Java if you want infinite matrix size thanks to gensim, numpy, and scikit-learn. Dare I say numpy is even faster at matrix solving than almost all Java packages when not using a tool like Mahout, due to the use of BLAS and LAPACK when possible. That is in part why I used Tesseract. The other reason is that Google runs or ran a Captcha service while also supplying this really amazing OCR product.

Installing Hadoop on Windows 8.1 with Visual Studio 2010 Professional

Looking for a matrix system that works like gensim in Python, I discovered Mahout. Wanting to test this against the Universal Java Matrix Package, I decided to give the install a try. That, unfortunately, was a long, side-tracked road going well beyond the requirements listed in the install file.

In the end, I found a way that did not require Cygwin and went smoothly without requiring building packages in the Visual Studio IDE.

Installation instructions follow.

It is possible to download the hadoop install from a variety of mirrors.

The x64 Platform

Before starting, understand that Hadoop targets an x64 system. x86-64 works as well, and using 32-bit installations of CMake and the JDK will not harm the project. However, it is a 64-bit program and requires the Visual Studio 10 2010 Win64 generator to compile the hdfs project files.

Uninstall Visual Studio 2010 Express and Redistributables

Visual Studio 2010 Express uses a C++ redistributable that will cause the command prompt for the Windows SDK to fail and will also conflict with parts of the build run from the Visual Studio Command Prompt.


The following requirements are necessary in no particular order.

  1. Microsoft Visual Studio 2010 Professional with C++
  2. .NET Framework 4.0
  3. Zlib
  4. Most recent Maven
  5. MSBuild
  6. CMake
  7. Protoc
  8. Java JDK 1.7

Path Variables

The following must be in your path. Order only matters if you have and wish to use Cygwin: place MS Visual Studio before Cygwin so that Cygwin's copy of cmake, which will not work for this task, is not picked up. If this is the path you choose, it is better to just delete cmake from Cygwin and use the Windows version.

  1. MSBuild
  2. Cmake
  3. Visual Studio 2010
  4. Zlib
  5. protoc
  6. java

Environment Variables

The following can be set for an individual instance of command prompt.

  1. JAVA_HOME=path to jdk
  2. M2_HOME=path to maven
  3. VCTargetsPath=set to MSBuild/Microsoft.CPP/4.0 or other valid path to the CPP properties file
  4. Platform=x64

Run the Build

Open up a Visual Studio 2010 Win 64 command prompt and type the following command.

mvn package -Pdist,native-win -DskipTests -Dtar
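
If you would rather script the build than type it, the environment variables and the command above can be assembled in Python. This is only a sketch: the function name is an assumption, and every path passed in is a placeholder for wherever the tools live on your machine.

```python
import os

def hadoop_build_env(java_home, maven_home, vc_targets, base_env=None):
    """Return a copy of the environment with the build variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "JAVA_HOME": java_home,
        "M2_HOME": maven_home,
        "VCTargetsPath": vc_targets,
        "Platform": "x64",
    })
    return env

# The same Maven invocation as above, split into arguments.
BUILD_COMMAND = ["mvn", "package", "-Pdist,native-win", "-DskipTests", "-Dtar"]
```

The pair can then be handed to subprocess.call from inside the Visual Studio 2010 Win 64 command prompt, most likely with shell=True since Maven on Windows is started through a script.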

Resulting Files

The following files should appear in your unzipped hadoop directory under hadoop-dist/target.

  1. hadoop-2.6.X.tar
  2. hadoop-dist-2.6.X.jar

Special Thanks

Special thanks to the IT admin and security professional contractor at Hygenics Data LLC for the copy of Microsoft Visual Studio 2010.

Happy hadooping or being a Mahout.