A Cool Quick Find in Python

I have been looking for a way to read a string as a file in Python for a while. Basically, I want to be able to read images without first saving them, like with sometimes using Java’s .getBytes() method for strings combined with the ByteArrayInputStream  when offered the use of a stream input by an API method as a method parameter. The answer, BytesIO. This amazing class works wonders.

from io import BytesIO
from PIL import Image

Morning Joe/Python PDF Part 3: Straight Optical Character Recognition

*Due to time constraints, I will be publishing large articles on the weekends with a daily small article for the time being.

Now, we start to delve into the PDF Images since the pdf text processing articles are quite popular. Not everything PDF is capable of being stripped using straight text conversion and the biggest headache is the PDF image. Luckily, our do “no evil” (heavy emphasis here) friends came up with tesseract, which, with training, is also quite good at breaking their own paid Captcha products to my great amusement and company’s profit.

A plethora of image pre-processing libraries and a bit of post-processing are still necessary when completing this task. Images must be of high enough contrast and large enough to make sense of. Basically, the algorithm consists of pre-processing an image, saving an image, using optical character recognition, and then performing clean-up tasks.

Saving Images Using Software and by Finding Stream Objects

For linux users, saving images from a pdf is best done with Poplar Utils which comes with Fedora,CentOS, and Ubuntu distributions and saves images to a specified directory. The command format is pdfimages [options] [pdf file path] [image root] . Options are included for specifying a starting page [-f int], an ending page [-l int], and more. Just type pdfimages into a linux terminal to see all of the options.

pdfimages -j /path/to/file.pdf /image/root/

To see if there are images just type pdfimages -list.

Windows users can use a similar command with the open source XPdf.

It is also possible to use the magic numbers I wrote about in a different article to find the images while iterating across the pdf stream objects and finding the starting and ending bytes of an image before writing them to a file using the commands from open().write(). A stream object is the way Adobe embeds objects in a pdf and is represented below. The find command can be used to ensure they exist and the regular expression command re.finditer(“(?mis)(?<=stream).*?(?=endstrem)",pdf) will find all of the streams.


....our gibberish looking bytes....



Python offers a variety of extremely good tools via pillow that eliminate the need for hard-coded pre-processing as can be found with my image tools for Java.

Some of the features that pillow includes are:

  1. Blurring
  2. Contrast
  3. Edge Enhancement
  4. Smoothing

These classes should work for most pdfs. For more, I will be posting a decluttering algorithm in a Captcha Breaking post soon.

For resizing,OpenCV includes a module that avoids pixelation with a bit of math magic.

#! /usr/bin/python

import cv2



OCR with Tesseract

With a subprocess call or the use of pytesser (which includes faster support for Tesseract by implementing a subprocess call and ensuring reliability), it is possible to OCR the document.

#! /usr/bin/python

from PIL import Image

import pytesser



If the string comes out as complete garbage, try using the pre-processing modules to fix it or look at my image tools for ways to write custom algorithms.


Unfortunately, Tesseract is not a commercial grade product for performing PDF OCR. It will require post processing. Fortunately, Python provides a terrific set of modules and capabilities for dealing with data quickly and effectively.

The regular expression module re, list comprehension, and substrings are useful in this case.

An example of post-processing would be (in continuance of the previous section):

import re


lines=[x for x in lines if "bad stuff" not in x]


for line in lines:

if re.search("pattern ",line):

results.append(re.sub("bad pattern","replacement",line))


It is definitely possible to obtain text using tesseract from a PDF. Post-processing is a requirement though. For documents that are proving difficult, commercial software is the best solution with Simple OCR and Abby Fine Reader offering quality solutions. Abby offers the best quality in my opinion but comes at a high price with a quote for an API for just reading PDF documents coming in at $5000. I have managed to use the Tesseract approach successfully at work but the process is time consuming and not guaranteed.

Morning Joe: Using the command pattern in Spring

So, that ArrayList of Strings that you’ve been using to store activities and with their respective run commands. There is a design pattern that can better ground it. Spring has the ability to implement the Command Pattern extremely well.

This pattern offers guidance on the storage and use of commands at a later time. Utilizing this pattern in a stable and strong way will ensure the survival of long running code. It is always a good practice to develop around established patterns.

In this pattern, the client keeps a list of commands, I personally make a command default level class with a specific name and sequence number and place the list in another default class, which passes commands to an invoker that, in turn, runs them. The base UML is available below. You may need to click on it to see it better (that is a WordPress theme issue).


Obviously, the Object class and MainApp class needs to be flushed out and given an appropriate name. The Invoker class has been given an invoke command which will take in the command, access its sequence information and check to see if it is in order, run the command, and increment the counter. The counter integer and inOrder method are there to warn a user that something is amiss. Commands should be an ArrayList of Commands unless the number of Commands is set. A sort is included to make the pattern work better with sloppy users that accesses the ArrayList. Everything is under the same package in this design. The state in the object class helps in tracking an object if the programmer decides to reuse the class or create a static function that must be called again,especially in multithreaded environments.

Creating a series of commands to be run can be done through a configuration file or a user interface. Java interface classes would help to ensure that every class has a standard way of being accessed as needed. Personally, I like to include a run instance method which also allows for code to be more easily be reconfigured for a multi-threaded environment.

The interface may look like:

public interface CommandPattern{
       public void run();

In addition to storing commands, if a command is to be run directly after initialization of a class, it is probably better to use Java EE’s post construct annotation. The annotation signals that a class should be run directly after the program and beans initialize. Just write the annotation @PostConstruct on top of a method signature.

Finally, another potential benefit of Spring is that storing and using properties as needed is extremely useful. The configuration file already contains much of the needed code. Of course, it is possible to specify when these instances are created. The Spring framework contains eager instantiation for when an object may need to be called by a user or used in a concurrent environment. Both are controlled through the lazy-init attribute.

Simple,easy, hopefully starts getting some grounding for solid command creation in Spring programs. Cheers!

Morning Joe: Why IT Needs Software Requirements and the Disaster of Diving In Headfirst

First off, I plan on writing more technical articles at night but just moved in to a new apartment. I have 4 current articles in my queue dealing with optical character recognition of pdfs, tables, and captchas as well as creating a wireless Arduino device.

That said, I am now on my fifth call to IT regarding what should be the simple task of setting up my email in Office 365. I use Linux, as a large number of developers and techies do and sadly, each step to get my email online has been a painstaking process with IT only fixing the problem in front of its face. I have had to have the service first reinitialized, then a key attached to my account, and, finally, I am wrangling access to each product I should already have.

Something tells me that software analysis and the process of discovery could have saved my pain and reduced the number of insults hurled at a few college kids looking to make some side money with 0 industry skills. Of course, that is what we do with college students and immigrants, pay them peanuts, give them a little, and let them handle the work no one wants. On the other hand, the massive IT budgets that make my own look like a needle in a haystack are being misappropriated to pay for God knows what.

In today’s environment, where a day can cost a lifetime, this needs to change. Deploying resources to study upgrades and major changes is a must and that means analyzing the means of failure and working around them.

Failures come from a variety of sources. Sara Base offers a good review of them in A Gift of Fire. They include:

  1. Lack of understanding of the material that can be overcome with research
  2. Arrogance
  3. Sloppy user interfaces
  4. Too Little Testing
  5. Failure to understand a products uses and potential conflicts
  6. Funding
  7. The pressure for profit
  8. Product loyalty and fads (one that I have personally noticed)
  9. An unqualified workforce (another thing I have noticed)
  10. Not understanding the depth and needs of the user base to an adequate degree

In this instance, issues 1, possibly 2, 4,5,8, and 9,10 have created a perfect storm that is now harming every aspect of the institution’s communications.

Most of the issues can be overcome with research and testing of a product. I implore IT to take these issues to heart. It costs time and a great deal of money not to do this.

Morning Joe: Python-Is the Language Everyone is Trying to Create Already Around?

NodeJs, Ruby on Rails, VB Script and more are all seemingly targeted at creating an easier to code platform to replace Java,C++, and C# as dominant languages in certain spheres such as web and server development where quick deployment may mean more than a faster running language (Java and C# run at the same speed by the way). Each is hailed as a smaller, more compact language and yet each is fairly slow. It appears that the real title of simple to write replacement language that runs well on today’s fast and memory intensive machines came around in 1990. Meet Python, the utility knife language that can claim the title of “already done it” or “already on it” for just about every supposed advancement in programming for over the past 34 years. It is a strong scientific language in addition to being quick to deploy and easy to learn. Rather than bring about a comparison to the board game Othello, a second to learn but a lifetime to master, it is truly a language for all comers

The language is compact, weak typed, and can replace just about any Java code in half or even less lines of code.


  1. Small and compact: most Python code is aimed at quick and easy deployment. Modules exist for just about anything (see the Python Package library or even your local Linux yum or rpm repository)
  2. Cross-platform
  3. Map, filter, and Reduce are already included in the standard modules.
  4. Weak typing allows for the guessing of types and reuse of variables more easily
  5. It supports nearly every server implementation and has for decades, from Apache to Nginx
  6. It works with Spring (included already in 2.7 or higher) and has a dedicated web framework in Djanjo
  7. Lambdas and list comprehension have existed well before Java 8, reducing code such as for loops to a single line.
  8. Python is both a scripting and object oriented language, giving access well beyond running code from the command line.
  9. Major organizations and key players have created applications ranging from the Google Apps Engine to OpenCV’s artificial intelligence algorithms and PyCuda for accessing the NVidia graphics card
  10. Engineers and scientists are increasingly using the language with microprocessor boards available at low cost (Raspberry Pi and the MicroPy boards are two of the Python based microprocessor boards)
  11. Modules are easy to deploy since all programs are already modules, much like with java classes accessing each other.
  12. Code can be deployed without actually pre-compiling anything (a benefit for quick access and a curse for runtime speed).
  13. Has extensive support for statistical and mathematical programming as well as graphing and image manipulation with MatPlotLib,OpenCv, Numpy, and SciPy
  14. Can manage and create subprocesses easily.
  15. Network Programming, web server programming with frameworks or cgi/fcgi, multi-processing, and other server development are less code intensive compared to C# and Java
  16. Installing new plugins can be done without an internet search from easy_install or the PIP repository and, on linux, from the rpm or yum repositories.
  17. Includes support for running other languages or can be used to control other runtime environments,the JVM included, using Cython for C,IronPython for .NET, and Jython for Java.
  18. Modules are available for most if not all databases from Oracle and PostgreSQL to MySQL as well as applications for big data such as Cassandra
  19. Developed by an open source community and monitored by a committee with a budding certification track co-developed by the publisher O’Reilly and the chair of the Python committee. The University of Washington also offers a certificate. I would consider this as more of a testament to the popularity and extensive coverage of the language as opposed to the solid certifications requiring extensive theoretical and/or technical knowledge from Cisco, CompTIA, or Oracle.
  20. Over 30 years of continuous development,even more than Java.


  1. 1000 times slower than C or 100 times slower using CPython yet still comparable to or better than Ruby on Rails and the other “replacement” languages.
  2. Many modules require a variety of unstated dependencies discovered at installation time.
  3. Like Java, many of the activities require modules.

Morning Joe: Are Nesting Loop Comprehenders in Python Somehow Faster than Nested Loops?

As I learn more about Python at work, I have come across list comprehension functions and lambdas. The list comprehender format is [x for x in [] if condition] and the lambda format is (lambda x,y: function, x,y). The former seems extremely fast but is it faster than a nested loop when combined in a similar way?

Logic dictates that both will run in O(n^2) time. I performed a basic test on one, two, and four loops using the time module where a list of 100000 integers is created and then an iteration moves down the list, adding a random number to each slot (pretty easy but I need to get to working). Obviously, the third loop is skipped so that the triple nesting actually has an effect.

The programs are run 32 times apiece (CLT numbers), and averaged. The test computer is a Lenovo laptop with a 2.0 ghz dual core processor and 4 gb of RAM. The results are below.

One Loop

    for i in range(0,32):
        for i in range(0,100000):
        list=[x+random.randint(1,100) for x in list]
    print "Comprehension Avg. Time "+str(avg/32)
    for i in range(0,32):
        for i in range(0,100000):
        for i in range(0,len(list)):
    print "Loop Avg. Time:"+str(avg/32)

Loop Time: .24775519222(seconds)
Comprehension Function Time: .246111199926 (seconds)

Doubly Nested Loop
In the presence of a time constraint, I have only included the loop code. The remaining code remains the same.

     list=[x+random.randint(1,100) for x in [y+random.randint(1,100) for y in list]]

     for i in range(0,2):
            for j in range(0,len(list)):

Loop Time: 0.502061881125 (seconds)
Comprehension Function Time: 0.451432295144 (seconds)

Quadruple Nested Loop

     list=[x+random.randint(1,100) for x in [y+random.randint(1,100) for y in [z+random.randint(1,100) for z in [a+random.randint(1,100) for a in list]]]]

     for i in range(0,2):
            for j in range(0,2):
                for k in range(0,len(list)):

Loop Time: 1.08803078532 (seconds)
Comprehension Function Time: .951290503144(seconds)

As the time complexity measurements suggest, the time for each goes up with each level of nesting. However, the differences are extremely small. If any one has an advantage, it seems to be the comprehension but that difference is to small to declare a winner. That difference could be due to the random number function or other factors. The slight difference does seem to grow with each loop though. I would perform a more comprehensive test before making any determinations but comprehensions are faster to write anyway.

Ending and Starting Bytes for Images

So, I needed to find the starting and ending bytes for images and I would like to save them somewhere. Why not here? Please let me know if I need to or can make changes to the table. If we can get the end of file bytes, it will make extraction and manipulation in documents easier.

The starting bytes are well documented but the end bytes are not.

Here is what I found regarding the common formats.

Image Type Start Bytes End Bytes Start Word End Word
JPEG 0xd8 0xff 0xff 0xd9
PNG 0x89 0x50 0x4E 0x47 0x0D 0x0A 0x1A 0x0A 0x49 0x45 0x4e 0x44 –Specs– IEND
GIFF 0x47 0x49 0x46 0x38 0x3B GIF87a | GIF9a ; [9 bit ending is 101h]
TIFF-Motorola 0x4d 0x4d 0x00 0x2a

TIFF-Intel 0x49 0x49 0x2a 0x00 II
PGM2 0x50 0x35 P5
PLain PGM 0x50 0x32 P2
PBM 0x42 0x4D P1
BMP 0x42 0x4D BM

*A GIF also marks itself by its format (7a or 9a to form the workd GIF8[format] in its magic number) and the end appears to be 101h with a EOF of 0x3B but this is a bit weird.
*The TIFF formats are apparently Big Endian for Motorola and Little Endian for Intel.
* The best statement I could find for a Tiff is that “Each strip ends with the
24-bit end-of-facsimile block (EOFB)” from a document refering to TIFF 1.0.
*A JPEG may also have the JFIF structure and be discoverable this way since JFIF is usually the first noticeable part of the image when converted to a string. (JFIF: 0x4a 0x46 0x49 0x46)
*PGM comes in two formats as does TIFF. While the tiff differences are listed, pgm differs in that plain pgm stores one image and pgm2 (both .pmg) stores more than one. Both are pure black and white, binary, photos.

JPG: http://en.wikipedia.org/wiki/JPEG
PNG: http://en.wikipedia.org/wiki/Portable_Network_Graphic
GIF: http://en.wikipedia.org/wiki/Graphics_Interchange_Format | http://www.onicos.com/staff/iz/formats/gif.html
TIFF: http://www.fileformat.info/format/tiff/egff.htm

Morning Joe: Legality of Acquiring Scraped Data

One of my tasks at the entry level besides basic normalization, network programming, ETL, and IT work is to acquire data using just about anything. Being in the US this sort of data acquisition can be problematic.

I did some research since recent court rulings seem a bit mixed. Legally, in the US, there are a few factors that seem to be important.

Illegal acts obviously include targeting others in an attack. Are you doing anything that is akin to hacking or gaining unauthorized access via the Computer Fraud and Abuse Act. Exploiting vulnerabilities and passing SQL in the URL to open a database no matter how bad the idiot programming like that was is illegal at the felony level with a 15 year sentence (see the cases where an individual exploited security vulnerabilities in Verizon). Also, add a time out even if you round robin or use proxies. DDoS attacks are attacks. 1000 requests per second can shut down a lot of servers providing public information. The result here is up to 15 years in jail.

Copyright law is starting to becom important as well though. Pure replication of data that is protected is illegal. Even 4% replication has been deemed a breach. With the recent gutting of the DMCA, a person is even more vulnerable to civil and criminal penalties if somewhat knowingly or negligently serving this data to others. It is nearly impossible to tell if mixed data is obtained illegally though.

The following from the verified Wikipedia scraping entry where all of the cases are real says it all.

U.S. courts have acknowledged that users of “scrapers” or “robots” may be held liable for committing trespass to chattels,[5][6] which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. The best known of these cases, eBay v. Bidder’s Edge, resulted in an injunction ordering Bidder’s Edge to stop accessing, collecting, and indexing auctions from the eBay web site.

Paywalls and Product offer another significant though easy to skirt boundary. When going behind paywalls, contracts are breachable by clicking an agreement not to do something and then doing it. This is particularly damaging since You add fuel to the protection of negligence v. willingness [an issue for damages and penalties not guilt] in civil and any criminal trials. Ignorance is no defense.

Outside of the US, things are quite different. EU law and other law is way more lax. Corporations with big budgets dominate our legal landscape. They control the system in a very real way with their money in a way that they do not elsewhere, at least not as much despite being just as powerful in almost every respect.

The gist of the cases and laws seems to point towards getting public information and information that is available without going behind a pay wall. Think like a user of the internet and combine a bunch of sources into a unique product. Don’t just ‘steal’ an entire site protected site.

Here are a few examples. Trulia owns its information but you could use it to go to an agents website or collect certain information. However, accessing protected data is not legal and just re-purposing a site seems to be as well. The legal amount of pulled information is determinable. Also, a public MLS listing lookup site with no agreement or terms and offering data to the public is fair game. The MLS numbers lists, however, are normally not fair game since access is heavily guarded behind a wall of registration requiring some fakery to get to.

If a researcher can get to data, so can you. If a researcher needs permission, so do you. A computer is like having a million corporate researchers at your disposal.

As for company policy, it is usually used internally to shield from liability and serves as a warning but is not entirely enforceable. The legal parts letting you know about copyrights and such are and usually are supposed to be known by everyone. Complete ignorance is not a legal protection. It does provide a ground set of rules. Be nice, or get banned is that message as far as I know.

My personal strategy is to start with public data and embellish it within legal means.

Canny Edge Detection in Java

So you don’t really have a decent graphics card, CUDA in C or pyCuda are not options since you don’t have a NVIDIA card, or you just want something completely cross-platform without a large amount of research. Canny edge detection in straight Java does not need to be slow.

Accomplishing a faster and even memory efficient canny edge detection algorithm only requires the use of loops and the proxy design pattern. Basically, simple code applied to the theory will do the trick. All of the code is available at my github repository.

The most important initial task is actually to pre-process the image to bring out the edges that are wanted and obscure those that might get caught up in the detection. This is likely standard practice as unwanted edges are still edges. It is possible to implement a series of simple corrections such as a Gaussian or box blur, denoise, or an unsharp mask. The links in the previous sentence were not used in the following blur or my code libraries but the concept remains the same.

Gaussian Blur Example
The gaussian blur is an equation based way of getting rid of unwanted edges. The equation averages pixels within a certain window using Gauss’ statistical normal distribution equation.

This equation is implemented simply in Java by applying a window to an image.

private void blurGauss(double radius) {
		// TODO create a kernel and implement a gaussian blur
		// this is a somewhat lazy implementation setting a definite center and
		// not wrapping to avoid black space

		// the radius of the kernal should be at least 3*the blur factor
		BufferedImage proxyimage=image.getImage();
		int rows_and_columns = (int) Math.round(radius);

		while (rows_and_columns % 2 == 0 & rows_and_columns != 0) {

		while (rows_and_columns > proxyimage.getWidth()) {
			rows_and_columns = 3;
		int centerx = ((rows_and_columns + 1) / 2) - 1;
		int centery = ((rows_and_columns + 1) / 2) - 1;

		// the kernel sum
		float sum_multiplier = 0;

		/* get the kernel */
		// the base for gaussian filtering
		float base_start = (float) (1 / (2 * Math.PI * Math.pow(radius, 2)));

		// the multiplier matrix to be applied to every pixel, ensured to be one
		float[][] arr = new float[rows_and_columns][rows_and_columns];

		// the central coordinates

		for (int i = 0; i < rows_and_columns; i++) {
			for (int j = 0; j < rows_and_columns; j++) {
				float exp = 0;

				// calculate the corners
				exp = (float) -1.0
						* (float) ((Math.pow((Math.abs(i - centerx)), 2) + (Math
								.pow(Math.abs(j - centery), 2))) / (2 * Math
								.pow(radius, 2)));
				float base = (float) (base_start * Math.exp(exp));
				arr[i][j] = base;

				sum_multiplier += base;


		/* replace the values by multiplying by the sum_multiplier */
		// get the multiplier
		sum_multiplier = (float) 1 / sum_multiplier;

		// multiply by the sum multiplier for each number
		for (int i = 0; i < rows_and_columns; i++) {
			for (int j = 0; j < rows_and_columns; j++) {
				arr[i][j] = arr[i][j] * sum_multiplier;


		// blur the image using the matrix
		complete_gauss(arr, rows_and_columns, centerx, centery);

	private void complete_gauss(float[][] arr, int rows_and_columns,int centerx, int centery) {
		// TODO complete the gaussian blur by applying the kernel for each pixel
		BufferedImage proxyimage=image.getImage();
		// the blurred image
		BufferedImage image2 = new BufferedImage(proxyimage.getWidth(),proxyimage.getHeight(), BufferedImage.TYPE_INT_RGB);

		// the r,g,b, values
		int r = 0;
		int g = 0;
		int b = 0;

		int i = 0;
		int j = 0;

		// the image height and width
		int width = image2.getWidth();
		int height = image2.getHeight();

		int tempi = 0;
		int tempj = 0;
		int thisx = 0;
		int thisy = 0;
		if (arr.length != 1) {

			for (int x = 0; x < width; x++) {
				for (int y = 0; y < height; y++) {

					// the values surrounding the pixel and the resulting blur
					// multiply pixel and its neighbors by the appropriate
					// ammount

					i = (int) -Math.ceil((double) rows_and_columns / 2);
					j = (int) -Math.ceil((double) rows_and_columns / 2);

					while (i < Math.ceil((double) rows_and_columns / 2)
							& j < Math.ceil((double) rows_and_columns / 2)) {

						// sets the pixel coordinates
						thisx = x + i;

						if (thisx = proxyimage.getWidth()) {
							thisx = 0;

						thisy = y + j;

						if (thisy = proxyimage.getHeight()) {
							thisy = 0;

						// the implementation
						tempi = (int) (Math
								.round(((double) rows_and_columns / 2)) + i);
						tempj = (int) (Math
								.round(((double) rows_and_columns / 2)) + j);

						if (tempi >= arr[0].length) {
							tempi = 0;

						if (tempj >= arr[0].length) {
							tempj = 0;

						r += (new Color(proxyimage.getRGB((thisx), (thisy)))
								.getRed() * arr[tempi][tempj]);
						g += (new Color(proxyimage.getRGB((thisx), (thisy)))
								.getGreen() * arr[tempi][tempj]);
						b += (new Color(proxyimage.getRGB((thisx), (thisy)))
								.getBlue() * arr[tempi][tempj]);


						if (j == Math.round((double) rows_and_columns / 2)) {
							j = 0;


					// set the new rgb values with a brightening factor
					r = Math.min(
											+ ((int) Math.round(arr[0].length
													* arr[0].length))));
					g = Math.min(
											+ ((int) Math.round(arr[0].length
													* arr[0].length))));
					b = Math.min(
											+ ((int) Math.round(arr[0].length
													* arr[0].length))));

					Color rgb = new Color(r, g, b);
					image2.setRGB(x, y, rgb.getRGB());
					r = 0;
					g = 0;
					b = 0;

					i = 0;
					j = 0;

A matrix is generated and then used to blur the image in several loops. Methods were created to make the code more understandable.

Although an “is-a” classification is often associated with interfaces or abstract classes, the proxy design pattern is better implemented with an interface that controls access to the “expensive object.”

Steps to Canny Edge Detection
Canny edge detection takes three steps. These steps prepare the image, mark potential edges, and weed out the best edges.

They are:

  1. Blur with or without denoise and convert to greyscale
  2. Perform an analysis based on threshold values using an intensity gradient
  3. Perform hysteresis

Sobel Based Intensity Gradient
One version of the intensity gradient (read here for more depth on the algorithm) is derived using the Sobel gradient. The gradient is applied in a similar way to the blur, using a window and a matrix.

The matrix finds specific changes in intensity to discover which potential edges are the best candidates. Convolution is performed on the matrix to obtain the best result.

Perform Hysteresis
Hysteresis weeds out the remaining noise from the image, leaving the actual edges. This is necessary in using the Sobel gradient since it finds too many candidates. The trick is to weed out edges from non-edges using threshold values based on the intensity gradient. Values above and below a chosen threshold are thrown out.

A faster way to perform this, if necessary, is to try to use a depth first search-like algorithm to find the ends of the edge, taking connected edges and leaving the rest. This action is fairly accurate.

The Code
Sobel Based Intensity Gradient

private void getIntensity() {
		// TODO calculate magnitude
		 * Kernels
		 * G(x) G(y) -1|0|1 -1|-2|-1 -2|0|2 0|0|0 -1|0|1 1|-2|1
		 * |G|(magnitude for each cell)approx. =|G(x)|+|G(y)|=
		 * |(p1+2p2+p3)-(p7+2p8+p9)|+|(p3+2p6+p9)|-|(p1+2p4+p7)|blank rows or
		 * colums are left out of the calc.

		// the buffered image
		BufferedImage image2 = new BufferedImage(image.getWidth(),
				image.getHeight(), BufferedImage.TYPE_BYTE_GRAY);

		// gives ultimate control can also use image libraries
		// the current position properties
		int x = 0;
		int y = 0;

		// the image width and height properties
		int width = image.getWidth();
		int height = image.getHeight();

		// iterate throught the image
		for (y = 1; y < height - 1; y++) {
			for (x = 1; x < width - 1; x++) { 				// convert to greyscale by masking (32 bit color representing 				// intensity --> reduce to greyscale by taking only set bits)
				// gets the pixels surrounding hte center (the center is always
				// weighted at 0 in the convolution matrix)
				int c1 = (image.getRGB(x - 1, y - 1) & 0xFF);
				int c2 = (image.getRGB(x - 1, y) & 0xFF);
				int c3 = (image.getRGB(x - 1, y + 1) & 0xFF);
				int c4 = (image.getRGB(x, y - 1) & 0xFF);
				int c6 = (image.getRGB(x, y + 1) & 0xFF);
				int c7 = (image.getRGB(x + 1, y - 1) & 0xFF);
				int c8 = (image.getRGB(x + 1, y) & 0xFF);
				int c9 = (image.getRGB(x + 1, y + 1) & 0xFF);

				// apply the magnitude of the convolution kernal (blank
				// column/row not applied)
				// differential x and y gradients are as follows
				// this is non-max suppression
				 * Lxx = |1,-2,1|*L Lyy= {1,-2,1}*L ({} because its vertical and
				 * not horizontal)
				int color = Math.abs((c1 + (2 * c2) + c3)
						- (c7 + (2 * c8) + c9))
						+ Math.abs((c3 + (2 * c6) + c9) - (c1 + (2 * c4) + c7));

				// trim to fit the appropriate color pattern
				color = Math.min(255, Math.max(0, color));

				// suppress non-maximum
				// set new pixel of the edge
				image2.setRGB(x, y, color);

		// reset the image
		image = image2;


private void hysterisis() {
		// TODO perform a non-greedy hysterisis using upper and lower threshold
		// values
		int width = image.getWidth();
		int height = image.getHeight();

		Color curcol = null;
		int r = 0;
		int g = 0;
		int b = 0;

		ve = new String[width][height];

		for (int i = 0; i < width; i++) {
			for (int j = 0; j < height; j++) {
				ve[i][j] = "n";

		for (int i = 0; i < height; i++) {

			for (int j = 0; j < width; j++) { 				curcol = new Color(image.getRGB(j, i)); 				if (ve[j][i].compareTo("n") == 0 						& (((curcol.getRed() + curcol.getBlue() + curcol 								.getGreen()) / 3) > upperthreshold)) {
					ve[j][i] = "c";
					image.setRGB(j, i, new Color(255, 255, 255).getRGB());

					follow_edge(j, i, width, height);
				} else if (ve[j][i].compareTo("n") == 0) {
					ve[j][i] = "v";
					image.setRGB(j, i, new Color(0, 0, 0).getRGB());



Depth First Like Noise Reduction

private void follow_edge(int j, int i, int width, int height) {
		// TODO recursively search edges (memory shouldn't be a problem here
		// since the set is finite and should there should be less steps than
		// number of pixels)

		// search the eight side boxes for a proper edge marking non-edges as
		// visitors, follow any edge with the for-loop acting
		// as the restarter

		int x = j - 1;
		int y = i - 1;
		Color curcol = null;

		for (int k = 0; k < 9; k++) { 			if (x >= 0 & x < width & y >= 0 & y < height & x != j & y != i) { 				curcol = new Color(image.getRGB(j, i)); 				// check color 				if (ve[x][y].compareTo("n") == 0 						& ((curcol.getRed() + curcol.getBlue() + curcol 								.getGreen()) / 3) > lowerthreshold) {
					ve[x][y] = "c";
					image.setRGB(j, i, new Color(255, 255, 255).getRGB());

					follow_edge(x, y, width, height);
				} else if (ve[x][y].compareTo("n") == 0 & x != j & y != i) {
					ve[x][y] = "v";
					image.setRGB(j, i, new Color(0, 0, 0).getRGB());


			// check x and y by k
			if ((k % 3) == 0) {
				x = (j - 1);



Laplace Based Intensity Gradient as Opposed to Sobel

The Sobel gradient is not the only method for performing intensity analysis. A Laplacian operator can be used to obtain a different matrix. The Sobel detector is less sensitive to light differences and yields both magnitude and direction but is slightly more complicated. The Laplace gradient may also reduce the need for post-processing as the Sobel gradient normally accepts too many values.

The Laplace gradient uses 0 as a mask value, obtaining the following matrix.

-1 -1 -1
-1 8 -1
-1 -1 -1

The matrix is used to transform each pixel’s RGB value based on whether or not it is part of a candidate edge.

private void find_all_edges() {
		// TODO find all edges using laplace rather than sobel and hysterisis
		// (noise can interfere with the result)
		// the new buffered image containing the edges
		BufferedImage image2 = new BufferedImage(image.getWidth(),
				image.getHeight(), BufferedImage.TYPE_INT_RGB);

		// gives ultimate control can also use image libraries
		// the current position properties
		int x = 0;
		int y = 0;

		// the image width and height properties
		int width = image.getWidth();
		int height = image.getHeight();

		 * Denoise Using Rewritten Code found at
		 * http://introcs.cs.princeton.edu/
		 * java/31datatype/LaplaceFilter.java.html
		 * Using laplace is better than averaging the neighbors from each part
		 * of an image as it does a better job of getting rid of gaussian noise
		 * without overdoing it
		 * Applies a default filter:
		 * -1|-1|-1 -1|8|-1 -1|-1|-1

		// perform the laplace for each number
		for (y = 1; y < height - 1; y++) {
			for (x = 1; x < width - 1; x++) {

				// get the neighbor pixels for the transform
				Color c00 = new Color(image.getRGB(x - 1, y - 1));
				Color c01 = new Color(image.getRGB(x - 1, y));
				Color c02 = new Color(image.getRGB(x - 1, y + 1));
				Color c10 = new Color(image.getRGB(x, y - 1));
				Color c11 = new Color(image.getRGB(x, y));
				Color c12 = new Color(image.getRGB(x, y + 1));
				Color c20 = new Color(image.getRGB(x + 1, y - 1));
				Color c21 = new Color(image.getRGB(x + 1, y));
				Color c22 = new Color(image.getRGB(x + 1, y + 1));

				/* apply the matrix */
				// to check, try using gauss jordan

				// apply the transformation for r
				int r = -c00.getRed() - c01.getRed() - c02.getRed()
						+ -c10.getRed() + 8 * c11.getRed() - c12.getRed()
						+ -c20.getRed() - c21.getRed() - c22.getRed();

				// apply the transformation for g
				int g = -c00.getGreen() - c01.getGreen() - c02.getGreen()
						+ -c10.getGreen() + 8 * c11.getGreen() - c12.getGreen()
						+ -c20.getGreen() - c21.getGreen() - c22.getGreen();

				// apply the transformation for b
				int b = -c00.getBlue() - c01.getBlue() - c02.getBlue()
						+ -c10.getBlue() + 8 * c11.getBlue() - c12.getBlue()
						+ -c20.getBlue() - c21.getBlue() - c22.getBlue();

				// set the new rgb values
				r = Math.min(255, Math.max(0, r));
				g = Math.min(255, Math.max(0, g));
				b = Math.min(255, Math.max(0, b));
				Color c = new Color(r, g, b);

				image2.setRGB(x, y, c.getRGB());
		image = image2;


The following before and after pictures show the results of applying the Sobel based matrix to the image and using the depth first search like approach to hysteresis.

cross-country before


edge detected

Python Based Edge Detection

In Python, OpenCV performs this operation in a single line.

import cv2


Morning Joe: Big Data Tools for Review

Big data is a big topic and many tools are starting to Spring up. However, implementations of the old SQL standard are falling far behind these tools. My recent post on using a Fork Join Pool in insertion and pulling data can help if using PostgreSQL but also destroys bandwidth. It cause headaches for everyone whenever I need to do some serious checking of my ETL,parsing, normalization, data work, or other processes outside of our co-location. Although this is nothing new, I’ve compiled a list of tools to go forth and make little rocks from big rocks all day.

I’ve found some companies and a tool for a big review later on, this is my prework post while my IDE starts up. So far, I have found technologies for SQL databases that use Hadoop,Fractal Trees, and Cassandra to speed up the process. They are not focused specifically on speed but can help create faster database access and lower coding time.

What I’ve found so far:

  • Cassandra (open source): promises scalability and availability alongside a plethora of features (maybe for implementing other tools)
  • Oracle Data Integration Adapter for Hadoop: promises the speed of hadoop connected to an Oracle database
  • BigSQL (open source):promises to combine Cassandra,PostgreSQL, and Hadoop into a blazing fast package for analsis.
  • MapR technologies (somewhat open source): offers a wide variety of products to improve speed in querying and analysis from Hive and map reduce, to actual hadoop
  • Fractal Tree Indexing (open source): Tokutech’s fractal tree indexing speeds up insertions using buffers on each tree node
  • Alteryx: a tool for quicker data processing though not quite as fast as the others (good if your budget does not allow clusters but allows something better than Pentaho
  • MongoDB (open source): Combines map reduce and other technologies with large databases. Tokutech tested its fractal tree indexes on MongoDB
  • Pentaho (open source): The open source version of Alteryx

Many of these tools are already implemented in others such as Pentaho. Personally, I would like to see a SQL-like language that uses these tools alongside a query processor. It would make the tasks even faster to write, think Java v. Python. You could have 10 lines of map reduce code, 5 minutes of click and drag, or a 1 line easy-going query that writes as you think.

To be clear, I am not ranking these, only marking them for future review since this is what has piqued my interest today. Cheers!