Lack of Nested and Implicit Data Support in Drill, PostgreSQL, and Pentaho When Working with JSON Data

JSON is great. It can hold a variety of data types in an expected format, and it is becoming easier and easier to work with in existing tools, making it a future workhorse for NoSQL-based ETL. However, not least because NoSQL ingestion needs to end up in relational tables that follow SQL standards, there is still one wrinkle to iron out. Ingestion of JSON will not break out nested tables and requires direct knowledge of the data to complete tasks.

This may seem petty, but when millions of records are being read, it clearly is not.

In Drill, this could potentially be overcome by creating a table for every submap we wish to analyze, but CREATE TABLE from the tool itself will not work. Instead, it is necessary to limit use cases to the data we want to use.

In PostgreSQL, it is possible to concatenate JSON data in a query whose individual results can then be saved. It is also possible to 'pop' keys that are not needed. However, this approach requires many different tables, roughly one per somewhat-normalized form, and it requires recombining the data afterwards.


SELECT row_to_json(r.*)
FROM (
    SELECT nonJson AS nonJson,
           ((((data->'level1')::jsonb - 'bkey1')::jsonb - 'bkey2')::jsonb - 'bkey3')::jsonb AS jdata
    FROM table
    WHERE data::jsonb ?| array['mustHaveKey']
      AND data::jsonb ?| array['notHaveKey'] IS FALSE
) r

Drill is still well ahead of Pentaho and PostgreSQL in terms of conversion, though. PostgreSQL can guess types but has no function to attempt to generate tables automatically. Pentaho requires explicit conversion as well.

Of course, if one already knows every key that will be present, this is not a problem. That, however, means more maintenance, as it is then impossible to write programs that automatically handle changes to the data. Perhaps implicit conversion will happen soon, but any argument about data type issues should really look at the depth of the standards and conform to them instead of complaining.

Open Source Data Science, the Great Resource No One Knows About

There is a growing industry for online technology courses that is starting to gain traction among many who may have been in school when fields like data science were still the plaything of graduate students and PhDs in computer science, statistics, and even, to a degree, biology. However, these online courses will never match the pool of knowledge one could drink from by taking even an undergraduate computer science or mathematics class at a middling state school today (I would encourage everyone to avoid business schools like the plague for technology).

In an industry that is constantly transforming itself, and especially where the field of data will provide long-term work, these courses may appear quite appealing. However, they are often too shallow to provide much depth, and thinking it is possible to pick up and understand the thousand-page thesis that led to the stochastic approach to matrix operations, and eventually to Spark, is ridiculous. We are all forgetting about the greatest resources available today. The internet, open source code, and a search engine can add layers of depth to what would otherwise be an education unable to provide enough grounding for employment.

Do Take the Online Courses

First off, the online courses from Coursera are great. They can provide a basic overview of parts of the field. Urbana-Champaign offers a great data science course, and I am constantly stumbling into blogs presenting concepts from it. However, what can someone fit into 2-3 hours per week for six weeks in a field that may encompass 2-3 years of undergraduate coursework, and even some masters-level topics, before beginning to become expert-level?

You may learn a basic k-means or deploy some subset of algorithms, but can you optimize them, and do you really know more than the Bayesian probabilities you likely also learned in a statistics class?

Where Open Source Fits In

Luckily, many of the advanced concepts and a ton of research is actually available online for free. The culmination of decades of research is available at your fingertips in open source projects.

Sparse Matrix research, edge detection algorithms, information theory, text tiling, hashing, vectorizing, and more are all available to anyone willing to put in the time to learn them adequately.

Resources

Documentation is widely available, and often on GitHub, for the major projects.

These GitHub accounts also contain useful links to websites explaining the code, containing further documentation (javadocs), and giving some conceptual depth and further research opportunities.

A wide majority of conceptual literature can be found with a simple search.

Sit down, read the conceptual literature. Find books on topics like numerical analysis, and apply what you spent tens or even hundreds of thousands of dollars to learn in school.

Smoothing: When and Which?

Smoothing is everywhere. It is preprocessing for signal processing, it makes text segmentation work well, and it is used in a variety of programs to cull noise. However, there are a wide variety of ways to smooth data to achieve more appropriate predictions and lines of fit. This overview should help get you started.

When to Use Smoothing

The rather simple art of smoothing data is best performed when making predictions with temporal data that comes from one or more potentially noisy sources (think of a weather station with a wind speed monitor that is a bit loose or only partially reliable) and when dealing with tasks such as converting digital sound to analogue waves for study. When a source appears capable of making decent predictions but is relatively noisy, smoothing helps. It is even used in fitting smoothing splines.

Types of Smoothing to Consider

Now that the rather blunt explanation is out of the way, the following list is based on my own use with text mining and some sound data. I have tried to include the best resources I could for these.

  • Rectangular or Triangular: Best used for non-temporal, fairly well-fitting data where future values matter as much as past ones (text segmentation is an example).
  • Simple Exponential: Creates a moving-average style smoothing that weights past events. It is not a terrific smoothing tool, but it is quick and works well when data is correlated (it could work with sound, though sound may be better served by a Hamming or Hanning window). Unlike double and triple exponential smoothing, the algorithm requires no past experience or discovery to apply correctly, for better or worse (see the sketch after this list).
  • Double and Triple Exponential Smoothing: Works well with time series data. Peaks and valleys are better preserved, and triple exponential smoothing works well with seasonal data. They require some manual training, or an algorithm relying on previous experience, to choose a good alpha value. Again, past events are more heavily weighted.
  • Hanning and Hamming Windows: Periodic data (wave forms) may work well with this type of smoothing. They are based on the cosine function, and past experience is not needed. For really noisy data, try the more intensive Savitzky-Golay filter.
  • Savitzky-Golay: This smoothing works fairly well but preserves peaks and valleys within the broader scope of the data, so it is not ideal for a truly smooth curve. However, if some noise is really important, this is a great method. Its publication was actually considered one of the most important by Analytical Chemistry for spectral analysis. It uses a localized least squares technique to accomplish its feat.

    However, do not rely on the matrix-based calculation of OLS as the most efficient option; gradient descent is clearly the winner here. No self-respecting programmer will use a matrix implementation on large data sets. Spark contains an optimized gradient descent algorithm for distributed and even single-node programming. The algorithm is tail recursive and seeks to minimize a cost function.

  • Distribution Hint

    For the programmer or developer looking to distribute a Savitzky-Golay calculation without Spark gradient descent, mapPartitions works well on the local areas. It also works well when smoothing many splines or for the Hanning and Hamming window based smoothing.
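
    As a concrete reference for the exponential items above, here is a minimal sketch of rectangular (moving average) and simple exponential smoothing in plain Python; the window size and alpha are illustrative values, not tuned recommendations.

        def moving_average(data, window=3):
            """Rectangular smoothing: average each point with its neighbors."""
            out = []
            for i in range(len(data)):
                lo = max(0, i - window // 2)
                hi = min(len(data), i + window // 2 + 1)
                out.append(sum(data[lo:hi]) / float(hi - lo))
            return out

        def exponential_smooth(data, alpha=0.5):
            """Simple exponential smoothing: blend each point with the running smooth."""
            smoothed = [data[0]]
            for x in data[1:]:
                smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
            return smoothed

        noisy = [1, 2, 8, 3, 4, 9, 5, 6]
        print exponential_smooth(noisy)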

    Secure Your Data Series: Why a Captcha Alone Fails

    Since Captchas are meant to be unreadable by a computer, they are a great tool for learning the task of OCR. As even Google now admits, Captchas are breakable. This is concerning from a security standpoint, revealing that even an open source OCR like Tesseract can defeat the system. A little computer vision and some basic pre-processing in Python will break most Captchas. That is why I champion mapping and analyzing click stream data with Latent Dirichlet Allocation to classify human from non-human, or hacker from non-hacker (stay tuned, it's coming). Add the LDA approach to a Captcha system with a higher probability of failure for automated processes (guessing here), use the click stream data to form vectors (literal mathematical vectors), and security becomes a lot better.

    Let's do some Captcha breaking, but beware: this is purely educational and not for breaking the law! Many Captchas have sound options to comply with accessibility laws, and the simpler puzzles can be broken with speech recognition such as Sphinx4. However, the dilution of the sound in modern Captchas can make OCR useful for aiding the disabled. Basically, there are uses of this code that are likely to remain legal as companies look to narrow the definition of authorized access.

    Image Preprocessing

    Captcha images contain all sorts of clutter and manipulations with the goal of eliminating readability. This makes pre-processing critical to the goal at hand. Speed is the crucial consideration in this task so any library or custom code needs to be extremely efficient.

    Two modules exist in Python that help with preprocessing. They are OpenCV (cv2) and pillow (PIL). My image toolset can also be used for these tasks.

    OpenCV is a powerful open source library that aims to make a lot of calculus- and differential-equation-based computer vision code incredibly easy to deploy, and it runs extremely quickly. The modules are largely written in C, and there is also a C++ API. OpenCV is great if you want to write custom code too, as the tutorials dive deeply into the mathematics behind each program. For this case, functions from cv2 including resize (which goes further than basic expansion), Canny edge detection, and blurring are highly effective. After writing the same routines in Java, and even using a graphics card library in Python to do the same tasks, it appears that OpenCV matches or is only slightly slower than the custom code. The images are small, though.

    Other modules are incredibly good at contouring, corner detection, and key point finding. If you have a computer vision or artificial intelligence task, OpenCV is the go-to API.

    import cv2
    

    For basic pre-processing, pillow is also an incredibly fast library. Again, compared to my own Java code, the modules work at about the same speed. The idea behind them is the use of kernels: small matrices filled with weights that are slid across the image as a window to redistribute color.

    from PIL import ImageEnhance
    from PIL import ImageFilter
    from PIL import Image
    

    All of the necessary pre-processing, whether custom or module based, can be completed in less than one second, producing the result shown below. However, it is necessary to fiddle with the images until they look as close as possible to the way a normal input would.

    Overall, the total time taken to break a Captcha ranges from roughly one second or less to four seconds on a dual core machine with 4 GB of RAM. Completing tasks with custom code may improve speed when using faster matrix libraries, but numpy is fairly efficient in today's world.

    One extremely useful trick is to resize the image and improve contrast.

        # convert to grayscale and wrap as a numpy array
        img=Image.open('captcha.jpg').convert('L')
        img=np.array(img,np.uint8).copy()

        # double the size; cv2.resize expects (width,height)
        img=cv2.resize(img,(img.shape[1]*2,img.shape[0]*2))
        img=Image.fromarray(img)

        # boost the contrast; enhance() returns a new image rather than modifying in place
        enhancer=ImageEnhance.Contrast(img)
        img=enhancer.enhance(2)
    

    If using numpy, there is an extremely useful way to apply a function to all pixels of an image as well using some Python magic.

       numpyimage[numpyimage<128]=0
    

    Decluttering with Statistics

    Certain transforms and other techniques may leave unwanted clutter in the image. Ridding an image of small or abnormally sized objects can be done with basic statistics. Remember that 1.5 standard deviations (where the standard deviation is sqrt(sum((x - mean)^2) / n)) marks a normal outlier and 3 standard deviations an extreme outlier. This can be used to eliminate elements that are longer than others. The example below follows each object and eliminates it based on width, and it has proven successful. If vertical objects are present, surface area coverage may be a better criterion. These work better than contouring here because the shapes are not always connected properly; they need to be readable by a human, not a computer.

    def declutter(self,inarr):
        """Declutter an image by whitening horizontal runs of dark pixels
        whose width is a statistical outlier."""
        arr=inarr
        height=len(arr)
        width=len(arr[0])
        
        total=0
        avg=0.0
        runs=[]
        
        #record each horizontal run of dark pixels as (row,start,width)
        for i in range(height):
            account=False
            ws=0
            for j in range(width):
                if arr[i][j]<128 and account is False:
                    #start of a dark run
                    account=True
                    ws=j
                elif arr[i][j]>=128 and account is True:
                    #end of a dark run
                    account=False
                    total+=1
                    avg+=(j-ws)
                    runs.append((i,ws,j-ws))
        
        if total==0:
            return (arr,0)
        avg/=total
        
        #calculate the standard deviation of the run widths
        sd=0.0
        for (i,ws,w) in runs:
            sd+=((w-avg)*(w-avg))
        sd=math.sqrt(sd/max(total-1,1))
        o=sd        #one standard deviation; use 1.5*sd to cut only extreme outliers
        
        #perform declutter: whiten any run whose width is an outlier
        removed=0
        for (i,ws,w) in runs:
            if w>(avg+o) or w<(avg-o):
                for k in range(w):
                    arr[i][ws+k]=255
                removed+=1
        
        print str(removed)+" objects removed"
        
        return (arr,removed)
    

     

    Rotating with the Bounding Box and an Ode to OpenCV

    In order to complete our work, it is a good idea to know how to find the minimum bounding box and rotate the image. This is made difficult in many instances by the fact that the letters in a Captcha are not always continuous black lines. Thankfully, OpenCV contains a multitude of feature detectors that can help outline or find key features for the different objects.

    My initial attempts at this involved a gaggle of different methods. After attempts at using a threshold with Euclidean distances to produce a set further reduced by basic statistics around centroids, and a few other methods, I arrived at contours. The Euclidean distance method would likely work on regular text lines but, here, characters are intentionally smashed together or mixed with unwanted lines. I kept getting double letters with different Captchas. The result of using these methods is the feeling of frustration.

    In contrast to the feeling of being lost, OpenCV’s contour class can find angles, bounding boxes that are rotated, and many other useful features. Contours rock.

    A contour basically uses defined edges. Algorithms include using Latent Dirichlet Allocation to group objects found from edge detection and marching squares with defined edges following certain angles and patterns, much like a Haar cascade in a way. Obviously any AI implementation is much stronger.

       def getContours(self,img):
            """
            Get Contours from CV2
            """
            #convert and blur before thresholding
            hsv_img=cv2.cvtColor(img,cv2.COLOR_BAYER_BG2BGR)
            hsv_img=cv2.medianBlur(hsv_img,5)
            
            #erode then dilate to remove small speckles
            kernel=np.ones((5,5),'uint8')
            hsv_img=cv2.dilate(cv2.erode(hsv_img,kernel,iterations=3),kernel)
            
            #keep only pure white pixels
            COLOR_MIN=np.array([255,255,255],np.uint8)
            COLOR_MAX=np.array([255,255,255],np.uint8)
            frame_threshed=cv2.inRange(hsv_img,COLOR_MIN,COLOR_MAX)
            
            ret,thresh=cv2.threshold(frame_threshed,127,255,0)
            return cv2.findContours(thresh,cv2.RETR_TREE,cv2.CHAIN_APPROX_SIMPLE)    
    

    With the contours found, it is a good idea to discover any patterns in spacing and use those to combine the boxes with an expanded intersection condition and a while loop, as sketched below.
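
    A minimal sketch of that merge step follows; it assumes (x, y, w, h) tuples from cv2.boundingRect, and the margin value is a guess to tune against your Captcha's spacing.

        def combineBoxes(boxes, margin=5):
            """Merge bounding boxes whose edges, expanded by margin, intersect."""
            merged = True
            while merged:
                merged = False
                out = []
                while boxes:
                    x, y, w, h = boxes.pop()
                    i = 0
                    while i < len(boxes):
                        x2, y2, w2, h2 = boxes[i]
                        #expanded intersection condition
                        if x <= x2 + w2 + margin and x2 <= x + w + margin and \
                           y <= y2 + h2 + margin and y2 <= y + h + margin:
                            nx, ny = min(x, x2), min(y, y2)
                            w = max(x + w, x2 + w2) - nx
                            h = max(y + h, y2 + h2) - ny
                            x, y = nx, ny
                            boxes.pop(i)
                            merged = True
                        else:
                            i += 1
                    out.append((x, y, w, h))
                boxes = out
            return boxes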

    Now, we can rotate our individual letters. The equation here is simple and the code can be written easily. The kernel for rotation about the center of an image is the standard rotation matrix [[cos(theta), -sin(theta)], [sin(theta), cos(theta)]], applied after translating the center of the image to the origin.

    The code for the equation in scipy is as follows. It is possible to apply the above matrix to each pixel to discover its new location as well. I wrote the kernel into the sample.

       def rotate(self,img,cnt):
            '''
            Utilizes contouring via cv2 to recognize letters and rotate them.
            Takes in a numpy array to work with and returns a set of numpy arrays.
            This will rotate on a provided contour.
            
            *Required Parameters*
            
            :param img: image as a numpy array
            :param cnt: contour to rotate on
            '''
            im=img
            try:
                #get the basic points
                box=np.int0(cv2.boxPoints(cv2.minAreaRect(cnt)))
                x0,y0=box[0]
                x1,y1=box[1]
                        
                
                #get the rotational angle
                degree=-1*(180*math.atan((y1-y0)/(x1-x0)))/math.pi
                
                print "Rotational Degrees: "+str(degree)
            except:
                return im
            
            #rotate with interpolation in scipy
            return ndimage.rotate(im,degree,mode='nearest',cval=100)
    

    Another, more powerful rotation uses the same ndimage rotate (whose interpolation is likely much better than simply multiplying each [(x, y)] pixel coordinate by the rotation matrix) and then crops. This version uses a best fit line, with the slope found using one of a variety of least squares calculations. The cv2.fitLine function used here returns a unit direction vector (vx, vy) along the line plus a point (x, y) on it.

       def rotateFromPixels(self,image,cnt,tolerance=(10*math.pi)/180):
            """
            Takes in contour information and rotates based on
            the line of best fit through contour points. I discovered
            that this could be done after getting the original box though
            the box may reduce the effect of large skews.
            
            This program also considers the minimum area rectangle in 
            case a letter is actually skewed only stretched (skew not detectable from box). I'm guessing
            that these will not always be the same since a statistical best fit
            is not always the same as finding the absolute max and min points.
            
            *Required Parameters*
            :param image: the image to rotate
            :param cnt: the contour information
            
            *Optional Parameters*
            :param tolerance: the allowable tolerance (deviation from 0 degrees in radians)
            """
            d90=(90*math.pi)/180
            box=np.int0(cv2.boxPoints(cv2.minAreaRect(cnt)))
            x0,y0=box[0]
            x1,y1=box[1]
            print str(math.atan((y1-y0)/(x1-x0)))
            boxpheta=math.atan((y1-y0)/(x1-x0))
            image2=image
            print "BoxPheta: "+str(boxpheta)
            
            if abs(boxpheta-d90) > tolerance:
                    
                [vx,vy,x,y]=cv2.fitLine(cnt,cv2.DIST_L2,0,0.01,0.01)
                
                slope=-1*(vx/vy)
                
                #find the perpendicular slope to the given slope
                
                i=1
                
                if vx[0]==0 and vy[0]==1:
                    return image2
                else:
                    print "Slope Points: "+str(vx[0])+","+str(vy[0])
                    print "Slope: "+str(slope)
                    
                    Image.fromarray(image2).show()
                
                    pheta=math.atan(slope)
                    
                    print "Pheta: "+str(pheta)
                    print "Pheta (degrees)"+str((pheta*180)/math.pi)
                    i+=1
                    time.sleep(random.randint(2,4))
                    print "\n\n\n\n\n"
                if vx[0] >0:
                    image2=ndimage.rotate(image2,(pheta*180)/math.pi,mode='nearest',cval=100)
                    Image.fromarray(image2).show()
            return image2 
        
    

    If rotation is necessary, it may be possible to put the letters in a set and then attempt to read them individually rather than stitch an image back together. Since a white background is present, it is also possible to rotate the letters and place them back into the image. Please read the alignment section below to see how this is done.

    Alignment

    Aligning individual images is normally necessary, especially for systems considering shapes that use LDA or a similar AI task to learn different groupings. If a u is too far above a preceding letter like an S, the response from the system may be a quotation mark instead of a u.

    Using cv2, it is not difficult to align letters that are nearly good enough to run through OCR. OpenCV includes powerful contouring tools based on machine learning techniques. Contouring allows individual letters to be discerned, allowing even more techniques to be applied, such as the rotation matrix applied to the image matrix as described in the rotation section.

    The following code exemplifies the process of alignment. It does not, however, consider whether a letter has a stem.

       def alignLetters(self,img,maxBoxArea=None,minBoxArea=None,printBoxArea=False,imShow=False):
            """
            Align Letters in a Pre-Processed Image. Options are provided
            to limit the size of accepted bounding boxes, helping eliminate 
            non-contours and the usual box covering the image as a whole.
            
            Returns a PIL Image.
            
            *Required Parameters*
            
            :param img: numpy array image to use
            
            
            *Optional Parameters*
            
            :param maxBoxArea: maximum bounding box area to accept (None is a non check)
            :param minBoxArea: minimum bounding box area to accept (None is a non check)
            :param printBoxArea: boolean to state whether to print box areas (list trimmed if maxBoxArea or minBoxArea set)
            :param imShow: show the individual contoured images and final image
            """
            #setup image
            width=img.shape[1]
            height=img.shape[0]
            
            image,contours,hierarchy=self.getContours(img)
            
            idicts={}
            istarts=[]
            
            maxheight=0
            maxwidth=0
            numboxes=0
            
            #get the contours and bounding boxes and filter out bad bounding boxes
            for cnt in contours:
        
                x,y,w,h=cv2.boundingRect(cnt)
                
                #obtain only bounding boxes meeting a certain criteria
                if (maxBoxArea is None or w*h<maxBoxArea) and (minBoxArea is None or w*h>minBoxArea):
                    if printBoxArea is True: 
                        print str(w*h)
                    
                    numboxes+=1
                    
                    if h>maxheight:
                        maxheight=h
                    
                    if w>maxwidth:
                        maxwidth=w
                    
                    istarts.append(x)
                    
                    image2=img[y:y+h,x:x+w]
                    idicts[x]=image2
            
            istarts=list(sorted(istarts))
            endImg=Image.new('L',(width,height),'white')
            
            #write each box to a new image aligned at the bottom
            minHeight=sys.maxint
            maxWidth=0
            for x in istarts:
                imgx=idicts[x].shape[1]
                imgy=idicts[x].shape[0]
                img=Image.fromarray(idicts[x])
                
                if imShow is True:
                    img.show()
                    
                if x+imgx>maxWidth:
                    maxWidth=x+imgx
                    
         
                for i in range(x,x+imgx):
                    height=maxheight-1
                    iheight=imgy-1
                    miny=height-imgy
                    
                    #track the highest row written so the final image can be cropped
                    if miny<minHeight:
                        minHeight=miny
                    
                    #copy the letter column, aligned to the bottom of the new image
                    while iheight>=0:
                        endImg.putpixel((i,height), img.getpixel((i-x,iheight)))
                        height-=1
                        iheight-=1
            
            #trim
            endImg=endImg.crop((0,minHeight, maxWidth,maxheight))
            
            if imShow is True:
                endImg.show()
            
            return endImg
    

    Some Final Processing

    Your image should either be split up and stored in a set of characters or look fairly discernible by this point. If it is not, take the time to do some final pre-processing. Using pillow (PIL) to expand edges and eliminate disconnects is one important task for final processing. Try to make the image look like newsprint.
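
    As a rough sketch of what that final pass might look like with pillow (the filter choices, threshold, and file names are assumptions to adjust per Captcha style):

        from PIL import Image, ImageEnhance, ImageFilter

        img = Image.open('aligned.jpg').convert('L')      #hypothetical output of the alignment step
        img = img.filter(ImageFilter.MinFilter(3))        #thicken dark strokes to close small gaps
        img = img.filter(ImageFilter.SMOOTH)              #soften jagged edges
        img = ImageEnhance.Contrast(img).enhance(2)       #push toward black-on-white newsprint
        img = img.point(lambda p: 0 if p < 128 else 255)  #final binarization
        img.save('final.jpg')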

    Tesseract: The Easy Part

    Now, the image is ready for Tesseract. The command line tool can be run via the subprocess library against a temporary file, or a pipe can be established directly to Tesseract. The OCR splits the letters into key features, clusters on them, and then either compares the offsets of the characters it finds to a letter set or runs a covariance comparison. I will personally test the covariance approach in a later article and am building a massive training set.

    If you have a set of letters, be sure to use the -psm option and set it to 10. This tells Tesseract to perform single character analysis.
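
    A hedged sketch of the subprocess route is below; the file names are placeholders, and the single-dash -psm flag matches Tesseract 3.x (newer builds spell it --psm).

        import subprocess

        def ocr_single_char(image_path, outbase='letter'):
            #-psm 10 asks Tesseract to treat the image as a single character
            subprocess.call(['tesseract', image_path, outbase, '-psm', '10'])
            with open(outbase + '.txt') as f:
                return f.read().strip()

        print ocr_single_char('letter_0.png')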

    Running Tesseract is simple with pytesser.

      import pytesser
      from PIL import Image
      response=pytesser.image_to_string(Image.open('image.jpg').convert('L'),cleanup=True)
    

    Please. Do not Spam. I am not responsible for the misuse of this Code. This is purely educational!!!
    P.S.

    For shits and giggles, here is an LDA algorithm that can be used with a covariance matrix and your own really, really big set of letters from scikit-learn. Python really is excellent for AI, and much easier to use than Java if you want effectively unlimited matrix sizes, thanks to gensim, numpy, and scikit-learn. Dare I say numpy is even faster at matrix solving than almost all Java packages when not using a tool like Mahout, due to its use of BLAS and LAPACK when possible. That is in part why I used Tesseract. The other reason is that Google runs, or ran, a Captcha service while also supplying this really amazing OCR product.
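
    A minimal sketch of that idea with scikit-learn's covariance-based Linear Discriminant Analysis is below; the arrays are random stand-ins for a real labeled letter set, and older scikit-learn releases expose the same class as sklearn.lda.LDA.

        import numpy as np
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        #X: one flattened grayscale letter per row, y: the character it represents
        X = np.random.randint(0, 255, (500, 20 * 20)).astype(float)  #stand-in for real letter images
        y = np.random.choice(list('ABCDE'), 500)                     #stand-in labels

        clf = LinearDiscriminantAnalysis()
        clf.fit(X, y)
        print clf.predict(X[:5])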

    Morning Joe: Categorizing Text, Which Algorithms are Best in What Situations

    There is a ton of buzz around data mining and, as always, many new names being injected into the same topics, showing a lack of study. While the buzz exists, knowing when to deploy an algorithm can be tricky. Based on a recent deep dive into the subject and a dire need to program these algorithms in Java, I present a brief overview of tools with benchmarks and examples to hopefully follow later. Basic concepts are presented first, followed by some algorithms that I have not really honed yet (benchmarks are not feasible for now but will be soon).

    To read this thoroughly, I highly recommend following the links. This is a starting point and a summary of what I have found so far. Basically, the goal is to help avoid the hours it takes to find information on each algorithm and to save some stumbling by reading this document.

    A word of caution: effective matrix based solutions use a large amount of data. Other fuzzy algorithms exist for discovering relatability between small sets of items. For strings, there is distance matching such as Jaro-Winkler or Levenshtein, with rule based comparisons and lookup tables to minimize error (say between Adama and Obama). Statistics can enhance this process if there is a need to take the best rating. Train a model to test the hypothesis that the distance between two entities makes them the same, against the null hypothesis that it does not, after filtering out some common issues.
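
    For reference, a minimal pure-Python Levenshtein sketch (real systems usually pair it with Jaro-Winkler, rules, and lookup tables as described above):

        def levenshtein(a, b):
            """Count the inserts, deletes, and substitutions needed to turn a into b."""
            prev = range(len(b) + 1)
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1,                 #deletion
                                   cur[j - 1] + 1,              #insertion
                                   prev[j - 1] + (ca != cb)))   #substitution
                prev = cur
            return prev[-1]

        print levenshtein("Adama", "Obama")  #a small distance, so rules or lookups must break the tie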

    The Matrix

    Linear algebra is a critical foundation of text mining. When we have different documents or images, each document or image is often considered to form an equation. These equations can then be placed in a matrix, a simple and easy way to avoid thinking in overly complex terms.

    You may have heard of the vaunted differential equation. If not, some reading is in order from a site I used in college when there was no time for the book. A large portion of differential equations can be written in matrix form, which matters because of eigenvectors and eigenvalues. To be sure, these concepts are crucial for solving matrices to find equations that explain a set of models. Drexel University's eigenfaces tutorial provides a fairly solid understanding of the way a matrix is used in most text mining. However, similarity ratings are used to compare documents rather than a covariance matrix for most tasks.

    The end result of studying these methods is the ability to look under the hood at today’s hottest text mining technologies.

    The Vector Space and TF-IDF

    Understanding vectors and vector operations is another crucial step to understanding the mining process. A basic vector is a set of points representing a position in a plane or higher-dimensional space. Vectors can be added, subtracted, multiplied, and, most importantly, stuffed in a matrix where their points can be used with basic linear algebra to find relationships.

    Vectors have magnitude and direction. The angles and distances between them can be compared. Note that, while data loss may be present in finding the right magnitude and direction, the units used should be the same (it would be a terrible idea to think in terms of, say, millisecond-meter-document-ice cream cones); this provides a sane basis for considering the data. It is up to the miner to choose the most representative points for use in the vector.

    In text mining, the term frequency-inverse document frequency rating is used in many commercial algorithms, including search engines. If the name is not enough, it is basically the frequency of a term within a document scaled by how rare the term is across all documents. It works best on more than one document, and an offset of 0.5 for term frequency helps dampen the effect of large documents by bumping up the rating of less frequent terms. Inverse document frequency uses a logarithm so that terms appearing in nearly every document are pushed toward a weight of 0.

    Multiply the two parts together to find the result, as described by Wikipedia: tf(t, d) = 0.5 + 0.5 * f(t, d) / max{f(t', d) : t' in d}, idf(t, D) = log(N / |{d in D : t in d}|), and tf-idf(t, d, D) = tf(t, d) * idf(t, D).
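
    A small sketch of those two pieces in plain Python (the toy documents are placeholders):

        import math

        def tf(term, doc):
            """Double-normalized term frequency with the 0.5 offset."""
            freqs = {}
            for w in doc:
                freqs[w] = freqs.get(w, 0) + 1
            return 0.5 + 0.5 * freqs.get(term, 0) / float(max(freqs.values()))

        def idf(term, docs):
            """Log inverse document frequency."""
            matches = sum(1 for d in docs if term in d)
            return math.log(float(len(docs)) / max(matches, 1))

        docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "ran"]]
        print tf("cat", docs[0]) * idf("cat", docs)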

    Similarity Ratings

    No matter what you do, similarity ratings are the key to making the process work. There are several that can be used. If the data can be represented fairly well, covariance is an option. However, text data is not well suited to covariance. This is due to varying styles that represent the same object and, most importantly, issues with quantization. Natural language is naturally fuzzy. Therefore, the cosine usually offers a much better solution.

    The cosine similarity takes the dot product of two vectors and divides it by the product of their norms. It follows from vector algebra. The result is the cosine of the angle between the vectors, representing the 'degree' of similarity, which can be used for comparison.
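
    A minimal sketch of that calculation over two term-count vectors:

        import math

        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm_a = math.sqrt(sum(x * x for x in a))
            norm_b = math.sqrt(sum(y * y for y in b))
            return dot / (norm_a * norm_b)

        print cosine([1, 2, 0, 3], [1, 1, 1, 2])  #1.0 means identical direction, 0 means unrelated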

    Word Net, Disambiguation, and Stemming

    The processes of disambiguation and stemming are crucial to text mining. There are many sentence processing methods, as NLTK shows. At their core are WordNet and other dictionaries. WordNet is a freely available graph of an English dictionary. Most tools work with WordNet for finding root words, disambiguation, and cleaning.

    Part of speech (POS) tagging is involved in both disambiguation and stemming. Maximum entropy models are used to discover a part of speech based on common usage.

    Disambiguation attempts to resolve words with multiple meanings to their most probable meaning. The weakest algorithm is the original Lesk, which involves only the use of WordNet; accuracy hovers around 50 percent. Simplified Lesk achieves better results. Lesk finds overlapping words and frequencies to determine the best synonym to replace an ambiguous word. The better algorithms try to use clustering and Bayesian methods for word sense discovery. Cosine similarity may be used to improve Lesk as well.

    Stemming reduces words to their roots. Most WordNet tools use existing classifications with POS tagging to achieve this result.
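
    A hedged sketch of both steps with NLTK is below; it assumes the WordNet corpus has already been fetched with nltk.download(), and lesk() here is NLTK's simplified Lesk.

        from nltk.wsd import lesk
        from nltk.stem import WordNetLemmatizer

        sentence = "I went to the bank to deposit money".split()
        sense = lesk(sentence, "bank")          #picks the WordNet synset with the most overlap
        print sense, sense.definition()

        lemmatizer = WordNetLemmatizer()
        print lemmatizer.lemmatize("running", pos="v")  #reduces to the root "run"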

    A Note on Regression Models

    Let's be clear: prediction is not well suited to categorization. Changes in word choice across a large number of documents and decisions on importance do not always mean the same thing, so regression models tend to work poorly. The data is not likely continuous either. Think of writing like a magnetic field with eddy currents. Predicting the effect of an encounter with these currents is really, really difficult; basically, run into an eddy current and you are going to have a really bad day. That is not to say an equation cannot be created that fits most of the data with respect to the location of a point, basically a differential equation, but it will likely not be generic and will be incredibly difficult to find.

    Regression works well on continuous and more natural events.

    Classification Tree and a Random Forest

    Another often poor performer in categorizing text data is the classification tree. Trees are only as good as the number of rules you are willing to create. However, they may be combined with multinomial Bayes for writing that is uniform and professional (say a legal document) to achieve some success. They are particularly useful after filtering data using LSA/HDP or multinomial Bayes, with decisions that work like a Bayesian model when thinking about the bigger picture.

    Basically, a classification tree uses probabilities within groupings to ascertain an outcome, moving down to the appropriate left or right child node based on a yes or no answer to the question 'do you belong?'

    This process works well with defined data when there is a good degree of knowledge about a subject (say gene mapping), but text mining often uses fuzzy data with multiple possible meanings, and disambiguation is not entirely accurate; Lesk's original algorithm only achieved 50 percent accuracy, while an LSA model hovers between 80-90%. Improving the quality can be done with multiple trees or possibly by training on an extremely large set using cosines instead of raw frequency.

    There are multiple methods for tree creation; two are random forests and bagging. Bagging takes multiple trees and averages the probabilities for decisions, using this average for their respective nodes. Random forests take random subsets of features, find probabilities based on them, and select the stronger predictor for a node. The latter approach is best with a much larger set of known features. The number of features sampled at each split is usually the square root of the total number of features.

    Again, the features must be known and fairly homogeneous. Text data is often not.

    Multinomial Bayesian Classifier

    Multinomial Bayesian classification classifies data based on the frequency of words in different categories and their probabilities of occurrence. It is fairly straightforward: find the frequencies, or train a set of frequencies, word by word or gram by gram (a gram being an n-long pairing of words, hence n-gram), find probabilities by sentence, and take the best one.

    MNB works well when writing differs starkly, say with subject matters that differ greatly. It is good for tasks such as separating spam from policies and code in html data when large amounts of training data are present.
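
    A minimal sketch with scikit-learn; the toy training strings stand in for a real labeled corpus:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB

        train = ["click here to win a free prize", "meeting notes for the policy review",
                 "win money now free offer", "code review and deployment policy"]
        labels = ["spam", "policy", "spam", "policy"]

        vectorizer = CountVectorizer(ngram_range=(1, 2))   #unigrams and bigrams (n-grams)
        X = vectorizer.fit_transform(train)

        clf = MultinomialNB()
        clf.fit(X, labels)
        print clf.predict(vectorizer.transform(["free prize money"]))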

    Clustering with LSA or HDP

    Clustering works well when something is known about the data but categorization is not easily done manually. Most algorithms avoid affinity propagation, which usually lands on roughly the square root of the total inputs as the number of clusters anyway. Matrices are used heavily here, as eigenvalues and eigenvectors derive an equation that can be used to find relatability between documents.

    LSA uses raw frequencies, or more effectively cosines, in the same manner as eigenfaces to compare vectors. The end result, however, is an equation representing a category. By matrix inversion and multiplication, all elements are compared. In this case each ij entry in the matrix is a cosine or the frequency of a word in a document. HDP (hierarchical Dirichlet process) is similar but attempts to learn more from the results and improve on the process. It takes much longer than LSA and is experimental.
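
    A hedged gensim sketch of the LSA route (HdpModel can be swapped in for the experimental hierarchical Dirichlet process, and the tiny documents are placeholders):

        from gensim import corpora, models

        texts = [["cat", "sat", "mat"], ["dog", "ran", "park"], ["cat", "dog", "pets"]]
        dictionary = corpora.Dictionary(texts)
        corpus = [dictionary.doc2bow(t) for t in texts]

        tfidf = models.TfidfModel(corpus)                  #weight the raw frequencies
        lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)
        print lsi.print_topics()
        #hdp = models.HdpModel(corpus, id2word=dictionary) #slower; tries to learn the topic count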

    If trying to discover new information about text or trying to find the best fit of a number of categories, these methods are useful.

    Maximum Entropy Models

    Maximum entropy models work well on heterogeneous data in a manner similar to Bayes. Gensim's sentence tagger classifies sentences from non-sentences in this way. The models find entropy using the maxent principle, which uses frequencies and likelihood of occurrence to find outcomes. It works quite well with the correct training sets.

    If conditional independence is not assumable and nothing is known about a set, this model is useful. Categories should be known beforehand.

    Tools

    Java

    Python

    Common Resources

    A New Project: Is a Distance Based Regular Expression Method Feasible?

    So, I would like to find specific data from scraped web pages, PDFs, and just about anything under the sun without taking a lot of time. After looking over various fuzzy matching algorithms such as Jaro-Winkler, Metaphone, and Levenshtein, and finding that no single one had an incredibly wide application, I decided that developing a regular expression based distance algorithm might be more feasible.

    The idea is simple: start with a regular expression, build a probability distribution across one or more good and known data sets, and test for the appropriate expression across every web page. The best score across multiple columns would be the clear winner.

    Building out the expression would include taking known good data and finding a combination between the base pattern and the data that works, or building an entirely new one. Patterns that appear across a large proportion of the set should be combined. If [A-Z]+[\s]+[0-9]+[A-Z] and [A-Z]+[\s]+[0-9]+ appear often in the same or an equivalent place, or even [a-z]+[\s]+[0-9]+, then the result should likely be [A-Z\s0-9a-z]+, provided the set is similarly structured. Since the goal is to save time in programming regular expressions to further parse XPath or other regular expression results, this is useful.
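
    A rough sketch of the scoring idea (the candidate patterns and the known-good sample are illustrative only):

        import re

        candidates = [r"[A-Z]+[\s]+[0-9]+[A-Z]", r"[A-Z]+[\s]+[0-9]+", r"[A-Z\s0-9a-z]+"]
        known_good = ["ABC 123B", "ROUTE 66", "Main 4 st"]

        def score(pattern, samples):
            """Fraction of known-good strings the pattern matches."""
            return sum(1 for s in samples if re.search(pattern, s)) / float(len(samples))

        scores = {p: score(p, known_good) for p in candidates}
        print max(scores, key=scores.get), scores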

    The tricky part of the project will be designing a similarity score that adequately equates the expressions without too many outliers. Whether this is done with a simple difference test resulting in a statistical distribution or a straightforward score needs to be tested.

    In all likelihood, re-occurring words should be used to break ties or bolster weak scores.

    The new project will hopefully be available on Source Forge for data integration and pre-curation processes.

    Avalanche Data Part I: Collecting and Analyzing Public Information for Patterns

    It has been a goal of mine for a while to collect and analyze publicly available avalanche data to discover patterns and raise awareness. My home state of Colorado is a disastrous combination of climate, tourists, newcomers, and testosterone junkies of varying IQ who give little to no thought before jumping onto our Continental slopes. The result can be 20 fatalities each winter season. While the availability of public information is appalling, I did manage to wrangle together a large database with hundreds of incidents, millions of weather records, and a variety of locations across many different states.

    As of today, this data is available via POST request or by visiting my website; be kind, as I will ban people bogging down my small amount of resources and may even retaliate. Use Wireshark or Firebug to decipher the request format. The port will hopefully go away once I set up Apache; port forwarding is not allowed by my hosting service, and I needed a bizarre install of Tomcat, missing the configuration file, with authbind.

    My goals for this project were simple: establish an automated network for data collection, ETL, and the combination of the data into a relational database. That database is then analyzed using a set of open source tools and custom code, with Apache Commons Math for clustering and some statistical analysis.

    Attributes I Needed and What I Found

    I wished for prediction, so I needed everything from crystal type to weather patterns. Avalanche type, crown, base layer type, depth, path size, destructive force, terrain traps, and a variety of other factors are important. Regression testing on what I did receive showed previous storms, terrain traps, and the base layer to be the most important factors for size and destructive force.

    However, this data was dirty, not cleanable without great expense, and somewhat unreliable. Only two sites reliably reported crown depth, width, and even base layer. Agencies are likely not forthcoming with this information since it relates directly to sales revenue.

    Only the weather data, which can be acquired from many government sites, was forthcoming.

    Web Framework

    I decided to use Tomcat as the servlet container to deploy my WAR; this is only my second attempt at Spring. Tomcat is incredibly fast, as evidenced by my site. Spring is a decent framework for handling requests, requiring much less code when set up than most other frameworks. In particular, the request handling is of significant benefit. POST requests are handled with GSON.

    Below is a basic request mapping:

            @RequestMapping(value = "/", method = RequestMethod.GET)
    	public String home(Locale locale, Model model) {
    		//The Request Mapping
    		ServletRequestAttributes requestAttributes = ((ServletRequestAttributes) RequestContextHolder.currentRequestAttributes());
    		String ip=requestAttributes.getRequest().getRemoteAddr();
    		
    		//my ip should be all zeros for local host at the moment and so I need to change it 
    		if(ip.contains("0:0:0:0:0:0:0:1")||ip.contains("127.0.0.1")){
    			//ip detection
    		}
    		
    		
    	
    		ClassPathXmlApplicationContext ctx=new ClassPathXmlApplicationContext("avy.xml");
    		
    		GeoDBPath gpath=(GeoDBPath) ctx.getBean("GPS");
    		
    		GeoServicesDB geodb=GeoServicesDB.getInstance(gpath.getFpath());
    		ArrayList<Double> coords=geodb.getGPS(ip);
    		
    		double latitude=coords.get(0);
    		double longitude=coords.get(1);
    		
    		AvyDAO avyobjs=(AvyDAO) ctx.getBean("AvyDAO");
    				
    		
    		List<Avalanche> avs=avyobjs.findAll(latitude,longitude);
    		String outhtml=""; 
                    //table head 
                      int i=0; 
                      for(Avalanche av: avs){ 
                        if((i%2)==0)
                       { 
                         //my table 
                        } else{ 
                         //my table 
                        } i++; 
                   } 
                   //end table 
                   model.addAttribute("avyTable",outhtml.replaceAll("null","")); return "home"; 
              }
    

    The Tools That Worked

    Standard deviations, elementary statistics, and other basics can be handled with custom code. Fast clustering algorithms and more complex math that can be made more efficient are completed well with Apache Commons Math.

    Clustering is of particular interest. Commons Math does not have affinity propagation but does have a quick k-means clusterer, a downer when wanting to discover patterns without known relations. However, the relations can be estimated using sqrt(n/2) centroids, which is the number affinity propagation often chooses. With this method, it is possible to obtain decent relations in the time taken to process a simple POST request.

    The Collection Process

    Data collection resulted in an automated set of interrelated scrapers, ETL processes, and triggers. Scrapers were set up for nearly every reporting site available, which meant the American Northwest, Alaska, California, and British Columbia were the only regions available for collection. The Colorado Avalanche Information Center and Utah's avalanche center were the best in terms of data, with Utah providing a wide range of attributes. This data was fed to weather collecting scrapers and finally to an ETL process. I wrapped the entire process in a Spring program and configuration file.

    The Final Product and v2.0

    My final product is a site that delivers reports on incidents, weather, and other key factors, as well as the opportunity to cluster what little complete data there is in your region. A heat map and a Google map show the incident locations. I will hopefully include relevant date filters and eventually graphs and predictions as the data grows stronger and more numerous. Weather is currently averaged from the two weeks before an avalanche event, though this could grow to accommodate historical patterns. Weather data may be the only solidly available data at the present time and so will likely happen sooner than predictive features.

    A Plea for Better Data From Avalanche Centers and Why No Predictions are Available

    In the end, I was appalled by the lack of data. People die because they know nothing of the conditions generating avalanches. I myself have felt the force of a concussion wave rippling my clothes from half a mile away. This must end. Selling data should not take precedence over safety. My site suffers at the moment from poor reporting, a lack of publicly available clean data, and simple mis-reportings not caught in ETL. My data set actually shrank in cleaning from 1000 to 300 avalanches across the entire Northwest.

    Still, weather data was incredibly public. The Natural Resources Conservation Service, which sells data, is a powerful tool when placed in combination with the National Oceanic and Atmospheric Administration and Air Force weather station data sets.

    Overall, I can only provide public clustering because of this poor data. Over time, this may change as clusters become more distinct and patterns and predictions more powerful. However, I would feel personally responsible for someone's untimely end at this point. I have tried running multiple regression on this topic before, but the results were dismal. While better than 2011, the data sets still need improvement.

    The Future

    I have no intention of stopping collection and will document my development work on the project here. I also plan to document any attempts to develop a device that uses the data it collects and my weather and/or other data to make predictions on the snowpack.

    Could Watson Be Used to Dynamically Create Databases?

    Data collection, cleaning, and presentation are a pain, especially when dealing with a multitude of sources. When APIs aren't available and every step is taken to keep people from getting data, it can be incredibly tedious just to get it. Parsing in this instance, of course, can be made easier by relating terms in a dictionary and using the document's structure to your advantage. At worst it is just a few lines of regex or several XPath expressions and more cleaning with Pentaho. I've gone a bit further by enforcing key constraints and naming conventions with the help of Java Spring.

    It seems that IBM is making this process a little less time consuming with Watson. Watson appears to have the capacity to find patterns and relations with minimal effort from a CSV or other structured file.

    This could really benefit database design by applying a computer to finding and illuminating the patterns driving key creation and normalization. After all, I would love to be able to focus less on key creation in a maximum volume industry and more on pumping scripts into my automated controllers. The less work and more productivity per person, the more profit.

    Morning Joe: Can Computer Vision Technology Help De-Militarize the Police and Provide Assistance?

    There has been an explosion of computer vision technology in the past few years, or even the last decade or so considering OpenCV has been around that long. The recent events in Ferguson have created a need for keeping the police in line, as well as the need to present credible evidence regarding certain situations.

    Many police departments are starting to test programs that place snake cams like those used in the military on officers. While this could be viewed as more militarization, it also can present departments with a black eye if power is abused.

    What if the lawyers, police, and ethics commissions could have a way of recognizing potentially dangerous situations before they happen? What if there was a light weight solution that allowed data programs to monitor situations in real or near real time, spot troublesome incidents, and provide alerts when situations were likely to get out of hand? What if potentially unethical situations could be flagged?

    The answer is that this is possible without too much development already.

    Statistical patterns can be used to predict behaviour long before anything happens. Microsoft and Facebook can accurately predict what you will be doing a year from now. The sad state of the current near police state is that the government has as much or more data on officers and citizens than Microsoft and Facebook.

    These patterns can be used to narrow the video from those snake cams to potentially harmful situations for real time monitoring.

    From there, a plethora of strong open source tools can be used to spot everything from weapons and the potential use of force, using the training capabilities of OpenCV and some basic kinematics (video is just a bunch of quickly taken photos played in order), to speech, using Sphinx4 (a work in progress for Captchas but probably not for clear speech), and even optical character recognition with pytesser. A bit of image pre-processing and OCR in Tesseract can already break nearly every Captcha on the market in under one second with a single core and less than 2 GB of RAM. The same goes for using corner detection and OCR on a PDF table. Why can't it be used in this situation?

    The result in this case should be a more ethical police force and better safety to calm the fears of officers and civilians alike.

    Call me crazy but we can go deeper than just using snake cams on officers to police officers and provide assistance.  Quantum computing and/or better processors and graphics cards will only make this more of a reality.

    Morning Joe/Python PDF Part 3: Straight Optical Character Recognition

    *Due to time constraints, I will be publishing large articles on the weekends with a daily small article for the time being.

    Now, we start to delve into PDF images, since the PDF text processing articles are quite popular. Not everything in a PDF can be stripped using straight text conversion, and the biggest headache is the PDF image. Luckily, our do "no evil" (heavy emphasis here) friends came up with Tesseract, which, with training, is also quite good at breaking their own paid Captcha products, to my great amusement and my company's profit.

    A plethora of image pre-processing libraries and a bit of post-processing are still necessary when completing this task. Images must be of high enough contrast and large enough to make sense of. Basically, the algorithm consists of pre-processing an image, saving it, running optical character recognition, and then performing clean-up tasks.

    Saving Images Using Software and by Finding Stream Objects

    For Linux users, saving images from a PDF is best done with poppler-utils, which comes with Fedora, CentOS, and Ubuntu distributions and saves images to a specified directory. The command format is pdfimages [options] [pdf file path] [image root]. Options are included for specifying a starting page [-f int], an ending page [-l int], and more. Just type pdfimages into a Linux terminal to see all of the options.

    
    pdfimages -j /path/to/file.pdf /image/root/
    
    

    To see whether there are images, just type pdfimages -list [pdf file path].

    Windows users can use a similar command with the open source Xpdf tools.

    It is also possible to use the magic numbers I wrote about in a different article to find the images while iterating across the PDF stream objects, finding the starting and ending bytes of an image, and writing them to a file with open().write(). A stream object is the way Adobe embeds objects in a PDF and is represented below. The find command can be used to ensure they exist, and the regular expression re.finditer("(?mis)(?<=stream).*?(?=endstream)",pdf) will find all of the streams.

    
    stream
    
    ....our gibberish looking bytes....
    
    endstream
    
    
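    A hedged sketch of that byte-level approach is below; it only handles streams that hold raw JPEG data (marked by the 0xFFD8FF and 0xFFD9 magic numbers), and the file paths are placeholders. Flate-encoded images will not be caught this way, which is why pdfimages is usually the safer route.

    import re

    with open('/path/to/file.pdf', 'rb') as f:
        pdf = f.read()

    for i, m in enumerate(re.finditer(r"(?s)(?<=stream).*?(?=endstream)", pdf)):
        chunk = m.group(0)
        start = chunk.find('\xff\xd8\xff')   #JPEG start-of-image marker
        end = chunk.rfind('\xff\xd9')        #JPEG end-of-image marker
        if start != -1 and end != -1:
            with open('image_%d.jpg' % i, 'wb') as out:
                out.write(chunk[start:end + 2])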

    Pre-Processing

    Python offers a variety of extremely good tools via pillow that eliminate the need for hard-coded pre-processing as can be found with my image tools for Java.

    Some of the features that pillow includes are:

    1. Blurring
    2. Contrast
    3. Edge Enhancement
    4. Smoothing

    These classes should work for most pdfs. For more, I will be posting a decluttering algorithm in a Captcha Breaking post soon.

    For resizing, OpenCV includes a module that avoids pixelation with a bit of math magic.

    
    #! /usr/bin/python
    
    import cv2
    
    im=cv2.imread("impath")
    
    im=cv2.resize(im,(im.shape[1]*2,im.shape[0]*2))  #cv2.resize expects (width,height); shape is (height,width)
    
    

    OCR with Tesseract

    With a subprocess call or the use of pytesser (which includes faster support for Tesseract by implementing a subprocess call and ensuring reliability), it is possible to OCR the document.

    
    #! /usr/bin/python
    
    from PIL import Image
    
    import pytesser
    
    im=Image.open("fpath")
    
    string=pytesser.image_to_string(im)
    
    

    If the string comes out as complete garbage, try using the pre-processing modules to fix it or look at my image tools for ways to write custom algorithms.

    Post-Processing

    Unfortunately, Tesseract is not a commercial grade product for performing PDF OCR. It will require post processing. Fortunately, Python provides a terrific set of modules and capabilities for dealing with data quickly and effectively.

    The regular expression module re, list comprehension, and substrings are useful in this case.

    An example of post-processing would be (in continuance of the previous section):

    
    import re
    
    lines=string.split("\n")
    
    #drop lines containing unwanted text
    lines=[x for x in lines if "bad stuff" not in x]
    
    results=[]
    
    #keep matching lines, substituting known OCR mistakes
    for line in lines:
        if re.search("pattern ",line):
            results.append(re.sub("bad pattern","replacement",line))
    
    

    Conclusion

    It is definitely possible to obtain text from a PDF using Tesseract, though post-processing is a requirement. For documents that are proving difficult, commercial software is the best solution, with SimpleOCR and ABBYY FineReader offering quality options. ABBYY offers the best quality in my opinion but comes at a high price, with a quote for an API for just reading PDF documents coming in at $5,000. I have managed to use the Tesseract approach successfully at work, but the process is time consuming and not guaranteed.