Python Unicode to UTF-8 Replacement Dictionary

I recently found an increasing need to replace Unicode characters with their ASCII/English equivalents. This is in response to the use of the ISO 8859-1 character set in HTML. Below is my dictionary for doing so and a code snippet for using it. Note that the keys are handed to re.sub, so alternations with | are allowed.

{"\x2013":"-","\x2014":"--","\x2018":"'","\x2019":"'","\x201A":",","\x201D":"~","\x2022":"*","\x2026":"...","\x2030":"%","\x2032":"'","\x2033":"`","\x2039":"","\x203E":"--","\x2044":"/","\x20AC":" euro ","\x2111":"i","\x2118":"P","\x2122":" TM ","\x2135":" alef ","\x2190":"","\x2193":" down-arrow ","\x2194":"","\x21B5":" crarr ","\x21D0":"","\x21D4":"","\x2200":"ALL","\x2202":" part ","\x2203":"EVERY","\x2205":"empty-set","\x2207":"nabla","\x2208":"isin","\x2209":"notin","\x2217":"*","\x221A":"sqrt","\x2329":"","\x25CA":" loz ","\x2660":"spades","\x2663":"clubs","\x2665":"hearts","\x2666":"diamonds","\x200C":" zwnj ","\x200D":" zwj ","\x200E":" lrm ","\x200F":" rlm ","\x27":"'","\xc2|\xA0|\x2002|\x2003|\x2009":" ","\x3E":">","\x3C":"> ","\xBC":"1/4","\xBD":"1/2","\xBE":"1/4","\xBE":"3/4","\xBF":" iquest "}

A larger dictionary, along with handling for entities represented by multiple bytes, is provided below in an updated version of this function.

The code:

    def encodeHTML(self,html,foreignKeys=None,replaceNonPrintable=False,multiEntities={"\xe2\x81\x91":"**","\xe2\x81\x95":"*","\xe2\x81\x97":'""',"\xe2\x81\xa0|\xe2\x80\x8b|\xe2\x80\x8c|\xe2\x80\x8d|\xe2\x80\x8e|\xe2\x80\x8f":"","\xe2\x80\x86|\xe2\x80\x87":"   ","\xe2\x80\x84|\xe2\x80\x85|\xe2\x80\x88":"  ","\xe2\x80\x8a|\xe2\x80\x89|\xe2\x80\x80|\xe2\x80\x81|\xe2\x80\x82|\xe2\x80\x83":" ","\xe2\x80\x93|\xe2\x80\x92|\xe2\x80\x91|\xe2\x80\x90":"-","\xe2\x80\x96":"||","\xe2\x80\x95|\xe2\x80\x94":"--","\xe2\x81\x87":"??","\xe2\x81\x88":"?!","\xe2\x81\x89":"!?","\xe2\x81\x9d|\xe2\x81\x9e":":","\xe2\x81\x92":"-","\xe2\x81\x8b":" PILCROW ","\xe2\x80\xbc":"!!","\xe2\x80\xba":">","\xe2\x80\xb9":"<","\xe2\x80\xb8":"^","\xe2\x80\xb1":"%000","\xe2\x80\xb0":"%0","\xe2\x80\xa4|\xe2\x80\xa7":".","\xe2\x80\xa5":"..","\u2013":"-","\u2014":"--","\u2018":"'","\u2019":"'","\u201A":",","\u201D":"~","\u2022|\xe2\x80\xa3|\xe2\x80\xa2":"*"},replaceEntities={"\u2026":"...","\u2030":"%","\u2032":"'","\u2033":"`","\u2039":"","\u203E":"--","\u2044":"/","\u20AC":" euro ","\u2111":"i","\u2118":"P","\u2122":" TM ","\u2135":" alef ","\u2190":"","\u2193":" down-arrow ","\u2194":"","\u21B5":" crarr ","\u21D0":"","\u21D4":"","\u2200":"ALL","\u2202":" part ","\u2203":"EVERY","\u2205":"empty-set","\u2207":"nabla","\u2208":"isin","\u2209":"notin","\u2217":"*","\u221A":"sqrt","\u2329":"","\u25CA":" loz ","\u2660":"spades","\u2663":"clubs","\u2665":"hearts","\u2666":"diamonds","\u200C":" zwnj ","\u200D":" zwj ","\u200E":" lrm ","\u200F":" rlm ","\x27":"'","\xc2|\xA0|\u2002|\u2003|\u2009":" ","\x3E":">","\x3C":"<","\xBC":"1/4","\xBD":"1/2","\xBE":"3/4","\xBF":" iquest "}):
        '''
        Encode HTML by replacing Unicode entities with printable ASCII equivalents.
        
        *Required Parameters:
        
        :param html: html to run replacements on
        
        *Optional Parameters*
        :param multiEntities: entities represented by multiple unicode/byte sequences
        
        :param replaceNonPrintable: replace non-printable characters after all other sets and encodings complete
        
        :param foreignKeys: dictionary mapping keys not in replaceEntities, such as foreign letters (e.g. {"\xF8":"[oslash]"})
        
        :param replaceEntities: a dictionary of entities to replace, such as copyright symbols, micro, etc., that may be in ISO or another format converted to unicode hex (see non-Latin characters at https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references)
         
        '''
        if multiEntities is not None:
            for k in multiEntities.keys():
                html=re.sub(k,multiEntities[k],html)
                
        if replaceEntities is not None:
            for k in replaceEntities.keys():
                html=re.sub(k,replaceEntities[k],html)
        
        #round-trip through BeautifulSoup and unescape any remaining HTML entities
        html=HTMLParser.HTMLParser().unescape(Soup(Soup(html).encode()).prettify())
        
        
        if replaceNonPrintable is True:
            import string
            html=filter(lambda x: x in string.printable,html)    
        
        if foreignKeys is not None:
            for k in foreignKeys.keys():
                html=re.sub(k,foreignKeys[k],html)
        
        return html
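For completeness, a minimal usage sketch. The surrounding class name (HtmlEncoder) is hypothetical, and the method assumes re, HTMLParser, and BeautifulSoup (imported as Soup) are available in its module:

    # -*- coding: utf-8 -*-
    import re
    import HTMLParser
    from bs4 import BeautifulSoup as Soup  #or BeautifulSoup 3, depending on the install

    encoder = HtmlEncoder()  #hypothetical class holding the encodeHTML method above
    cleaned = encoder.encodeHTML("\xe2\x80\x98Smart quotes\xe2\x80\x99 \xe2\x80\x93 a dash",
                                 foreignKeys={"\xF8": "[oslash]"})
    print(cleaned)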

Python PDF 3: Writing With HTML and XML

At last, I have discovered the potent mixture of Jinja, WeasyPrint, and Pandas. Mixing these tools with matplotlib and the Python imaging modules yields a way to write PDF documents with relative ease and with the styling help of HTML. It is also possible to use a tool like xmltopdf to generate PDF files from XML. Previous posts dealt with this using a more complicated tool, PyPDF2.

A Basic HTML Template

In this tutorial, I am using jinja to create tables. My tables will not have much in the way of styling but it is also possible to add styles with jinja or by using a tool such as Django-Tables2. Both tools are incredibly similar to the Django platform.

A template is needed in order to generate HTML pages for conversion to PDF format. Jinja follows a basic format in which double curly braces mark where items are entered, enclosing the name of the property.


<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="UTF-8">
<title>{{ title }}</title>
</head>
<body>
<div>
<h1>Weekly Summary Report</h1>
{{ summary_pivot_table }}
</div>

<div>
<h1>Frequency Report</h1>
{{ frequency_table }}
</div>

<div>
<h1>Weekly Source Reports</h1>
{{ source_pivot_table }}
</div>
</body>
</html>

In this case, there is a title and three reports. It would be easy to add CSS tags and generate different styles using the division tags. These will be converted by weasyprint later.

Writing to the Template

Writing to a template with Jinja requires using the dictionary data structure.

adminVars={"title":"Weekly Statistics","frequency_table":freqFrame.to_html(),"summary_pivot_table":sframe.to_html(),"source_pivot_table":tframe.to_html()}

Generating Data

Generating data is simple with Pandas. This is especially true with databases. One only needs to connect to a database using a SQLAlchemy engine and perform any necessary query. It is also possible to concatenate as many queries as necessary to generate a table.

import sqlalchemy
import pandas

#create alchemy engine      

dsn='postgresql+psycopg2://'+cfp.getVar("db","user","string")+":"+cfp.getVar("db","passw","string")+"@"+cfp.getVar("db","host","string")+":"+cfp.getVar("db","port","string")+"/"+cfp.getVar("db","dbname","string")

engine=sqlalchemy.create_engine(dsn)
        
#get totals table
query=cfp.getVar("sql","weekly_totals_complete","string")
tframe=pandas.read_sql_query(query,engine)

Concatenation is not difficult either using the concat function.

pandas.concat([pandas.read_sql_query(query,engine) for query in tables])

New columns will be generated with NaN values.

Performing Basic Operations on Dataframes

Performing operations on dataframes is easy with numpy or scipy.

import numpy

#operate on tframe from above
tframe.apply(numpy.average,axis=0)

DataFrames themselves also expose operations, many of which use numpy under the hood.

#tframe from above
tframe.mean()

A list of operations is provided in the Pandas documentation.

More complicated operations may require unpacking the values or using generator functions.

Using WeasyPrint

Once the resources and template are prepared, simply call on WeasyPrint to convert the HTML resulting from the template to a PDF.

An extra import is needed to fetch resources such as images from links embedded within the url.

Otherwise, generate a Pandas DataFrame, convert the frame to HTML, place it as the value attached to the appropriate template key in your dictionary, and then convert. The example code uses SQLAlchemy to fetch resources from a PostgreSQL database.

from crawleraids.ConfigVars import Config
from jinja2 import Environment,FileSystemLoader
import pandas
from weasyprint import HTML,default_url_fetcher
import sqlalchemy

def fetchURL(url):
    '''
    Provide a resource obtainer for getting urls to WeasyPrint
    '''
    return default_url_fetcher(url)


def generatePDF(fpath):
    '''
    Generate the pdf.
    '''
    #create alchemy engine
    cfp=Config(fpath)

    dsn='postgresql+psycopg2://'+cfp.getVar("db","user","string")+":"+cfp.getVar("db","passw","string")+"@"+cfp.getVar("db","host","string")+":"+cfp.getVar("db","port","string")+"/"+cfp.getVar("db","dbname","string")
    engine=sqlalchemy.create_engine(dsn)

    #get totals table
    query=cfp.getVar("sql","weekly_totals_complete","string")
    tframe=pandas.read_sql_query(query,engine)

    #get summary stats table
    query=cfp.getVar("sql","weekly_summary_complete","string")
    sframe=pandas.read_sql_query(query,engine)

    #get frequencies
    query=cfp.getVar("sql","weekly_frequency","string")
    freqFrame=pandas.read_sql_query(query,engine)

    #get the resource loader
    env=Environment(loader=FileSystemLoader('.'))
    template=env.get_template("aboveTemplate.html")

    #fill the template
    templateVars={"title":"Weekly Statistics","frequency_table":freqFrame.to_html(),"summary_pivot_table":sframe.to_html(),"source_pivot_table":tframe.to_html()}
    html=template.render(templateVars)

    #use WeasyPrint to convert to pdf; pass the url fetcher so embedded resources resolve
    HTML(string=html,url_fetcher=fetchURL).write_pdf(target="/reports/report.pdf",stylesheets=[cfp.getVar("style","aboveTemplate","string")])

All of the power of Pandas is now at the disposal of the programmer along with anything that can be embedded in a url.

Generating and Saving Graphs with Pandas and Matplotlib or PyPlot

It is possible to embed graphs into a pdf by saving them as images.

Obviously, the PDF specification allows most things made of bytes to be placed in objects in a PDF (a PDF is a series of objects, much like XML, with encoded byte strings as the content). See my magic numbers post and try to parse or write your own image into a PDF if you really want to dive into the subject.

Generating graphs is simple with Pandas. Just make sure the image path saved here matches the image URL referenced in the template.

import matplotlib.pyplot as plt
import pandas

#each key is a new plot line; shorter lines are padded with None so the columns align
data={'PlotA':[1,3,5,2],'PlotB':[1,3,4,None]}
df=pandas.DataFrame(data)
ax=df.plot()
fig=ax.get_figure()
fig.savefig('graph.png') #also can save as its own pdf to be merged as described in an earlier post

It is possible to do this directly with pyplot as well.

Using Flask to Create a PDF Web Server

It appears from comments and questions that PDF servers are often requested. The clunkiness of Spring can now be replaced easily with the combination of the tools mentioned above and the Flask web framework, allowing for the quick and easy creation of a PDF web server. However, asyncore with socket, Spring with Java-based tools, or similar will still be needed if the plan is to use something akin to the proxy pattern, a sad state of affairs.

To create the server, simply create a method with an annotation specifying the path, much as would happen in Spring.

from flask import Flask
try:
   from cStringIO import StringIO
except ImportError:
   from StringIO import StringIO

from flask import send_file
app = Flask(__name__)

def otherFunc():
    pass

@app.route("/")
def generatePDF():
    #code to generate the PDF bytes into the variable pdf........
    buf=StringIO(pdf)
    return send_file(buf, attachment_filename='file.pdf')

if __name__ == "__main__":
    app.run()

Weasyprint also includes a way to incorporate pre-generated pdfs from within the same application.

Shooting the Gap Statistic with Map Reduce

Finding the right k in KMeans, or the right n in LSA or LSI, or whichever grouping algorithm you use, is fuzzy, especially with text. However, Tibshirani, Walther, and Hastie developed a superb algorithm to deal with this, much better than finding an inflection point in your data, as it handles abnormalities and ensures a proper distribution.

An easier understanding of the algorithm can be found through a blog post from the Data Science Lab.
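For reference, the gap statistic compares the log of the pooled within-cluster dispersion on the real data against the same quantity averaged over B uniformly sampled reference sets, which is what the Wks, Wkbs, and sk lists in the code below accumulate:

    W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r
    \mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W^{*}_{kb} - \log W_k
    s_k = \mathrm{sd}_k \sqrt{1 + 1/B}

The chosen k is then the smallest one satisfying Gap(k) >= Gap(k+1) - s_{k+1}.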

This algorithm is intensive but, luckily, map reduce exists. Splitting the procedure among a large number of machines is most helpful. Python and Scala offer map reduce and concurrency frameworks that can handle fairly significant amounts of data either directly or via a framework such as Celery.

Scala’s callback mechanism is a personal favorite of mine, although the tools for experimentation are not as readily available as they are with Python. Tools such as gensim and scikit-learn are readily available and easy to test in Python.

Using map reduce for this algorithm is almost a necessity for anything more than a small data set, especially for text data. However, unlike the Pham et al. algorithm, Tibshirani does not rely on the previous output at all. In this way it can be spread over a large number of machines to obtain a comparable result (use Pham for smaller datasets, such as fewer than 30,000 text documents, that fit on a single machine).

The following is an example of the fastest way I ran this algorithm using Python. Whereas a single-threaded version was running 20 tests in 10 minutes, the following tests were running at 10.4 tests per minute on a 3.5 GHz Core i7 with 15 GB of RAM.

The proper tradeoffs must be made between speed and memory.

Firstly, Python offers a multiprocessing version of map. The map function is applied using a multiprocessing pool which is easy to establish.


from multiprocessing import Pool
import psutil

pool=Pool(psutil.cpu_count(logical=True))

You will need to fiddle with the number of processes handed to Pool; the logical CPU count may not be the best choice for your data set.

This pool can be used (please read the Data Science Lab article first) to find Wk, run multiple Kmeans and split up among other processes.

Since I am text mining, my Wk and distance functions become the following. I will not be releasing my text-based KMeans here, sorry, but note that it involves using a special function for finding clusters based on the dissimilarity rating.

  
#assumed imports/helpers for these functions: scipy, numpy, gc,
#a cosine() distance, and the author's angulareDistance() (not shown)
def Wk(X):
    if X.shape[0]>1:
        try:
            mul=float(1/float(2*X.shape[0]))*float(2*X.shape[0])
            mvect=scipy.mean(X,axis=0)[0].todense()
            mp=map(lambda x: float(mul*angulareDistance(x[0].todense(),mvect)),X)
            res=sum(mp)
        except:
            res=0
    elif X.shape[0]==1:
        res=1
    else:
        res=0
    return res
def calculateQueuedCosines(x):
    d=[]
    for j in xrange(len(x[1])):
        d.append(cosine(x[0].todense(),x[1][j]))
    return d.index(max(d))

def performWK(X):
    '''
    This function performs WK for a map and requires an input matrix, predictions from KMeans, and a k value.
    0=>ref
    1=>preds
    2=>K
    '''
    preds=X[1]
    results=[]
    for currk in range(X[2]):
        set=None
        for i in range(X[0].shape[0]):
            if preds[i] == currk:
                if set is None:     
                    set=scipy.sparse.csr_matrix(X[0][i])
                else:
                    set=scipy.sparse.vstack((set,X[0][i]))
        res=0
        try:
            if set is not None:
                if set.shape[0]>1:
                    try:
                        mvect=set.sum(axis=0)/set.shape[0]
                        mp=map(lambda x: float(angulareDistance(x[0].todense(),mvect)),set)
                        res=sum(mp)
                    except:
                        res=0
                elif set.shape[0]==1:
                    res=1
        except Exception,e:
            print str(e)
        results.append(res)
        gc.collect()
        del gc.garbage[:]
    return results

def performTest(X):
    '''
    For performing tests in separate processes.
    
    Requires a tuple with
    0=>ks
    1=>mtx
    '''
    kmn=KMeans()
    kmn.n_clusters=X[0]
    kmn.fit(X[1])
    preds=kmn.predict(X[1])
    return preds
	
def mapWKResults(X):
    return numpy.log(sum(numpy.nan_to_num(X)))

I then use these functions to split the most intense processing among multiple cores. Keep in mind that this is aimed at text data. The distance formula becomes the angular distance formula based on cosines (1-2*acos(cos(veca,vecb))/pi).

  
def discoverCategories(self,file,refs=20,maxTime=600,ks=range(1,500),mx=20):
        '''
        A 1 to n estimator of the best categories using cosines to be used as  a backup.
        This is a sci-kit learn /python/scipy learning experience as well.
        
        Reference: https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/
        '''
        pool=Pool(psutil.cpu_count(logical=False))
        vectorizer=CountVectorizer(analyzer='word', binary=False, decode_error='strict',dtype=numpy.int64, encoding='utf-8', input='content',lowercase=True, max_df=1.0, max_features=None, min_df=1,ngram_range=(1, 1), preprocessor=None, stop_words=None,strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',tokenizer=None, vocabulary=None)
        logging.info("Starting KMN "+datetime.datetime.fromtimestamp(time.time()).strftime("%m/%d/%Y"))
        kmn=KMeans()
        lines=[]
        with open(file,'rb') as fp:
            lines=fp.read().split("\n")
            
        if len(lines)>0:
            mtx=vectorizer.fit_transform(lines)
            tfidf=TfidfTransformer(norm='l2',smooth_idf=True,sublinear_tf=False,use_idf=True)
            mtx=tfidf.fit_transform(mtx)
            del tfidf
            del vectorizer
            lines=[]
            del lines
            gc.collect()
            del gc.garbage[:]
            
            (ymin, xmin), (ymax, xmax)=self.bounding_box(mtx)
            Wks=[]
            Wkbs=[]
            sk=[]
            
            for k in ks:
                #get the overall WK
                print "Testing at Cluster "+str(k)
                tempk=[]
                kmn.n_clusters=k
                kmn.fit(mtx)
                preds=kmn.predict(mtx)
                sets=[]
                for currk in range(k):
                    set=None
                    for i in range(mtx.shape[0]):
                        if preds[i]==currk:
                            if set is None:     
                                set=scipy.sparse.csr_matrix(mtx[i])
                            else:
                                set=scipy.sparse.vstack((set,mtx[i]))
                    if set is not None and set.shape[0]>0:
                        sets.append(set)
                res=pool.map(Wk,sets)
                Wks.append(numpy.log(sum(res)))

                del set
                gc.collect()
                del gc.garbage[:]
                
                BWkbs=[]
                #generate individual sets and calculate Gap(k)
                mres=[]
                for i in range(refs):
                    print "Setting Up Test "+str(i)
                    ref=scipy.sparse.rand(xmax,ymax,format='csr')             
                    mres.append(ref)
                    if len(mres)==mx or (i+1==refs and len(mres)>0):
                        print "Performing Async Tests" #if we were to create our own distributed framework with pyHFS,asyncore, and a db
                        preds=pool.map(performTest,[[k,x] for x in mres])
                        res=pool.map(performWK,[[mres[j],preds[j],k] for j in range(len(mres))])
                        BWkbs.extend(pool.map(mapWKResults,res))
                        mres=[]
                        gc.collect()
                        del gc.garbage[:]
                    del ref
                    gc.collect()
                    del gc.garbage[:]
                s=sum(BWkbs)/refs
                Wkbs.append(s)
                sk.append(numpy.sqrt(sum((numpy.asarray(BWkbs)-s)**2)/refs))
        
        sk=numpy.asarray(sk)*numpy.sqrt(1+1/refs)
        return(ks,Wks,Wkbs,sk)

The results take a while, but there is a basic rule of thumb based on the non-zero values in the matrix for text data. This can be used to determine the maximum number of clusters to run, ensuring some buffer room.

  
  import numpy
  (mtx.shape[0]*mtx.shape[1])/numpy.count_nonzero(mtx)

The code splits up the most intense function into map and reduce tasks. Map reduce could be used in each case (please see the sum functions attached to the final result).

  
preds=pool.map(performTest,[[k,x] for x in mres])
res=pool.map(performWK,[[mres[j],preds[j],k] for j in range(len(mres))])
BWkbs.extend(pool.map(mapWKResults,res))

These are the KMeans runs, the Wk reduction across all vectors, and a slight performance boost from performing the final reduction on the result to obtain our Wk value for the overall equation.

Ideally, this process can be used to find the best number of categories or groups in a variety of algorithms such as fuzzy clustering with membership matrices or LSA.

Secure Your Data Series: Why a Captcha Alone Fails

Since captchas are meant to be unreadable by a computer, they are a great tool for better learning the task of OCR. As even Google now admits, Captchas are breakable. This is more concerning from a security standpoint, revealing that even an open source OCR like Tesseract can defeat this system. A little computer vision and some basic pre-processing in Python will break most Captchas. That is why I champion the use of a mapping and analysis of click stream data with Latent Dirichlet Allocation to classify human from non-human, or hacker from non-hacker (stay tuned, it's coming). Add the LDA approach to a captcha system with a higher probability of failure for automated processes (guessing here), use click stream data to form vectors (literal mathematical vectors), and security becomes a lot better.

Let’s do some Captcha breaking, but beware: this is purely educational and not for breaking the law! Many Captchas have sound options to comply with accessibility laws, and the simpler puzzles can be broken with sound recognition such as Sphinx4. However, the dilution of the sound in modern Captchas can make OCR useful for aiding the disabled. Basically, there are uses of this code that are likely to remain legal as companies look to narrow the definition of authorized access.

Image Preprocessing

Captcha images contain all sorts of clutter and manipulations with the goal of eliminating readability. This makes pre-processing critical to the goal at hand. Speed is the crucial consideration in this task so any library or custom code needs to be extremely efficient.

Two modules exist in Python that help with preprocessing. They are OpenCV (cv2) and pillow (PIL). My image toolset can also be used for these tasks.

OpenCV is a powerful open source library with the aim of making a lot of calculus and differential equation based code for computer vision incredibly easy to deploy. It runs extremely quickly. The modules are largely written in C and there is also a C++ API. OpenCV is great if you want to write custom code too, as the tutorials also dive deeply into the mathematics behind each program. For this case, functions from cv2 including resize (which goes further than basic expansion), Canny edge detection, and blurring are highly effective. After writing the same routines in Java and even using a graphics card library in Python to do the same tasks, it appears that OpenCV matches or is only slightly slower than the custom code. The images are small, though.

Other modules are incredibly good at performing contouring, corner detection, and key point finding. If you have a computer vision or artificial intelligence task, OpenCV is the go-to API.

import cv2

For basic pre-processing, pillow is also an incredibly fast library. Again, compared to my own Java code, the modules work at about the same speed. The idea behind them is the use of kernels, small matrices filled with weights that can be used to distribute color in an image via a window.

from PIL import ImageEnhance
from PIL import ImageFilter
from PIL import Image
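As a small illustration of the kernel idea, pillow lets you convolve an arbitrary 3x3 weight matrix over the image; the weights below are just a generic sharpening kernel, not values tuned for any particular Captcha:

    from PIL import Image, ImageFilter

    img = Image.open('captcha.jpg').convert('L')

    #a 3x3 sharpening kernel; each output pixel is the weighted sum of its window
    sharpen = ImageFilter.Kernel((3, 3), [0, -1, 0,
                                          -1, 5, -1,
                                          0, -1, 0], scale=1)
    img = img.filter(sharpen)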

All of the necessary pre-processing, whether custom or module based, can be completed in less than one second, producing the result shown below. However, it is necessary to fiddle with the images until they look as close as possible to the way a normal input would.

Overall, the total time taken to break a captcha ranges from roughly one second to four seconds on a dual core machine with 4 GB of RAM. Completing tasks with custom code may improve speed when using faster matrix libraries, but numpy is fairly efficient in today’s world.

One extremely useful trick is to resize the image and improve contrast.

    import numpy as np

    img=Image.open('captcha.jpg').convert('L')
    img=np.array(img,np.uint8).copy()
    img=cv2.resize(img,(img.shape[1]*2,img.shape[0]*2))
    img=Image.fromarray(img)

    enhancer=ImageEnhance.Contrast(img)
    img=enhancer.enhance(2) #enhance() returns a new image rather than modifying in place

If using numpy, there is an extremely useful way to apply a function to all pixels of an image as well using some Python magic.

   numpyimage[numpyimage<128]=0

Decluttering with Statistics

Certain transforms and other techniques may leave unwanted clutter in the image. Ridding an image of small or abnormally sized objects is performable with basic statistics. Remember that 1.5 standard deviations (sd = sqrt(sum((x-mean)^2)/n)) marks a normal outlier and 3 standard deviations an extreme outlier. This can be used to eliminate elements that are longer than others. The example below follows an object and eliminates it based on width and has proven successful. If vertical objects are present, surface area coverage may be a better consideration. These work better than contouring here because the images are not always connected properly. They need to be readable by a human and not a computer.

def declutter(self,inarr):
        """Declutter an image by removing dark runs whose width is an outlier"""
        arr=inarr
        height=len(arr)
        width=len(arr[0])
        
        total=0
        avg=0
        wsarr=[]
        
        #first pass: measure the width of each dark (value<128) run per row
        for i in range(height):
            account=False
            ws=0
            for j in range(width):
                if arr[i][j]<128 and account is False:
                    #entering a dark run
                    account=True
                    ws=j
                elif arr[i][j]>=128 and account is True:
                    #leaving a dark run
                    account=False
                    total+=1
                    avg+=(j-ws)
                    wsarr.append((j-ws))
        
        if total<2:
            return (arr,0)
        
        avg/=float(total)
        sd=0
        
        #calculate sd of the run widths
        for n in wsarr:
            sd+=((n-avg)*(n-avg))
        
        sd=math.sqrt(sd/(total-1))
        eo=1.5*sd   #extreme outlier threshold
        o=sd        #normal outlier threshold used below
        
        removed=0
        #second pass: blank out runs whose width falls outside avg +/- o
        for i in range(height):
            account=False
            ws=0
            for j in range(width):
                if arr[i][j]<128 and account is False:
                    account=True
                    ws=j
                elif arr[i][j]>=128 and account is True:
                    account=False
                    if (j-ws) > (avg+o) or (j-ws) < (avg-o):
                        for k in range(ws,j):
                            arr[i][k]=255
                        removed+=1
        print str(removed)+" objects removed"
        
        return (arr,removed)

 

Rotating with the Bounding Box and an Ode to OpenCV

In order to complete our work, it is a good idea to know how to find the minimum bounding box and rotate the image. This is made difficult in many instances by the fact that the letters in a Captcha are not always continuous black lines. Thankfully, OpenCV contains a multitude of feature detectors that can help outline or find key features for the different objects.

My initial attempts at this involved a gaggle of different methods. After attempts at using a threshold with Euclidean distances to produce a set further reduced by basic statistics revolving around centroids, and a few other methods, I arrived at contours. The Euclidean distance method would likely work on regular text lines but, here, characters are intentionally smashed together or unwanted lines mixed in. I kept getting double letters with different Captchas.  The result of using these methods is the feeling of frustration.

In contrast to the feeling of being lost, OpenCV’s contour class can find angles, bounding boxes that are rotated, and many other useful features. Contours rock.

A contour basically uses defined edges. Algorithms include using Latent Dirichlet Allocation to group objects found from edge detection and marching squares with defined edges following certain angles and patterns, much like a Haar cascade in a way. Obviously any AI implementation is much stronger.

   def getContours(self,img):
        """
        Get Contours from CV2
        """
        hsv_img=cv2.cvtColor(img,cv2.COLOR_BAYER_BG2BGR)
        hsv_img=cv2.medianBlur(hsv_img,5)
        kernel=np.ones((5,5),'uint8')
        hsv_img=cv2.dilate(cv2.erode(hsv_img,kernel,iterations=3),kernel)
        
        COLOR_MIN=np.array([255,255,255],np.uint8)
        COLOR_MAX=np.array([255,255,255],np.uint8)
        
        frame_threshed=cv2.inRange(hsv_img,COLOR_MIN,COLOR_MAX)
        ret,thresh=cv2.threshold(frame_threshed,127,255,0)
        return cv2.findContours(thresh,cv2.RETR_TREE,cv2.CHAIN_APPROX_SIMPLE)

With the contours found, it is a good idea to discover any patterns in spacing and use those to combine the boxes with an expanded intersection condition and a while loop.
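As an illustration of that step (not the code used in the post), here is a sketch that greedily merges (x, y, w, h) boxes from cv2.boundingRect whenever they intersect after being expanded by a small pixel gap:

    def mergeBoxes(boxes, gap=5):
        """Greedily merge (x,y,w,h) boxes that overlap once expanded by gap pixels."""
        boxes = list(boxes)
        merged = True
        while merged:
            merged = False
            out = []
            while boxes:
                x, y, w, h = boxes.pop(0)
                i = 0
                while i < len(boxes):
                    x2, y2, w2, h2 = boxes[i]
                    #expanded intersection test
                    if x <= x2 + w2 + gap and x2 <= x + w + gap and \
                       y <= y2 + h2 + gap and y2 <= y + h + gap:
                        nx, ny = min(x, x2), min(y, y2)
                        w = max(x + w, x2 + w2) - nx
                        h = max(y + h, y2 + h2) - ny
                        x, y = nx, ny
                        boxes.pop(i)
                        merged = True
                    else:
                        i += 1
                out.append((x, y, w, h))
            boxes = out
        return boxes

Letters split apart by the Captcha's noise end up sharing one box; anything still separate is treated as its own character.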

Now, we can rotate our individual letters. The equation here is simple and the code can be written easily. The following kernel is useful for rotation about the center of an image.
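That kernel is the standard 2D rotation matrix: a pixel at (x, y), measured from the image center, maps to

    \begin{pmatrix} x' \\ y' \end{pmatrix} =
    \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
    \begin{pmatrix} x \\ y \end{pmatrix}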

The code for the equation in scipy is as follows. It is possible to use the above matrix on each pixel to discover the new location as well. I wrote the kernel into the sample.

   def rotate(self,img,cnt):
        '''
        Utilizes contouring via cv2 to recognize letters and rotate them.
        Takes in a numpy array to work with and returns a set of numpy arrays.
        This will rotate on a provided contour.
        
        *Required Parameters*
        
        :param img: image as a numpy array
        :param cnt: contour to rotate on
        '''
        im=img
        try:
            #get the basic points
            box=np.int0(cv2.boxPoints(cv2.minAreaRect(cnt)))
            x0,y0=box[0]
            x1,y1=box[1]
                    
            
            #get the rotational angle
            degree=-1*(180*math.atan((y1-y0)/(x1-x0)))/math.pi
            
            print "Rotational Degrees: "+str(degree)
        except:
            return im
        
        #rotate with interpolation in scipy
        return ndimage.rotate(im,degree,mode='nearest',cval=100)

Another, more powerful rotation uses the same ndimage (whose rotation algorithm likely interpolates much better than simply multiplying [(x,y),(x1,y1)] by [(x),(y)]) and crops. This uses a best fit line whose slope is found using one of a variety of least squares calculations. The cv2 function used returns a unit direction vector (vx,vy) for the line along with a point (x,y) on it.

   def rotateFromPixels(self,image,cnt,tolerance=(10*math.pi)/180):
        """
        Takes in contour information and rotates based on
        the line of best fit through contour points. I discovered
        that this could be done after getting the original box though
        the box may reduce the effect of large skews.
        
        This program also considers the minimum area rectangle in 
        case a letter is actually skewed only stretched (skew not detectable from box). I'm guessing
        that these will not always be the same since a statistical best fit
        is not always the same as finding the absolute max and min points.
        
        *Required Parameters*
        :param image: the image to rotate
        :param cnt: the contour information
        
        *Optional Parameters*
        :param tolerance: the allowable tolerance (deviation from 0 degrees in radians)
        """
        d90=(90*math.pi)/180
        box=np.int0(cv2.boxPoints(cv2.minAreaRect(cnt)))
        x0,y0=box[0]
        x1,y1=box[1]
        print str(math.atan((y1-y0)/(x1-x0)))
        boxpheta=math.atan((y1-y0)/(x1-x0))
        image2=image
        print "BoxPheta: "+str(boxpheta)
        
        if abs(boxpheta-d90) > tolerance:
                
            [vx,vy,x,y]=cv2.fitLine(cnt,cv2.DIST_L2,0,0.01,0.01)
            
            slope=-1*(vx/vy)
            
            #find the perpendicular slope to the given slope
            
            i=1
            
            if vx[0]==0 and vy[0]==1:
                return image2
            else:
                print "Slope Points: "+str(vx[0])+","+str(vy[0])
                print "Slope: "+str(slope)
                
                Image.fromarray(image2).show()
            
                pheta=math.atan(slope)
                
                print "Pheta: "+str(pheta)
                print "Pheta (degrees)"+str((pheta*180)/math.pi)
                i+=1
                time.sleep(random.randint(2,4))
                print "\n\n\n\n\n"
            if vx[0] >0:
                image2=ndimage.rotate(image2,(pheta*180)/math.pi,mode='nearest',cval=100)
                Image.fromarray(image2).show()
        return image2 
    

If rotation is necessary, it may be possible to stick letters in a set and then attempt to read them individually rather than stitch an image together. Since a white background is present, it is also possible to rotate the letters and stick them back into the image. Please read the alignment section below to see how this is done.

Alignment

Aligning individual images is normally necessary, especially for systems considering shapes that use LDA or a similar AI task to learn different groupings. If a u is too far above a preceding letter like an S, the response from the system may be ” instead of u.

Using cv2, it is not difficult to align letters that are nearly good enough to run through OCR. OpenCV includes powerful contouring tools based on machine learning techniques. Contouring allows individual letters to be discerned, allowing for even more techniques to be applied, such as applying a rotational matrix to the image matrix as described in the rotation section.

The following code exemplifies the process of alignment. It does not, however, consider whether a letter has a stem.

   def alignLetters(self,img,maxBoxArea=None,minBoxArea=None,printBoxArea=False,imShow=False):
        """
        Align Letters in a Pre-Processed Image. Options are provided
        to limit the size of accepted bounding boxes, helping eliminate 
        non-contours and the usual box covering the image as a whole.
        
        Returns a PIL Image.
        
        *Required Parameters*
        
        :param img: numpy array image to use
        
        
        *Optional Parameters*
        
        :param maxBoxArea: maximum bounding box area to accept (None is a non check)
        :param minBoxArea: minimum bounding box area to accept (None is a non check)
        :param printBoxArea: boolean to state whether to print box areas (list trimmed if maxBoxArea or minBoxArea set)
        :param imShow: show the individual contoured images and final image
        """
        #setup image
        width=img.shape[1]
        height=img.shape[0]
        
        image,contours,hierarchy=self.getContours(img)
        
        idicts={}
        istarts=[]
        
        maxheight=0
        maxwidth=0
        numboxes=0
        
        #get the contours and bounding boxes and filter out bad bounding boxes
        for cnt in contours:
    
            x,y,w,h=cv2.boundingRect(cnt)
            
            #obtain only bounding boxes meeting a certain criteria
            if (maxBoxArea is None or w*h<maxBoxArea) and (minBoxArea is None or w*h>minBoxArea):
                if printBoxArea is True: 
                    print str(w*h)
                
                numboxes+=1
                
                if h>maxheight:
                    maxheight=h
                
                if w>maxwidth:
                    maxwidth=w
                
                istarts.append(x)
                
                image2=img[y:y+h,x:x+w]
                idicts[x]=image2
        
        istarts=list(sorted(istarts))
        endImg=Image.new('L',(width,height),'white')
        
        #write each box to a new image aligned at the bottom
        minHeight=sys.maxint
        maxWidth=0
        for x in istarts:
            imgx=idicts[x].shape[1]
            imgy=idicts[x].shape[0]
            img=Image.fromarray(idicts[x])
            
            if imShow is True:
                img.show()
                
            if x+imgx>maxWidth:
                maxWidth=x+imgx
                
     
            for i in range(x,x+imgx):
                height=maxheight-1
                iheight=imgy-1
                miny=height-imgy
                
                #track the topmost row written so the final image can be trimmed
                if miny<minHeight:
                    minHeight=miny
                
                #copy the letter column from the bottom up so letters align at the bottom
                while iheight>=0 and height>=0:
                    endImg.putpixel((i,height), img.getpixel((i-x,iheight)))
                    height-=1
                    iheight-=1
        
        #trim
        endImg=endImg.crop((0,minHeight, maxWidth,maxheight))
        
        if imShow is True:
            endImg.show()
        
        return endImg

Some Final Processing

Your image should either be split up and stored as a set of characters or look fairly discernible by this point. If it is not, take the time to do some final pre-processing. Using pillow (PIL) to expand edges and eliminate disconnects is one important task for final processing. Try to make the image look like newsprint.

Tesseract: The Easy Part

Now, the image is ready for Tesseract. The command line code can be run via the subprocess library from a temporary file or a pipe can be established directly to Tesseract. The OCR splits the letters into key features, clusters on them, and then either compares the offsets of characters it finds to a letter set or runs a co-variance comparison. I personally will test the co-variance approach in a later article and am building a massive training set.

If you have a set of letters, be sure to use the -psm option and set it to 10. This tells Tesseract to perform single character analysis.
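For example, from the command line (older Tesseract builds take a single-dash -psm as shown here; newer ones use --psm), with the result written to out.txt:

    tesseract letter.png out -psm 10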

Running Tesseract is simple with pytesser.

  import pytesser
  from PIL import Image
  response=pytesser.image_to_string(Image.open('image.jpg').convert('L'),cleanup=True)

Please. Do not Spam. I am not responsible for the misuse of this Code. This is purely educational!!!
P.S.

For shits and giggles, here is an LDA algorithm that can be used with a covariance matrix and your own really, really big set of letters from scikit-learn. Python really is excellent for AI, much easier to use than Java if you want infinite matrix size thanks to gensim, numpy, and scikit-learn. Dare I say numpy is even faster at matrix solving than almost all Java packages when not using a tool like Mahout, due to the use of BLAS and LAPACK when possible. That is in part why I used Tesseract. The other reason is that Google runs or ran a Captcha service while also supplying this really amazing OCR product.

Morning Joe: Categorizing Text, Which Algorithms are Best in What Situations

There is a ton of buzz around data mining and, as always, many new names being injected into the same topics, showing a lack of study. While the buzz exists, knowing when to deploy an algorithm can be tricky. Based on a recent deep dive into the subject and a dire need to program these algorithms in Java, I present a brief overview of tools, with benchmarks and examples to hopefully follow later. Basic concepts are presented first, followed by some algorithms that I have not really honed yet (benchmarks are not feasible for now but will be soon).

To read this thoroughly, I highly recommend following the links. This is a starting point and a summary of what I have found so far. Basically, the goal is to help avoid the hours it takes to find information on each topic and do some stumbling by reading this document.

A word of caution: effective matrix based solutions use a large amount of data. Other fuzzy algorithms exist for discovering relate-ability between small sets of items. For strings, there is distance matching such as Jaro-Winkler or Levenshtein, with rule based comparisons and lookup tables to minimize error (say between Adama and Obama). Statistics can enhance this process if there is a need to take the best rating. Train a model to test the hypothesis that the distance between two entities makes them the same, as opposed to the null hypothesis that it does not, after filtering out some common issues.

The Matrix

Linear algebra is a critical foundation of text mining. When we have different documents or images, each document or image is often considered to form an equation. The equations can then be presented in a matrix, a simple and easy way to get rid of thinking in really complex terms.

You may have heard of the vaunted differential equation. If not, some reading is in order from a site I used in college when there was not time for the book. A large portion of differential equations can be written in a matrix. This is important because of the eigenvector and eigenvalue. To be sure, these concepts are crucial for solving matrices to find equations that explain a set of models. Drexel University's eigenfaces tutorial provides a fairly solid understanding of the way a matrix is used in most text mining. However, similarity ratings are used to compare documents rather than a co-variance matrix for most tasks.

The end result of studying these methods is the ability to look under the hood at today’s hottest text mining technologies.

The Vector Space and TF-IDF

Understanding vectors and vector operations is another crucial step in understanding the mining process. A basic vector is a set of points representing a position in a plane. Vectors can be added, subtracted, multiplied, and, most importantly, stuffed in a matrix where their points can be used with basic linear algebra to find relationships.

Vectors have magnitude and direction, and their angles and distances can be compared. Note that, while data loss may be present in finding the right magnitude and distance, and the units used should be the same (it would be a terrible idea to think in terms of, say, millisecond-meter-documents-ice cream cones), this representation provides a sane basis for considering data. It is up to the miner to choose the most representative points for use in the vector.

In text mining, the term frequency-inverse document frequency rating is used in many commercial algorithms including search engines. If the name is not enough, it is basically a term's frequency within a document weighted by how rare the term is across all documents. It works best on more than one document, and an offset of 0.5 for term frequency helps dampen the effect of large documents by bumping up the rating of less frequent terms. Inverse document frequency utilizes a logarithm function to keep the rating within a sane range.

Multiply the following two equations together to find the result, as described by Wikipedia:
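In the standard formulation, with the 0.5 augmented-frequency offset mentioned above:

    \mathrm{tf}(t,d) = 0.5 + \frac{0.5 \, f_{t,d}}{\max_{t' \in d} f_{t',d}}
    \qquad
    \mathrm{idf}(t,D) = \log\frac{N}{\lvert \{ d \in D : t \in d \} \rvert}
    \qquad
    \mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \times \mathrm{idf}(t,D)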

Similarity Ratings

No matter what you do, similarity ratings are the key to making the process work. There are several that can be used. If the data can be represented fairly well, co-variance is an option. However, text data is not that well suited to using co-variance. This is due to varying styles that represent the same object and, most importantly, issues with quantization. Natural language is naturally fuzzy. Therefore, cosine similarity usually offers a much better solution.

The cosine equation takes the dot product of two vectors and divides by the product of their norms. It follows from vector algebra. The result is the cosine of the angle between the vectors, representing the 'degree' of similarity. This can be used for comparison.
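In symbols, for two document vectors A and B:

    \cos\theta = \frac{A \cdot B}{\lVert A \rVert\, \lVert B \rVert} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}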

Word Net, Disambiguation, and Stemming

The processes of disambiguation and stemming are crucial to text mining. There are many sentence processing methods, as NLTK shows. At their core are WordNet and other dictionaries. WordNet is a freely available graph of an English dictionary. Most tools work with WordNet for finding root words, disambiguation, and cleaning.

Part of speech (POS) tagging is involved in both disambiguation and stemming. Maximum entropy models are used to discover a part of speech based on common usage.

Disambiguation attempts to resolve words with multiple meanings to their most probable meaning. The weakest algorithm is the original Lesk, which involves only the use of WordNet; accuracy hovers around 50 percent. Simplified Lesk achieves better results. Lesk finds overlapping words and frequencies to determine the best synonym to replace an ambiguous word. The better algorithms use clustering and Bayesian methods for word sense discovery. Cosines may be used to improve Lesk as well.

Stemming reduces words to their roots. Most WordNet tools use existing classifications with POS tagging to achieve this result.

A Note on Regression Models

Let's be certain: prediction is not well suited to categorization. Changes in word choice across a large number of documents and decisions on importance do not always mean the same thing. Therefore, regression models tend to work poorly. The data is not likely continuous either. Think of writing like a magnetic field with eddy currents. Predicting the effect of an encounter with these currents is really, really difficult. Basically, run into an eddy current and you are going to have a really, really bad day. That is not to say that an equation cannot be created that fits most of the data with respect to the location of a point, basically a differential equation. It will likely not be generic and will be incredibly difficult to find.

Regression works well on continuous and more natural events.

Classification Tree and a Random Forest

Another often poor performer in the categorization of text data is the classification tree. Trees are only as good as the number of rules you are willing to create. However, they may be combined with multinomial Bayes for writing that is uniform and professional (say a legal document) to achieve some success. They are particularly useful after filtering data using LSA/HDP or multinomial Bayes, with decisions that work like a Bayesian model when thinking about the bigger picture.

Basically, a classification tree uses probabilities within groupings to ascertain an outcome, moving down to the appropriate left or right child node based on a yes or no answer to the question "do you belong?"

This process works well with defined data when there is a good degree of knowledge about a subject (say gene mapping), but text mining often uses fuzzy data with multiple possible meanings, and disambiguation is not entirely accurate; Lesk's original algorithm only achieved 50 percent accuracy, and an LSA model hovers between 80-90%. Improving the quality can be done with multiple trees or possibly by training on an extremely large set using cosines instead of raw frequency.

There are multiple methods for tree creation; two are random forests and bagging. Bagging takes multiple trees and averages the probabilities for decisions, using this average for their respective nodes. Random forests find random subsets of features, find probabilities based on them, and select the stronger predictor for a node. The latter approach is best with a much larger set of known features. The number of features considered at each split is usually the square root of the total number of features.

Again, the features must be known and fairly homogeneous. Text data is often not.

Multinomial Bayesian Classifier

Multinomial Bayesian classification is a method that classifies data based on the frequency of words in different categories and their probabilities of occurrence. It is fairly straightforward: find the frequencies or train a set of frequencies word by word or gram by gram (a gram being an n-pairing, and thus n-gram, of words), find probabilities by sentence, and take the best one.

MNB works well when writing differs starkly, say with subject matters that differ greatly. It is good for tasks such as separating spam from policies and code in html data when large amounts of training data are present.
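As a quick illustration of the idea (not the author's code), here is a scikit-learn sketch using MultinomialNB with a count vectorizer and placeholder training data:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    #placeholder training data: documents and their known categories
    docs = ["cheap pills buy now", "meeting agenda for tuesday", "win a free prize"]
    labels = ["spam", "work", "spam"]

    vectorizer = CountVectorizer(ngram_range=(1, 2))  #unigrams and bigrams
    X = vectorizer.fit_transform(docs)

    clf = MultinomialNB()
    clf.fit(X, labels)

    print(clf.predict(vectorizer.transform(["free pills now"])))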

Clustering with LSA or HDP

Clustering works well when something is known about the data but categorization is not easily done manually. Most algorithms avoid affinity propagation, which usually arrives at roughly the square root of the total inputs as the number of clusters anyway. Matrices are used heavily here, as eigenvalues and eigenvectors derive an equation that can be used to find relate-ability between documents.

LSA uses raw frequencies, or more effectively cosines, in the same manner as eigenfaces to compare vectors. The end result, however, is an equation representing a category. By matrix decomposition and multiplication, all elements are compared. In this case each ij entry in the matrix is a cosine or frequency of a word in a document. HDP (hierarchical Dirichlet process) is similar but attempts to learn more about the results and improve on the process. It takes much longer than LSA and is experimental.

If trying to discover new information about text or trying to find the best fit of a number of categories, these methods are useful.
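A hedged sketch of LSA with scikit-learn: reduce a TF-IDF matrix with truncated SVD and cluster the reduced vectors (the documents and the choice of k here are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    docs = ["text of document one", "text of document two", "text of document three"]

    #TF-IDF matrix: one row per document, one column per term
    tfidf = TfidfVectorizer().fit_transform(docs)

    #LSA: project the documents onto the top latent components
    lsa = TruncatedSVD(n_components=2)
    reduced = lsa.fit_transform(tfidf)

    #group the reduced vectors; k chosen here only for illustration
    print(KMeans(n_clusters=2).fit_predict(reduced))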

Maximum Entropy Models

Maximum entropy models work well on heterogeneous data in a manner similar to Bayes. Gensim's sentence tagger classifies sentences from non-sentences in this way. The models find entropy using the maxent principle, which uses frequencies and likelihoods of occurrence to find outcomes. It works quite well with the correct training sets.

If conditional independence is not assumable and nothing is known about a set, this model is useful. Categories should be known beforehand.

Tools

Java

Python

Common Resources

Avalanche Data Part I: Collecting and Analyzing Public Information for Patterns

It has been a goal of mine for a while to collect and analyze publicly available avalanche data to discover patterns and raise awareness. My home state of Colorado is a disastrous combination of climate, tourists, newcomers, and testosterone junkies of varying IQ who put little to no thought into jumping on our Continental slopes. The result can be 20 fatalities each winter season. While the availability of public information is appalling, I did manage to wrangle together a large database with hundreds of incidents, millions of weather records, and a variety of locations across many different states.

As of today, this data is available via POST or by visiting my website; be kind, as I will ban people bogging down my small amount of resources and may even retaliate. Use Wireshark or Firebug to decipher the request format. The port will hopefully go away once I set up Apache; port forwarding is not allowed by my hosting service and I needed a bizarre install of Tomcat, which is missing the configuration file, with authbind.

My goals for this project were simple: establish an automated network for data collection, ETL, and the combination of the data into a relational database. That database is then analyzed using a set of open source tools and custom code, with Apache Commons Math used for clustering and some of the statistical analysis.

Attributes I Needed and What I Found

I wished for prediction, so I needed everything from crystal type to weather patterns. Avalanche, crown, base layer type, depth, path size, destructive force, terrain traps, and a variety of other factors are important. Regression testing on what I did receive showed previous storms, terrain traps, and the base layer to be the most important factors for size and destructive force.

However, this data was dirty, not cleanable without great expense, and somewhat unreliable. Only two sites reliably reported crown depth, width, and even base layer. Agencies are likely not forthcoming with this information since it relates directly to sales revenue.

Only the weather data, which can be acquired from many government sites, was forthcoming.

Web Framework

I decided to use Tomcat as the servlet container to deploy my WAR. This is only my second attempt at Spring. Tomcat is incredibly fast, as evidenced by my site. Spring is a decent framework for handling requests, requiring much less code when set up than most other frameworks. In particular, the request handling is of significant benefit. POST requests are handled with Gson.

Below is a basic request mapping:

        @RequestMapping(value = "/", method = RequestMethod.GET)
	public String home(Locale locale, Model model) {
		//The Request Mapping
		ServletRequestAttributes requestAttributes = ((ServletRequestAttributes) RequestContextHolder.currentRequestAttributes());
		String ip=requestAttributes.getRequest().getRemoteAddr();
		
		//my ip should be all zeros for local host at the moment and so I need to change it 
		if(ip.contains("0:0:0:0:0:0:0:1")||ip.contains("127.0.0.1")){
			//ip detection
		}
		
		
	
		ClassPathXmlApplicationContext ctx=new ClassPathXmlApplicationContext("avy.xml");
		
		GeoDBPath gpath=(GeoDBPath) ctx.getBean("GPS");
		
		GeoServicesDB geodb=GeoServicesDB.getInstance(gpath.getFpath());
		ArrayList<Double> coords=geodb.getGPS(ip);
		
		double latitude=coords.get(0);
		double longitude=coords.get(1);
		
		AvyDAO avyobjs=(AvyDAO) ctx.getBean("AvyDAO");
				
		
		List<Avalanche> avs=avyobjs.findAll(latitude,longitude);
		String outhtml=""; 
                //table head 
                  int i=0; 
                  for(Avalanche av: avs){ 
                    if((i%2)==0)
                   { 
                     //my table 
                    } else{ 
                     //my table 
                    } i++; 
               } 
               //end table 
               model.addAttribute("avyTable",outhtml.replaceAll("null","")); return "home"; 
          }

The Tools That Worked

Standard deviations, elementary statistics, and other basic measures are easily handled with custom code. Fast clustering algorithms and more complex math that can be made more efficient are handled well by Apache Commons Math.

Clustering is of particular interest. Commons Math does not have affinity propagation but does have a quick k-means clusterer, a downer when wanting to discover patterns without known relations. However, the relations can be estimated using sqrt(n/2) centroids. This is the number affinity propagation often chooses. With this method, it is possible to obtain decent relations in the time taken to process a simple POST request.

The Collection Process

Data collection resulted in an automated set of interrelated scrapers, ETL processes, and triggers. Scrapers were set up for nearly every reporting site available. This meant that the American Northwest, Alaska, California, and British Columbia were the only sites available for collection. The Colorado Avalanche Information Center and Utah's avalanche center were the best in terms of data, with Utah providing a wide range of attributes. This data was fed to weather collecting scrapers and finally to an ETL process. I wrapped the entire process in a Spring program and configuration file.

The Final Product and v2.0

My final product is a site that delivers reports on incidents, weather, and other key factors as well as the opportunity to cluster what little complete data there is in your region. A heat map and a Google map show the incident locations. I will hopefully include relevant date filters and eventually graphs and predictions as the data grows stronger and more numerous. Weather is currently averaged from the two weeks before an avalanche event. However, this could grow to accommodate historical patterns. Weather data may be the only solidly available data at the present time and so will likely be incorporated sooner than predictive features.

A Plea for Better Data From Avalanche Centers and Why No Predictions are Available

In the end, I was appalled by the lack of data. People die because they know nothing of the conditions generating avalanches. I myself have felt the force of a concussion wave rippling my clothes from half a mile away. This must end. Selling data should not take precedence over safety. My site suffers at the moment from poor reporting, a lack of publicly available clean data, and simple mis-reportings not caught in ETL. My data set actually shrank in cleaning from 1000 to 300 avalanches across the entire Northwest.

Still, weather data was incredibly public. The Natural Resources Conservation Service, which sells data, is a powerful tool when placed in combination with the National Oceanic and Atmospheric Administration and Air Force weather station data sets.

Overall, I can only offer public clustering because of this poor data. Over time, this may change as clusters become more distinct and patterns and predictions grow more powerful. At this point, however, I would feel personally responsible for someone's untimely end. I have tried running multiple regression on this topic before, but the results were dismal. While the data is better than it was in 2011, the data sets still need improvement.

The Future

I have no intention of stopping collection and will document my development work on the project here. I also plan to document any attempts to develop a device that combines the data it collects with my weather and other data to make predictions about the snowpack.

Morning Joe/Python PDF Part 3: Straight Optical Character Recognition

*Due to time constraints, I will be publishing large articles on weekends and a small daily article for the time being.

Now we start to delve into PDF images, since the PDF text processing articles are quite popular. Not every PDF can be stripped using straight text conversion, and the biggest headache is the embedded image. Luckily, our do "no evil" (heavy emphasis here) friends maintain Tesseract, which, with training, is also quite good at breaking their own paid CAPTCHA products, to my great amusement and my company's profit.

A plethora of image pre-processing libraries and a bit of post-processing are still necessary to complete this task. Images must have high enough contrast and be large enough to make sense of. Basically, the algorithm consists of saving an image, pre-processing it, running optical character recognition, and then performing clean-up tasks.

Saving Images Using Software and by Finding Stream Objects

For Linux users, saving images from a PDF is best done with the Poppler utilities (poppler-utils), which ship with Fedora, CentOS, and Ubuntu and save images to a specified directory. The command format is pdfimages [options] [pdf file path] [image root]. Options are included for specifying a starting page [-f int], an ending page [-l int], and more. Just type pdfimages into a Linux terminal to see all of the options.


pdfimages -j /path/to/file.pdf /image/root/

To see whether a PDF contains images, type pdfimages -list [pdf file path].

Windows users can use a similar command with the open source Xpdf tools.
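
Since the rest of this pipeline is written in Python, the same command can be driven from a script. Below is a minimal sketch using the standard subprocess module; the paths are placeholders.

    #! /usr/bin/python
    # Minimal sketch: invoke pdfimages from Python (paths are placeholders).
    import subprocess

    pdf_path = "/path/to/file.pdf"
    image_root = "/image/root/page"

    # -j writes JPEG files where possible; other images are written as PNM files.
    subprocess.call(["pdfimages", "-j", pdf_path, image_root])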

It is also possible to use the magic numbers I wrote about in a different article to find the images yourself: iterate across the PDF stream objects, find the starting and ending bytes of each image, and write them to a file with open().write(). A stream object is the way Adobe embeds objects in a PDF and is represented below. The string find method can be used to check that streams exist, and the regular expression re.finditer("(?mis)(?<=stream).*?(?=endstream)", pdf) will find all of the streams. A sketch of this approach follows the example below.


stream

....our gibberish looking bytes....

endstream
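
As a rough sketch of that approach, the snippet below scans the stream objects for the JPEG start and end markers (0xFFD8 and 0xFFD9) and writes whatever it finds to disk. Real PDFs often compress their streams, so this only catches raw JPEG data and is a starting point rather than a complete extractor.

    #! /usr/bin/python
    # Sketch: pull embedded JPEGs out of PDF stream objects using magic numbers.
    import re

    with open("/path/to/file.pdf", "rb") as f:
        pdf = f.read()

    count = 0
    for match in re.finditer(b"(?s)(?<=stream).*?(?=endstream)", pdf):
        data = match.group(0)
        start = data.find(b"\xff\xd8")  # JPEG start-of-image marker
        end = data.rfind(b"\xff\xd9")   # JPEG end-of-image marker
        if start != -1 and end != -1 and end > start:
            with open("image_%d.jpg" % count, "wb") as out:
                out.write(data[start:end + 2])
            count += 1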

Pre-Processing

Python offers a variety of extremely good tools via Pillow that eliminate the need for the hand-coded pre-processing found in my image tools for Java.

Some of the features that pillow includes are:

  1. Blurring
  2. Contrast
  3. Edge Enhancement
  4. Smoothing

These classes should work for most PDFs; a brief example follows. For more, I will be posting a decluttering algorithm in a CAPTCHA-breaking post soon.
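
As a small illustration of those features, the sketch below converts to grayscale, boosts contrast, enhances edges, and smooths before saving; the enhancement factor is an arbitrary example value, not a tuned setting.

    #! /usr/bin/python
    # Sketch: simple Pillow pre-processing before OCR (factor value is arbitrary).
    from PIL import Image, ImageEnhance, ImageFilter

    im = Image.open("/path/to/image.jpg")
    im = im.convert("L")                         # grayscale
    im = ImageEnhance.Contrast(im).enhance(2.0)  # raise contrast
    im = im.filter(ImageFilter.EDGE_ENHANCE)     # enhance edges
    im = im.filter(ImageFilter.SMOOTH)           # smooth remaining noise
    im.save("/path/to/preprocessed.jpg")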

For resizing, OpenCV includes a resize function whose interpolation options avoid pixelation.


#! /usr/bin/python
# Resize with OpenCV; cv2.resize expects (width, height), and cubic
# interpolation keeps the enlarged image from looking pixelated.
import cv2

im = cv2.imread("impath")
im = cv2.resize(im, (im.shape[1] * 2, im.shape[0] * 2), interpolation=cv2.INTER_CUBIC)

OCR with Tesseract

With a subprocess call, or with pytesser (which wraps the same subprocess call to Tesseract and handles it reliably), it is possible to OCR the document.


#! /usr/bin/python
# OCR an image with pytesser, which shells out to the Tesseract binary.
from PIL import Image
import pytesser

im = Image.open("fpath")
string = pytesser.image_to_string(im)

If the string comes out as complete garbage, try using the pre-processing modules to fix it or look at my image tools for ways to write custom algorithms.

Post-Processing

Unfortunately, Tesseract is not a commercial-grade product for PDF OCR; it will require post-processing. Fortunately, Python provides a terrific set of modules and capabilities for dealing with data quickly and effectively.

The regular expression module re, list comprehension, and substrings are useful in this case.

An example of post-processing (continuing from the previous section) would be:


import re

# Split the OCR output into lines and discard obvious junk lines.
lines = string.split("\n")
lines = [x for x in lines if "bad stuff" not in x]

# Keep only lines matching a pattern, fixing known OCR mistakes as we go.
results = []
for line in lines:
    if re.search("pattern ", line):
        results.append(re.sub("bad pattern", "replacement", line))

Conclusion

It is definitely possible to obtain text from a PDF using Tesseract, though post-processing is a requirement. For documents that prove difficult, commercial software is the best solution, with SimpleOCR and ABBYY FineReader offering quality options. ABBYY offers the best quality in my opinion, but it comes at a high price: a quote for an API just for reading PDF documents came in at $5,000. I have managed to use the Tesseract approach successfully at work, but the process is time consuming and not guaranteed.

Morning Joe: Python-Is the Language Everyone is Trying to Create Already Around?

Node.js, Ruby on Rails, VBScript, and more are all seemingly targeted at creating an easier-to-code platform to replace Java, C++, and C# as the dominant languages in certain spheres, such as web and server development, where quick deployment may matter more than a faster-running language (Java and C# run at roughly the same speed, by the way). Each is hailed as a smaller, more compact language, and yet each is fairly slow. It appears that the language that actually earned the title of simple-to-write replacement that runs well on today's fast, memory-rich machines came around in 1990. Meet Python, the utility-knife language that can claim "already done it" or "already on it" for just about every supposed advancement in programming over the past 34 years. It is a strong scientific language in addition to being quick to deploy and easy to learn. To borrow the board game Othello's slogan (a minute to learn but a lifetime to master), it is truly a language for all comers.

The language is compact, dynamically typed, and can replace just about any Java code in half as many lines or fewer.

Advantages

  1. Small and compact: most Python code is aimed at quick and easy deployment. Modules exist for just about anything (see the Python Package Index or even your local Linux yum or rpm repository).
  2. Cross-platform.
  3. map, filter, and reduce are already included in the standard modules (reduce lives in functools in Python 3).
  4. Dynamic typing allows types to be inferred at runtime and variables to be reused more easily.
  5. It supports nearly every server implementation and has for decades, from Apache to Nginx.
  6. It works with Spring (supported for Python 2.7 and higher) and has a dedicated web framework in Django.
  7. Lambdas and list comprehensions existed well before Java 8, reducing code such as for loops to a single line (see the sketch after this list).
  8. Python is both a scripting and an object-oriented language, giving access well beyond running code from the command line.
  9. Major organizations and key players have created applications ranging from Google App Engine to OpenCV's computer vision algorithms and PyCUDA for accessing NVIDIA graphics cards.
  10. Engineers and scientists are increasingly using the language with low-cost boards (the Raspberry Pi and MicroPython boards are two Python-friendly examples).
  11. Modules are easy to deploy since all programs are already modules, much like Java classes accessing each other.
  12. Code can be deployed without pre-compiling anything (a benefit for quick iteration and a curse for runtime speed).
  13. Extensive support for statistical and mathematical programming as well as graphing and image manipulation with matplotlib, OpenCV, NumPy, and SciPy.
  14. Can manage and create subprocesses easily.
  15. Network programming, web server programming with frameworks or CGI/FastCGI, multi-processing, and other server development are less code-intensive compared to C# and Java.
  16. Installing new packages can be done without an internet search using easy_install or pip (from the PyPI repository) and, on Linux, from the rpm or yum repositories.
  17. Includes support for interoperating with other languages and runtime environments, the JVM included, using Cython for C, IronPython for .NET, and Jython for Java.
  18. Modules are available for most if not all databases, from Oracle and PostgreSQL to MySQL, as well as big data stores such as Cassandra.
  19. Developed by an open source community and overseen by a committee, with a budding certification track co-developed by the publisher O'Reilly and the chair of the Python Software Foundation. The University of Washington also offers a certificate. I consider this more a testament to the popularity and extensive coverage of the language than to the kind of certification requiring deep theoretical and/or technical knowledge offered by Cisco, CompTIA, or Oracle.
  20. Over 30 years of continuous development, even longer than Java.
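
As a quick illustration of items 3 and 7, here is a small sketch of map, filter, reduce, a lambda, and a list comprehension; the numbers are arbitrary.

    #! /usr/bin/python
    # Built-in functional tools and a list comprehension on a toy list.
    from functools import reduce  # a built-in in Python 2, in functools in Python 3

    nums = [1, 2, 3, 4, 5]
    doubled = list(map(lambda x: x * 2, nums))        # map with a lambda
    evens = list(filter(lambda x: x % 2 == 0, nums))  # filter
    total = reduce(lambda a, b: a + b, nums)          # reduce
    squares = [x * x for x in nums if x > 2]          # list comprehension
    print(squares)  # [9, 16, 25]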

Disadvantages

  1. Up to 1000 times slower than C, or around 100 times slower using Cython, yet still comparable to or better than Ruby on Rails and the other "replacement" languages.
  2. Many modules require a variety of unstated dependencies discovered at installation time.
  3. Like Java, many of the activities require modules.

Morning Joe: Are Nested List Comprehensions in Python Somehow Faster than Nested Loops?

As I learn more about Python at work, I have come across list comprehensions and lambdas. The list comprehension format is [x for x in iterable if condition] and the lambda format is lambda x, y: expression. The former seems extremely fast, but is it faster than a nested loop when nested in a similar way?
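
For concreteness, here is a tiny example of both forms with made-up values:

    # A list comprehension and an equivalent lambda-based version.
    nums = [1, 2, 3, 4]
    doubled_evens = [x * 2 for x in nums if x % 2 == 0]  # [4, 8]
    double = lambda x: x * 2
    doubled_evens_again = [double(x) for x in nums if x % 2 == 0]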

Logic dictates that the comprehension and the loop should have the same time complexity, scaling the same way as the nesting deepens. I performed a basic test on one, two, and four levels of nesting using the time module: a list of 100,000 integers is created, and then each pass moves down the list, adding a random number to each slot (a simple task, but I need to get to work). The triple-nested case is skipped so that the jump in nesting has a visible effect.

The programs are run 32 times apiece (enough samples for the Central Limit Theorem to apply) and averaged. The test computer is a Lenovo laptop with a 2.0 GHz dual core processor and 4 GB of RAM. The results are below.

One Loop

    #! /usr/bin/python
    # Timing harness: build a 100,000-element list of random ints, then time
    # a single pass that adds a random number to each element, 32 times each.
    import random
    import time

    #comprehension
    avg=0
    for i in range(0,32):
        list=[]
        for j in range(0,100000):
            list.append(random.randint(1,1000))

        start=time.time()
        list=[x+random.randint(1,100) for x in list]
        avg+=(time.time()-start)

    print "Comprehension Avg. Time "+str(avg/32)

    #loop
    avg=0
    for i in range(0,32):
        list=[]
        for j in range(0,100000):
            list.append(random.randint(1,1000))
        start=time.time()
        for k in range(0,len(list)):
            list[k]=list[k]+random.randint(1,100)
        avg+=time.time()-start

    print "Loop Avg. Time: "+str(avg/32)

Loop Time: 0.24775519222 (seconds)
Comprehension Function Time: 0.246111199926 (seconds)

Doubly Nested Loop
Due to time constraints, I have only included the core loop code; the timing harness remains the same.

     #comprehension
     list=[x+random.randint(1,100) for x in [y+random.randint(1,100) for y in list]]

     #loop
     for i in range(0,2):
            for j in range(0,len(list)):
                list[j]=list[j]+random.randint(1,100)

Loop Time: 0.502061881125 (seconds)
Comprehension Function Time: 0.451432295144 (seconds)

Quadruple Nested Loop

     #comprehension
     list=[x+random.randint(1,100) for x in [y+random.randint(1,100) for y in [z+random.randint(1,100) for z in [a+random.randint(1,100) for a in list]]]]

     #loop
     for i in range(0,2):
            for j in range(0,2):
                for k in range(0,len(list)):
                    list[k]=list[k]+random.randint(1,100)

Loop Time: 1.08803078532 (seconds)
Comprehension Function Time: 0.951290503144 (seconds)

Conclusion
As the timing measurements suggest, the time for each approach goes up with each level of nesting. However, the differences are extremely small. If either one has an advantage, it seems to be the comprehension, but the difference is too small to declare a winner; it could be due to the random number generation or other factors. The slight gap does seem to grow with each level of nesting, though. I would perform a more comprehensive test before making any determinations, but comprehensions are faster to write anyway.
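
One way to run that more comprehensive test would be the standard library's timeit module, which handles the repetition itself; the snippet below is a rough sketch of the idea and was not part of the original benchmark.

    #! /usr/bin/python
    # Sketch: comparing the two forms with timeit instead of a hand-rolled timer.
    import timeit

    setup = "import random; nums = [random.randint(1, 1000) for _ in range(100000)]"

    comp = timeit.timeit("[x + 1 for x in nums]", setup=setup, number=100)
    loop = timeit.timeit("for i in range(len(nums)): nums[i] = nums[i] + 1",
                         setup=setup, number=100)

    print("comprehension: %f  loop: %f" % (comp, loop))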