Private Immutable Configuration Hack in Python

Pyhton is notoriously not secure by default. However, it is still possible to generate a level of security through name mangling and other means. This article explains how to use name mangling, the singleton pattern, and class methods to create more secure access to configuration in Python3.

Singleton and Class Methods

Setting up a singleton in Python is simple:

class ChatConfig():

    __config = None

    class __Config:

        def __init__(self, config):
            self.config = config

    def set_config(cls, config):
        if cls.__config is None:
            cls.__config = config

    def get_config(cls):
        return cls.__config


The internal class and variable are mangled so as to make the variable itself private. This allows a single configuration to exist across the different packages while keeping the internal variable private and allowing for the variable to be set only once.


Python is not entirely insecure. It only takes code. This article offers an example of a way to set a variable once and share the setup among multiple packages.


ETL 1 Billion Rows in 2.5 Hours Without Paying on 4 cores and 7gb of RAM

There are a ton of ETL tools in the world. Alteryx, Tableau, Pentaho. This list goes on. Out of each, only Pentaho offers a quality free version. Alteryx prices can reach as high as $100,000 per year for a six person company and it is awful and awfully slow. Pentaho is not the greatest solution for streaming ETL either as it is not reactive but is a solid choice over the competitors.

How then, is it possible to ETL large datasets, stream on the same system from a TCP socket, or run flexible computations at speed. Surprisingly, this article will describe how to do just that using Celery and a tool which I am currently working on, CeleryETL.


Python is clearly an easy language to learn over others such as Scala, Java, and, of course, C++. These languages handle the vast majority of tasks for data science, AI, and mathematics outside of specialized languages such as R. They are likely the front runners in building production grade systems.

In place of the actor model popular with other languages, Python, being more arcane and outdated than any of the popular languages, requires task queues. My own foray into actor systems in Python led to a design which was, in fact, Celery backed by Python’s Thespian.

Celery handles tasks through RabbitMQ or other brokers claiming that the former can achieve up to 50 million messages per second. That is beyond the scope of this article but would theoretically cause my test case to outstrip the capacity of my database to write records. I only hazard to guess at what that would do to my file system.

Task queues are clunky, just like Python. Still, especially with modern hardware, they get the job done fast, blazingly fast. A task is queued with a module name specified as modules are loaded into a registry at run time. The queues, processed by a distributed set of workers running much like an actor in Akka, can be managed externally.

Celery allows for task streaming through chains and chords. The technical documentation is quite extensive and requires a decent chunk of time to get through.

Processing at Speed

Processing in Python at speed requires little more than properly chunking operations, batching record processing appropriately to remove latency, and performing other simple tasks as described in the Akka streams documentation. In fact, I wrote my layer on Celery using the Akka streams play book.

The only truly important operation, chunk your records. When streaming over TCP, this may not be necessary unless TCP connections happen extremely rapidly. Thresholding in this case may be an appropriate solution. If there are more connection attempts than can be completed at once, buffer requests and empty the buffer appropriately upon completion of each chain. I personally found that a maximum bucket size of 1000 for typical records was appropriate and 100 for large records including those containing text blobs was appropriate.

Take a look at my tool for implementation. However, I was able to remap,  split fields to rows, perform string operations, and write to my Neo4J graph database at anywhere from 80,000 to 120,000 records per second.


While this article is shorter than my others, it is something I felt necessary to write in the short time I have to write it. This discovery allows me to write a single language system through Celery, Neo4J, Django, PyQt, and PyTorch for an entire company. That, is phenomenal and only rivaled by Scala which is, sadly, dying despite being a far superior, faster, and less arcane language. By all measures, Scala should have won over the data science community but people detest the JVM. Until this changes, there is Celery.


The Very Sad and Disturbing State of JVM Based Sparse Matrix Packages

Big data is the rage, distribution is the rage, and so to is the growth of streaming data. The faster a company is, the better. Such speed requires, no demands, solid matrix performance. Worse yet, big data is inherently sparse and testing and implementation of new algorithms requires sparse matrices (CSR,CSC, COO; the like). Sadly, Java is not up to the task.

Let’s revisit some facts. Java is faster than Python at its core. Many tasks require looping over data in ways numpy or scipy simply do not support. A recent benchmark on Python3 v. Java highlights this. Worse, Python2 and Python3 use the global interpreter lock (GIL) making attempts at speed through concurrency often slower than single threading and forcing developers to use the dreaded multiprocessing (large clunky programs using only as many processes as cores). Still, multiprogramming and other modern operating systems concepts are helping Python achieve better albeit still quite slow speeds.

That said, Numpy and Scipy are the opposite of any of these things. They require the slower Cython but are written in C, performing blazingly fast, and leave all Java matrix libraries in the dust. In fact, in attempting to implement some basic machine learning tasks, I found myself not just writing things like text tiling which I fully expected to do but also starting down the path of creating a full fledged sparse matrix library with hashing library.

The following is the sad state of my Matrix tests.

The Libraries

The following libraries were used in the test:

The Cosines Test

An intensive test of a common use case is the calculation of the dot product (a dot b, a * b.t). Taking this result and dividing by norm(a)*norm(b) yields the cosine of pheta.

This simple test includes multiplication, transposing, and mapping division across all active values.

The Machine and Test Specs

The following machine specifications held:

  • CPU : Core i3 2.33 ghz
  • RAM : 8 GB (upgraded on laptop)
  • Environment: Eclipse
  • Test: Cosine Analysis
  • Languages: Scala and Python(scipy and numpy only)
  • Iterations: 32
  • Alloted Memory: 7gb either with x64 Python3 or -Xmx7g
  • Average: Strict non-smoothed average

The Scipy/Numpy Gold Standard

Not many open source libraries can claim the speed and power of the almighty Scipy and Numpy. The library can handle sparse matrices with m*n well over 1,000,000. More telling, it is fast. The calculation of the cosines is an extremely common practice in NLP and is a valuable similarity metric in most circumstances.

import scipy.sparse
import scipy.sparse.linalg

mat = sparse.rand(1000,50000,0.15)

Result : 5.13 seconds

The Results

The following resulted from each library:

  • Breeze : Crashes with Out of Memory Error (developer notified) [mat * mat.t]
  • UJMP : 128.73 seconds
  • MTJ : 285.13 seconds
  • La4j : 420.2 seconds
  • BidMach : Crashes (sprand(1000,50000,0.15))

Here is what the author of Breeze had to say. Rest assured, Numpy has been stable for over a decade now with constant improvements.


Java libraries are slow and seemingly undeserving of praise. Perhaps, due to the potential benefits of not necessarily distributing every calculation, they are not production grade. Promising libraries such as Nd4J/Nd4s still do not have a sparse matrix library and have claimed for some time to have one in the works. The alternatives are to use Python or program millions of lines of C code. While I find C fun, it is not a fast language to implement. Perhaps, for now Python will do. After all, PySpark is a little over 6 months old.

Python Unicode to UTF-8 Replacement Dictionary

I recently found an increasing need to replace Unicode characters with their English equivalents. This is in response to use of the ISO 8895-1 character set in html. Below is my dictionary for doing so and a code snippet for using it.

{"\x2013":"-","\x2014":"--","\x2018":"'","\x2019":"'","\x201A":",","\x201D":"~","\x2022":"*","\x2026":"...","\x2030":"%","\x2032":"'","\x2033":"`","\x2039":"","\x203E":"--","\x2044":"/","\x20AC":" euro ","\x2111":"i","\x2118":"P","\x2122":" TM ","\x2135":" alef ","\x2190":"","\x2193":" down-arrow ","\x2194":"","\x21B5":" crarr ","\x21D0":"","\x21D4":"","\x2200":"ALL","\x2202":" part ","\x2203":"EVERY","\x2205":"empty-set","\x2207":"nabla","\x2208":"isin","\x2209":"notin","\x2217":"*","\x221A":"sqrt","\x2329":"","\x25CA":" loz ","\x2660":"spades","\x2663":"clubs","\x2665":"hearts","\x2666":"diamonds","\x200C":" zwnj ","\x200D":" zwj ","\x200E":" lrm ","\x200F":" rlm ","\x27":"'","\xc2|\xA0|\x2002|\x2003|\x2009":" ","\x3E":">","\x3C":"> ","\xBC":"1/4","\xBD":"1/2","\xBE":"1/4","\xBE":"3/4","\xBF":" iquest "}

Multiple entities and a larger dictionary are provided below after an update to this function.

The code:

    def def encodeHTML(self,html,foreignKeys=None,replaceNonPrintable=False,multiEntities={"\xe2\x81\x91":"**","\xe2\x81\x95":"*","\xe2\x81\x97":'""',"\xe2\x81\xa0|\xe2\x80\x8b|\xe2\x80\x8c|\xe2\x80\x8d|\xe2\x80\x8e|‏\xe2\x80\x8f":"","\xe2\x80\x86|\xe2\x80\x87":"   ","\xe2\x80\x84|\xe2\x80\x85|\xe2\x80\x88":"  ","\xe2\x80\x8a|\xe2\x80\x89|\xe2\x80\x80|\xe2\x80\x81|\xe2\x80\x82|\xe2\x80\x82|\xe2\x80\x83":" ","\xe2\x80\x93|\xe2\x80\x92|\xe2\x80\x91|\xe2\x80\x90":"-","\xe2\x80\x96":"||","\xe2\x80\x95|\xe2\x80\x94":"--","\xe2\x81\x87":"??","\xe2\x81\x88":"?!","\xe2\x81\x89":"!?","\xe2\x81\x9d|\xe2\x81\x9e":":","\xe2\x81\x92":"-","\xe2\x81\x8b":" PILCROW ","\xe2\x80\xbc":"!!","\xe2\x80\xba":">","\xe2\x80\xb9":"<","\xe2\x80\xb8":"^","\xe2\x80\xb1":"%000","\xe2\x80\xb0":"%0","\xe2\x80\xa4|\xe2\x80\xa7":".","\xe2\x80\xa5":"..","\u2013":"-","\u2014":"--","\u2018":"'","\u2019":"'","\u201A":",","\u201D":"~","\u2022|\xe2\x80\xa3|\xe2\x80\xa2":"*"},replaceEntities={"\u2026":"...","\u2030":"%","\u2032":"'","\u2033":"`","\u2039":"","\u203E":"--","\u2044":"/","\u20AC":" euro ","\u2111":"i","\u2118":"P","\u2122":" TM ","\u2135":" alef ","\u2190":"","\u2193":" down-arrow ","\u2194":"","\u21B5":" crarr ","\u21D0":"","\u21D4":"","\u2200":"ALL","\u2202":" part ","\u2203":"EVERY","\u2205":"empty-set","\u2207":"nabla","\u2208":"isin","\u2209":"notin","\u2217":"*","\u221A":"sqrt","\u2329":"","\u25CA":" loz ","\u2660":"spades","\u2663":"clubs","\u2665":"hearts","\u2666":"diamonds","\u200C":" zwnj ","\u200D":" zwj ","\u200E":" lrm ","\u200F":" rlm ","\u27":"'","\xc2|\xA0|\u2002|\u2003|\u2009":" ","\x3E":">","\x3C":"> ","\uBC":"1/4","\xBD":"1/2","\xBE":"1/4","\xBE":"3/4","\xBF":" iquest "}):
        Encode HTML. Unfortunately
        *Required Parameters:
        :param html: html to run replacements on
        *Optional Parameters*
        :param multiEntities: entities represented my multiple unicode hex numbers 
        :param replaceNonPrintable: replace non printable characters after all other sets and encodings complete
        :param foreignKeys: dictionary of mapping to keys not in replaceEntities (for which are not in the default dict such as foreign letters e.g. {"\xF8":"[oslash]"} )
        :param replaceEntities: a list of entities to replace such as ,copyright symboles, micro; etc. that may be in the ISO or other format converted to unicode Hex formats (see non Latin characters at
        if multiEntities is not None:
            for k in multiEntities.keys():
        if replaceEntities is not None:
            for k in replaceEntities.keys():
        if replaceNonPrintable is True:
            import string
            html=filter(lambda x: x in string.printable,html)    
        if foreignKeys is not None:
            for k in foreignKeys.keys():
        return html

Morning Joe/Python PDF Part 3: Straight Optical Character Recognition

*Due to time constraints, I will be publishing large articles on the weekends with a daily small article for the time being.

Now, we start to delve into the PDF Images since the pdf text processing articles are quite popular. Not everything PDF is capable of being stripped using straight text conversion and the biggest headache is the PDF image. Luckily, our do “no evil” (heavy emphasis here) friends came up with tesseract, which, with training, is also quite good at breaking their own paid Captcha products to my great amusement and company’s profit.

A plethora of image pre-processing libraries and a bit of post-processing are still necessary when completing this task. Images must be of high enough contrast and large enough to make sense of. Basically, the algorithm consists of pre-processing an image, saving an image, using optical character recognition, and then performing clean-up tasks.

Saving Images Using Software and by Finding Stream Objects

For linux users, saving images from a pdf is best done with Poplar Utils which comes with Fedora,CentOS, and Ubuntu distributions and saves images to a specified directory. The command format is pdfimages [options] [pdf file path] [image root] . Options are included for specifying a starting page [-f int], an ending page [-l int], and more. Just type pdfimages into a linux terminal to see all of the options.

pdfimages -j /path/to/file.pdf /image/root/

To see if there are images just type pdfimages -list.

Windows users can use a similar command with the open source XPdf.

It is also possible to use the magic numbers I wrote about in a different article to find the images while iterating across the pdf stream objects and finding the starting and ending bytes of an image before writing them to a file using the commands from open().write(). A stream object is the way Adobe embeds objects in a pdf and is represented below. The find command can be used to ensure they exist and the regular expression command re.finditer(“(?mis)(?<=stream).*?(?=endstrem)",pdf) will find all of the streams.


....our gibberish looking bytes....



Python offers a variety of extremely good tools via pillow that eliminate the need for hard-coded pre-processing as can be found with my image tools for Java.

Some of the features that pillow includes are:

  1. Blurring
  2. Contrast
  3. Edge Enhancement
  4. Smoothing

These classes should work for most pdfs. For more, I will be posting a decluttering algorithm in a Captcha Breaking post soon.

For resizing,OpenCV includes a module that avoids pixelation with a bit of math magic.

#! /usr/bin/python

import cv2



OCR with Tesseract

With a subprocess call or the use of pytesser (which includes faster support for Tesseract by implementing a subprocess call and ensuring reliability), it is possible to OCR the document.

#! /usr/bin/python

from PIL import Image

import pytesser"fpath")


If the string comes out as complete garbage, try using the pre-processing modules to fix it or look at my image tools for ways to write custom algorithms.


Unfortunately, Tesseract is not a commercial grade product for performing PDF OCR. It will require post processing. Fortunately, Python provides a terrific set of modules and capabilities for dealing with data quickly and effectively.

The regular expression module re, list comprehension, and substrings are useful in this case.

An example of post-processing would be (in continuance of the previous section):

import re


lines=[x for x in lines if "bad stuff" not in x]


for line in lines:

if"pattern ",line):

results.append(re.sub("bad pattern","replacement",line))


It is definitely possible to obtain text using tesseract from a PDF. Post-processing is a requirement though. For documents that are proving difficult, commercial software is the best solution with Simple OCR and Abby Fine Reader offering quality solutions. Abby offers the best quality in my opinion but comes at a high price with a quote for an API for just reading PDF documents coming in at $5000. I have managed to use the Tesseract approach successfully at work but the process is time consuming and not guaranteed.

Morning Joe: Python-Is the Language Everyone is Trying to Create Already Around?

NodeJs, Ruby on Rails, VB Script and more are all seemingly targeted at creating an easier to code platform to replace Java,C++, and C# as dominant languages in certain spheres such as web and server development where quick deployment may mean more than a faster running language (Java and C# run at the same speed by the way). Each is hailed as a smaller, more compact language and yet each is fairly slow. It appears that the real title of simple to write replacement language that runs well on today’s fast and memory intensive machines came around in 1990. Meet Python, the utility knife language that can claim the title of “already done it” or “already on it” for just about every supposed advancement in programming for over the past 34 years. It is a strong scientific language in addition to being quick to deploy and easy to learn. Rather than bring about a comparison to the board game Othello, a second to learn but a lifetime to master, it is truly a language for all comers

The language is compact, weak typed, and can replace just about any Java code in half or even less lines of code.


  1. Small and compact: most Python code is aimed at quick and easy deployment. Modules exist for just about anything (see the Python Package library or even your local Linux yum or rpm repository)
  2. Cross-platform
  3. Map, filter, and Reduce are already included in the standard modules.
  4. Weak typing allows for the guessing of types and reuse of variables more easily
  5. It supports nearly every server implementation and has for decades, from Apache to Nginx
  6. It works with Spring (included already in 2.7 or higher) and has a dedicated web framework in Djanjo
  7. Lambdas and list comprehension have existed well before Java 8, reducing code such as for loops to a single line.
  8. Python is both a scripting and object oriented language, giving access well beyond running code from the command line.
  9. Major organizations and key players have created applications ranging from the Google Apps Engine to OpenCV’s artificial intelligence algorithms and PyCuda for accessing the NVidia graphics card
  10. Engineers and scientists are increasingly using the language with microprocessor boards available at low cost (Raspberry Pi and the MicroPy boards are two of the Python based microprocessor boards)
  11. Modules are easy to deploy since all programs are already modules, much like with java classes accessing each other.
  12. Code can be deployed without actually pre-compiling anything (a benefit for quick access and a curse for runtime speed).
  13. Has extensive support for statistical and mathematical programming as well as graphing and image manipulation with MatPlotLib,OpenCv, Numpy, and SciPy
  14. Can manage and create subprocesses easily.
  15. Network Programming, web server programming with frameworks or cgi/fcgi, multi-processing, and other server development are less code intensive compared to C# and Java
  16. Installing new plugins can be done without an internet search from easy_install or the PIP repository and, on linux, from the rpm or yum repositories.
  17. Includes support for running other languages or can be used to control other runtime environments,the JVM included, using Cython for C,IronPython for .NET, and Jython for Java.
  18. Modules are available for most if not all databases from Oracle and PostgreSQL to MySQL as well as applications for big data such as Cassandra
  19. Developed by an open source community and monitored by a committee with a budding certification track co-developed by the publisher O’Reilly and the chair of the Python committee. The University of Washington also offers a certificate. I would consider this as more of a testament to the popularity and extensive coverage of the language as opposed to the solid certifications requiring extensive theoretical and/or technical knowledge from Cisco, CompTIA, or Oracle.
  20. Over 30 years of continuous development,even more than Java.


  1. 1000 times slower than C or 100 times slower using CPython yet still comparable to or better than Ruby on Rails and the other “replacement” languages.
  2. Many modules require a variety of unstated dependencies discovered at installation time.
  3. Like Java, many of the activities require modules.

Morning Joe: Are Nesting Loop Comprehenders in Python Somehow Faster than Nested Loops?

As I learn more about Python at work, I have come across list comprehension functions and lambdas. The list comprehender format is [x for x in [] if condition] and the lambda format is (lambda x,y: function, x,y). The former seems extremely fast but is it faster than a nested loop when combined in a similar way?

Logic dictates that both will run in O(n^2) time. I performed a basic test on one, two, and four loops using the time module where a list of 100000 integers is created and then an iteration moves down the list, adding a random number to each slot (pretty easy but I need to get to working). Obviously, the third loop is skipped so that the triple nesting actually has an effect.

The programs are run 32 times apiece (CLT numbers), and averaged. The test computer is a Lenovo laptop with a 2.0 ghz dual core processor and 4 gb of RAM. The results are below.

One Loop

    for i in range(0,32):
        for i in range(0,100000):
        list=[x+random.randint(1,100) for x in list]
    print "Comprehension Avg. Time "+str(avg/32)
    for i in range(0,32):
        for i in range(0,100000):
        for i in range(0,len(list)):
    print "Loop Avg. Time:"+str(avg/32)

Loop Time: .24775519222(seconds)
Comprehension Function Time: .246111199926 (seconds)

Doubly Nested Loop
In the presence of a time constraint, I have only included the loop code. The remaining code remains the same.

     list=[x+random.randint(1,100) for x in [y+random.randint(1,100) for y in list]]

     for i in range(0,2):
            for j in range(0,len(list)):

Loop Time: 0.502061881125 (seconds)
Comprehension Function Time: 0.451432295144 (seconds)

Quadruple Nested Loop

     list=[x+random.randint(1,100) for x in [y+random.randint(1,100) for y in [z+random.randint(1,100) for z in [a+random.randint(1,100) for a in list]]]]

     for i in range(0,2):
            for j in range(0,2):
                for k in range(0,len(list)):

Loop Time: 1.08803078532 (seconds)
Comprehension Function Time: .951290503144(seconds)

As the time complexity measurements suggest, the time for each goes up with each level of nesting. However, the differences are extremely small. If any one has an advantage, it seems to be the comprehension but that difference is to small to declare a winner. That difference could be due to the random number function or other factors. The slight difference does seem to grow with each loop though. I would perform a more comprehensive test before making any determinations but comprehensions are faster to write anyway.