Private Immutable Configuration Hack in Python

Python is notoriously not secure by default. However, it is still possible to achieve a degree of security through name mangling and other means. This article explains how to use name mangling, the singleton pattern, and class methods to create more secure access to configuration in Python3.

Singleton and Class Methods

Setting up a singleton in Python is simple:

class ChatConfig():

    __config = None

    class __Config:

        def __init__(self, config):
            self.config = config

    @classmethod
    def set_config(cls, config):
        if cls.__config is None:
            cls.__config = config


    @classmethod
    def get_config(cls):
        return cls.__config

 

The internal class and variable are mangled so as to make the variable itself private. This allows a single configuration to exist across the different packages while keeping the internal variable private and allowing for the variable to be set only once.
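The set-once behavior can be checked with a short, self-contained sketch. `AppConfig` is a hypothetical stand-in for the class above, and wrapping the stored dictionary in `types.MappingProxyType` is an added assumption (it is not in the original code) that makes the stored mapping read-only. Note that name mangling only hides the attribute; it remains reachable under its mangled name, so this is a convention rather than a hard security boundary.

```python
from types import MappingProxyType


class AppConfig:
    """Hypothetical set-once configuration holder (illustration only)."""

    __config = None  # stored on the class as _AppConfig__config

    @classmethod
    def set_config(cls, config):
        # Only the first call has any effect; later calls are ignored.
        if cls.__config is None:
            # Copy, then wrap in a read-only view so callers cannot mutate it.
            cls.__config = MappingProxyType(dict(config))

    @classmethod
    def get_config(cls):
        return cls.__config


AppConfig.set_config({"host": "localhost", "port": 5672})
AppConfig.set_config({"host": "example.org"})  # ignored, already set

print(AppConfig.get_config()["host"])        # localhost
print(hasattr(AppConfig, "__config"))        # False: the plain name is hidden
print(AppConfig._AppConfig__config["port"])  # 5672: the mangled name is still reachable
```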

Conclusion

Python is not entirely insecure. It only takes code. This article offers an example of a way to set a variable once and share the setup among multiple packages.

Two Step Verification in a Flask REST App

Flask is great. It is simple, easy, and allows for lightning fast deployment. However, there are a few security problems that should be worked out before using it in production.

This article examines how to deploy two step verification and IP and MAC address tracking alongside JWT tokens in Flask.

Code for this article is on my Github.

OS and Hardware Security

Software is just a series of electrons floating around the Internet. Fans, special devices for man in the middle attacks, and general human ignorance can all circumvent good practices.

Some things that should be done prior to development are:

  • Assign proper roles to users with appropriate security measures
  • Set up iptables and other forms of firewall protection
  • Don’t randomly open ports to the world
  • Isolate unprotected devices from those handling highly secure data (a web server from your ETL servers for instance)
  • Ensure passwords are fairly secure (8-20 memorable characters avoiding certain others)
  • Use endpoint security such as RSA keys where appropriate

Proper Security

As with all things security, articles should not promote a version of encryption as secure or make claims using algorithms that could be rendered useless even as I write. All good algorithms sour.

I can, however, provide a list of algorithms to not use:

  • bcrypt
  • blowfish

Remember that most algorithms are eventually broken. The US government currently lists AES as usable, and PBKDF2 can render SHA512 useful. SHA512 is currently promoted as a good algorithm by NIST.

JWT in Flask

JWT tokens are useful in that they store the information necessary to keep a user logged in. They are great for single page applications where session tracking might be inappropriate. Know your use case.

A strong and configurable tool for implementing JWT in Flask is flask_jwt_extended, which works alongside the Flask-Security module.

Implementing JWT is fairly simple:

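from flask import Flask, jsonify, request
from flask_jwt_extended import JWTManager, create_access_token, jwt_required

app = Flask(__name__)
# a signing key is required; load it from real configuration in production
app.config['JWT_SECRET_KEY'] = 'change-me'
jwt = JWTManager(app)

@app.route('/login', methods=['POST'])
def login():
    # password verification is omitted here; see the Flask-Security snippet below
    username = request.json.get('username')
    access_token = create_access_token(identity=username)
    return jsonify(access_token=access_token)

@app.route('/is_working')
@jwt_required()  # flask_jwt_extended 4.x; earlier releases use @jwt_required
def is_working():
    return jsonify(Success=True), 200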

It appears that Flask-Security was recently fixed so that password hashing works appropriately once again.

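from flask_security import Security
from flask_security.utils import encrypt_password, verify_password

# datastore is a Flask-Security user datastore created elsewhere in the app
security = Security(app, datastore)
# later Flask-Security releases deprecate encrypt_password in favor of hash_password
pwd = encrypt_password("test")
if verify_password("test", pwd):
    print("Verified")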

Email Server

Before discussing two step verification, it is necessary to set up a test email server and be able to send emails. Python's standard library has offered a simple, configurable debugging mail server through the smtpd module (removed in Python 3.12 in favor of the third-party aiosmtpd), while smtplib is the client side. I personally just printed out the input, so I will not post the code here. The Python docs are a good place to get started.

Sending emails can be done through smtplib or Flask-Mail. The smtplib library will be more flexible.

The following sets up smtplib for sending an email:

import smtplib

....

host = email_config['host']
port = email_config['port']
email_server = smtplib.SMTP(host, port)
if email_config.get('ehlo', False):
    email_server.ehlo()
if email_config.get('start_tls', False):
    certfile = email_config.get('tls_cert', None)
    keyfile = email_config.get('tls_key', None)
    context = email_config.get('context', None)
    email_server.starttls(keyfile, certfile, context)
if email_config.get('user') and email_config.get('password'):
    user = email_config.get('user')
    password = email_config.get('password')
    email_server.login(user, password)
...
# sendmail takes the sender first, then a list of recipients
email_server.sendmail(sender, [recipient], msg.as_string())

Many different options are configurable using smtplib. These settings can be set using Flask-Mail but any code needed to help perform setup might be an issue.
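The snippet above sends `msg` without showing how it is built. A minimal sketch using the standard library `email.message.EmailMessage` follows. The function name `build_verification_email`, the addresses, and the subject line are all hypothetical, chosen only for illustration.

```python
from email.message import EmailMessage


def build_verification_email(sender, recipient, code):
    """Build a plain-text message carrying a verification code."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Your verification code"
    msg.set_content("Enter this code to finish logging in: {}".format(code))
    return msg


msg = build_verification_email("noreply@example.com", "user@example.com", "123456")
print(msg["To"])
# With a connected smtplib.SMTP instance (email_server above), either call sends it:
#   email_server.send_message(msg)
#   email_server.sendmail(msg["From"], [msg["To"]], msg.as_string())
```

`send_message` reads the sender and recipients from the headers, which avoids mixing up the argument order of `sendmail`.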

Two Step Verification

It is now possible to extend the login function to include multi step authorization. The important pieces of the puzzle are obtaining an IP and/or MAC address, verifying a password as shown, sending an email with a verification code, handling receipt of the code, and persistence.

Most of this is shown in my own open source project. This code uses uuid to generate a unique code:

import uuid
...
code = uuid.uuid4()

This code is hashed as before and stored using SQLAlchemy.

The basic process followed in my Github code is:

  1. Use login() to retrieve the JWT key and check for a matching MAC address and IP
  2. Send an email verification code as needed
  3. Validate the code through verify_ip_code and verify_mac_code and update the databases

The login function contains the majority of calls for two step verification.
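The steps above can be sketched without Flask or a database. This is not the code from the repository: the function names are hypothetical, the set of known devices stands in for the SQLAlchemy tables, and SHA-256 from `hashlib` is used only as a stand-in for whatever hash the application is configured with.

```python
import hashlib
import hmac
import uuid


def needs_second_step(known_devices, ip, mac):
    """True when this (ip, mac) pair has not been verified before."""
    return (ip, mac) not in known_devices


def new_verification_code():
    """Return (code to email to the user, hex digest to persist)."""
    code = uuid.uuid4().hex
    digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
    return code, digest


def verify_code(submitted, stored_digest):
    """Hash the submitted code and compare it to the stored digest in constant time."""
    candidate = hashlib.sha256(submitted.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)


known_devices = set()  # stand-in for persisted (ip, mac) pairs
if needs_second_step(known_devices, "203.0.113.7", "aa:bb:cc:dd:ee:ff"):
    code, digest = new_verification_code()  # email `code`, store `digest`
    if verify_code(code, digest):
        known_devices.add(("203.0.113.7", "aa:bb:cc:dd:ee:ff"))
print(len(known_devices))  # 1
```

Only the digest is stored, so a leaked table does not reveal codes that are still pending.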

Conclusion

This article examined the basics required to create two step verification in Python with Flask, using examples and code from my Github repository.

It is important to use the most up to date algorithms. This article made no attempt to recommend an encryption algorithm.

ETL 1 Billion Rows in 2.5 Hours Without Paying on 4 cores and 7gb of RAM

There are a ton of ETL tools in the world: Alteryx, Tableau, Pentaho. The list goes on. Of these, only Pentaho offers a quality free version. Alteryx prices can reach as high as $100,000 per year for a six person company, and it is awful and awfully slow. Pentaho is not the greatest solution for streaming ETL either, as it is not reactive, but it is a solid choice over the competitors.

How, then, is it possible to ETL large datasets, stream on the same system from a TCP socket, or run flexible computations at speed? This article describes how to do just that using Celery and a tool which I am currently working on, CeleryETL.

Celery

Python is clearly an easy language to learn over others such as Scala, Java, and, of course, C++. These languages handle the vast majority of tasks for data science, AI, and mathematics outside of specialized languages such as R. They are likely the front runners in building production grade systems.

In place of the actor model popular with other languages, Python, being more arcane and outdated than any of the popular languages, requires task queues. My own foray into actor systems in Python led to a design which was, in fact, Celery backed by Python’s Thespian.

Celery handles tasks through RabbitMQ or other brokers claiming that the former can achieve up to 50 million messages per second. That is beyond the scope of this article but would theoretically cause my test case to outstrip the capacity of my database to write records. I only hazard to guess at what that would do to my file system.

Task queues are clunky, just like Python. Still, especially with modern hardware, they get the job done fast, blazingly fast. A task is queued with a module name specified as modules are loaded into a registry at run time. The queues, processed by a distributed set of workers running much like an actor in Akka, can be managed externally.

Celery allows for task streaming through chains and chords. The technical documentation is quite extensive and requires a decent chunk of time to get through.

Processing at Speed

Processing in Python at speed requires little more than properly chunking operations, batching record processing appropriately to remove latency, and performing other simple tasks as described in the Akka streams documentation. In fact, I wrote my layer on Celery using the Akka streams play book.

The only truly important operation is chunking your records. When streaming over TCP, this may not be necessary unless connections arrive extremely rapidly. Thresholding may be an appropriate solution in that case: if there are more connection attempts than can be completed at once, buffer requests and empty the buffer upon completion of each chain. I personally found a maximum bucket size of 1000 appropriate for typical records and 100 appropriate for large records, including those containing text blobs.
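Chunking itself needs nothing beyond the standard library. The sketch below is not taken from CeleryETL; `chunked` is a hypothetical helper that groups any iterable into batches of a chosen maximum size, each of which could then be handed to a task queue as a single task.

```python
from itertools import islice


def chunked(records, size=1000):
    """Yield lists of at most `size` records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


batches = list(chunked(range(2500), size=1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Because the helper is a generator, it works on a stream without loading the full dataset into memory.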

Take a look at my tool for the implementation. With it, I was able to remap, split fields to rows, perform string operations, and write to my Neo4J graph database at anywhere from 80,000 to 120,000 records per second.

Conclusion

While this article is shorter than my others, it is something I felt necessary to write in the short time I have to write it. This discovery allows me to write a single language system through Celery, Neo4J, Django, PyQt, and PyTorch for an entire company. That is phenomenal and only rivaled by Scala which is, sadly, dying despite being a far superior, faster, and less arcane language. By all measures, Scala should have won over the data science community, but people detest the JVM. Until this changes, there is Celery.

 

The Very Sad and Disturbing State of JVM Based Sparse Matrix Packages

Big data is the rage, distribution is the rage, and so too is the growth of streaming data. The faster a company is, the better. Such speed requires, no, demands solid matrix performance. Worse yet, big data is inherently sparse, and testing and implementing new algorithms requires sparse matrices (CSR, CSC, COO, and the like). Sadly, Java is not up to the task.

Let’s revisit some facts. Java is faster than Python at its core. Many tasks require looping over data in ways numpy or scipy simply do not support. A recent benchmark on Python3 v. Java highlights this. Worse, Python2 and Python3 use the global interpreter lock (GIL) making attempts at speed through concurrency often slower than single threading and forcing developers to use the dreaded multiprocessing (large clunky programs using only as many processes as cores). Still, multiprogramming and other modern operating systems concepts are helping Python achieve better albeit still quite slow speeds.

That said, Numpy and Scipy are the opposite of any of these things. They run on the slower CPython but are written in C, perform blazingly fast, and leave all Java matrix libraries in the dust. In fact, in attempting to implement some basic machine learning tasks, I found myself not just writing things like text tiling, which I fully expected to do, but also starting down the path of creating a full fledged sparse matrix library with a hashing library.

The following is the sad state of my Matrix tests.

The Libraries

The libraries used in the test were Breeze, UJMP, MTJ, La4j, and BidMach.

The Cosines Test

An intensive test of a common use case is the calculation of the dot product (a dot b, or a * b.T). Taking this result and dividing by norm(a)*norm(b) yields the cosine of theta, the angle between a and b.

This simple test includes multiplication, transposing, and mapping division across all active values.
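To make the formula concrete, here is a small pure-Python sketch of the cosine for two sparse vectors. It is not the benchmark code: storing each vector as an `{index: value}` dictionary is an assumption made for illustration, and real workloads would use a sparse matrix library.

```python
import math


def dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(value * b[i] for i, value in a.items() if i in b)


def cosine(a, b):
    """cos(theta) = (a . b) / (norm(a) * norm(b)); 0.0 if either vector is empty."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


a = {0: 1.0, 3: 2.0}
b = {0: 2.0, 3: 4.0}  # same direction as a
c = {1: 5.0}          # shares no index with a
print(round(cosine(a, b), 6))  # 1.0
print(cosine(a, c))            # 0.0
```

Only indices present in both vectors contribute to the dot product, which is why sparsity makes the calculation cheap when the storage format supports it.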

The Machine and Test Specs

The following machine specifications held:

  • CPU : Core i3 2.33 GHz
  • RAM : 8 GB (upgraded on laptop)
  • Environment: Eclipse
  • Test: Cosine Analysis
  • Languages: Scala and Python (scipy and numpy only)
  • Iterations: 32
  • Allotted Memory: 7 GB either with x64 Python3 or -Xmx7g
  • Average: Strict non-smoothed average

The Scipy/Numpy Gold Standard

Not many open source libraries can claim the speed and power of the almighty Scipy and Numpy. The library can handle sparse matrices with m*n well over 1,000,000. More telling, it is fast. The calculation of the cosines is an extremely common practice in NLP and is a valuable similarity metric in most circumstances.

import scipy.sparse
import scipy.sparse.linalg

mat = scipy.sparse.rand(1000, 50000, 0.15)
# mat * mat.T is the sparse matrix product; the squared norm of mat scales the result
print((mat * mat.T) / pow(scipy.sparse.linalg.norm(mat), 2))

Result : 5.13 seconds

The Results

The following resulted from each library:

  • Breeze : Crashes with Out of Memory Error (developer notified) [mat * mat.t]
  • UJMP : 128.73 seconds
  • MTJ : 285.13 seconds
  • La4j : 420.2 seconds
  • BidMach : Crashes (sprand(1000,50000,0.15))

Here is what the author of Breeze had to say. Rest assured, Numpy has been stable for over a decade now with constant improvements.

Conclusion

Java libraries are slow and seemingly undeserving of praise. Perhaps, due to the potential benefits of not necessarily distributing every calculation, they are not production grade. Promising libraries such as Nd4J/Nd4s still do not have a sparse matrix library and have claimed for some time to have one in the works. The alternatives are to use Python or program millions of lines of C code. While I find C fun, it is not a fast language to implement. Perhaps, for now Python will do. After all, PySpark is a little over 6 months old.

A Proposed Improvement to Decision Trees, Sort of A Neural Net

Decision trees remove a lot of power from themselves by having a root node. They are trees after all, and who is to say that one starting point for most data holds for all data? There may be an alternate way to handle them, one I will test when I have time against an appropriate dataset such as legal case data. The proposed solution is simple: change the structure. Don’t use a decision tree, use a decision graph. Every node has the ability to be chosen. Also, don’t just create one vertex, create many vertices.

Feasibility

This is actually a more feasible solution than it appears. By training a neural net to choose the correct node to start with and specifying correct paths from each node, the old (n)*(n-1) edge relationship to the number of nodes doesn’t need to be a reality. This also retains a decision tree like shape. Basically, a decision graph.
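One way to picture the idea is the toy sketch below. It is purely illustrative and untested against real data: the node names, the features, and the `choose_start` function (a hand-written stand-in for the trained neural net) are all hypothetical.

```python
from collections import namedtuple

# test: function(sample) -> bool; if_true / if_false: name of the next node, or a final label.
Node = namedtuple("Node", ["test", "if_true", "if_false"])

NODES = {
    "is_long": Node(lambda s: s["length"] > 100, "has_citations", "short note"),
    "has_citations": Node(lambda s: s["citations"] > 0, "legal brief", "essay"),
}


def choose_start(sample):
    """Stand-in for the trained model that picks the entry node."""
    return "is_long" if "length" in sample else "has_citations"


def classify(sample, nodes=NODES, start=choose_start):
    """Walk the graph from the chosen entry node until a label (a non-node name) is reached."""
    current = start(sample)
    while current in nodes:
        node = nodes[current]
        current = node.if_true if node.test(sample) else node.if_false
    return current


print(classify({"length": 250, "citations": 3}))  # legal brief
print(classify({"citations": 0}))                 # essay
```

Each node lists only its own outgoing paths, so the edge count stays far below the (n)*(n-1) of a fully connected graph, and a sample missing a feature can still enter at a node that does not need it.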

Positives

We now have a more manual hybrid between a neural net, decision tree, and graph that still needs to be tested. However, we also have many different possible outcomes without relying on one root node. This offers a much better starting point and many more paths than a single root allows.

Negatives

This is still a manual process whose best outcome is probably connecting every node to every other node and training outcomes on the best path. It can also be extremely calculation intensive to get right.

Conclusion

For now, the conclusion is just that decision trees are limited and not really the ideal solution in most circumstances.

Python Unicode to UTF-8 Replacement Dictionary

I recently found an increasing need to replace Unicode characters with their English equivalents. This is in response to use of the ISO 8859-1 character set in HTML. Below is my dictionary for doing so and a code snippet for using it.

{"\u2013":"-","\u2014":"--","\u2018":"'","\u2019":"'","\u201A":",","\u201D":"~","\u2022":"*","\u2026":"...","\u2030":"%","\u2032":"'","\u2033":"`","\u2039":"","\u203E":"--","\u2044":"/","\u20AC":" euro ","\u2111":"i","\u2118":"P","\u2122":" TM ","\u2135":" alef ","\u2190":"","\u2193":" down-arrow ","\u2194":"","\u21B5":" crarr ","\u21D0":"","\u21D4":"","\u2200":"ALL","\u2202":" part ","\u2203":"EVERY","\u2205":"empty-set","\u2207":"nabla","\u2208":"isin","\u2209":"notin","\u2217":"*","\u221A":"sqrt","\u2329":"","\u25CA":" loz ","\u2660":"spades","\u2663":"clubs","\u2665":"hearts","\u2666":"diamonds","\u200C":" zwnj ","\u200D":" zwj ","\u200E":" lrm ","\u200F":" rlm ","\x27":"'","\xc2|\xA0|\u2002|\u2003|\u2009":" ","\x3E":">","\x3C":"<","\xBC":"1/4","\xBD":"1/2","\xBE":"3/4","\xBF":" iquest "}

Multiple entities and a larger dictionary are provided below after an update to this function.

The code:

    def encodeHTML(self,html,foreignKeys=None,replaceNonPrintable=False,multiEntities={"\xe2\x81\x91":"**","\xe2\x81\x95":"*","\xe2\x81\x97":'""',"\xe2\x81\xa0|\xe2\x80\x8b|\xe2\x80\x8c|\xe2\x80\x8d|\xe2\x80\x8e|\xe2\x80\x8f":"","\xe2\x80\x86|\xe2\x80\x87":"   ","\xe2\x80\x84|\xe2\x80\x85|\xe2\x80\x88":"  ","\xe2\x80\x8a|\xe2\x80\x89|\xe2\x80\x80|\xe2\x80\x81|\xe2\x80\x82|\xe2\x80\x82|\xe2\x80\x83":" ","\xe2\x80\x93|\xe2\x80\x92|\xe2\x80\x91|\xe2\x80\x90":"-","\xe2\x80\x96":"||","\xe2\x80\x95|\xe2\x80\x94":"--","\xe2\x81\x87":"??","\xe2\x81\x88":"?!","\xe2\x81\x89":"!?","\xe2\x81\x9d|\xe2\x81\x9e":":","\xe2\x81\x92":"-","\xe2\x81\x8b":" PILCROW ","\xe2\x80\xbc":"!!","\xe2\x80\xba":">","\xe2\x80\xb9":"<","\xe2\x80\xb8":"^","\xe2\x80\xb1":"%000","\xe2\x80\xb0":"%0","\xe2\x80\xa4|\xe2\x80\xa7":".","\xe2\x80\xa5":"..","\u2013":"-","\u2014":"--","\u2018":"'","\u2019":"'","\u201A":",","\u201D":"~","\u2022|\xe2\x80\xa3|\xe2\x80\xa2":"*"},replaceEntities={"\u2026":"...","\u2030":"%","\u2032":"'","\u2033":"`","\u2039":"","\u203E":"--","\u2044":"/","\u20AC":" euro ","\u2111":"i","\u2118":"P","\u2122":" TM ","\u2135":" alef ","\u2190":"","\u2193":" down-arrow ","\u2194":"","\u21B5":" crarr ","\u21D0":"","\u21D4":"","\u2200":"ALL","\u2202":" part ","\u2203":"EVERY","\u2205":"empty-set","\u2207":"nabla","\u2208":"isin","\u2209":"notin","\u2217":"*","\u221A":"sqrt","\u2329":"","\u25CA":" loz ","\u2660":"spades","\u2663":"clubs","\u2665":"hearts","\u2666":"diamonds","\u200C":" zwnj ","\u200D":" zwj ","\u200E":" lrm ","\u200F":" rlm ","\x27":"'","\xc2|\xA0|\u2002|\u2003|\u2009":" ","\x3E":">","\x3C":"<","\xBC":"1/4","\xBD":"1/2","\xBE":"3/4","\xBF":" iquest "}):
        '''
        Encode HTML. Unfortunately
        
        *Required Parameters:
        
        :param html: html to run replacements on
        
        *Optional Parameters*
        :param multiEntities: entities represented by multiple unicode hex numbers
        :param replaceNonPrintable: replace non printable characters after all other sets and encodings complete
        :param foreignKeys: dictionary mapping keys that are not in replaceEntities, such as foreign letters, to replacements (e.g. {"\xF8":"[oslash]"} )
        :param replaceEntities: a dictionary of entities to replace, such as copyright symbols, micro, etc., that may be in the ISO or other format converted to unicode hex formats (see non Latin characters at https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references)
         
        '''
        if multiEntities is not None:
            for k in multiEntities.keys():
                html=re.sub(k,multiEntities[k],html)
                
        if replaceEntities is not None:
            for k in replaceEntities.keys():
                html=re.sub(k,replaceEntities[k],html)
        
        html=HTMLParser.HTMLParser().unescape(Soup(Soup(html).encode()).prettify())
        
        
        if replaceNonPrintable is True:
            import string
            html=filter(lambda x: x in string.printable,html)    
        
        if foreignKeys is not None:
            for k in foreignKeys.keys():
                html=re.sub(k,foreignKeys[k],html)
        
        return html
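The function above targets Python 2 byte strings. For Python 3 text, where every key is a single character, a much smaller sketch using `str.translate` covers the same idea. The table below holds only a handful of entries from the dictionary, and `to_ascii` is a hypothetical name rather than part of the original code.

```python
import string

# A few entries from the dictionary above, keyed by single characters.
REPLACEMENTS = {
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # no-break space
    "\u00bd": "1/2",  # one half
}
TABLE = str.maketrans(REPLACEMENTS)


def to_ascii(text, replace_non_printable=False):
    """Apply the replacement table, then optionally drop non-printable characters."""
    text = text.translate(TABLE)
    if replace_non_printable:
        text = "".join(ch for ch in text if ch in string.printable)
    return text


print(to_ascii("It\u2019s \u00bd full \u2013 roughly\u2026"))
# It's 1/2 full - roughly...
```

`str.translate` makes one pass over the text regardless of how many entries the table holds, whereas the loop above runs one `re.sub` per key.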

Python PDF 3: Writing With HTML and XML

I have discovered the potent mixture of Jinja, weasyprint and Pandas. Mixing these tools with matplotlib and Python image modules yields a way to write PDF documents with relative ease and with the styling help of HTML. It would also be possible to use a tool like xmltopdf for generating pdf files from XML. Previous posts dealt with this using a more complicated tool, PyPDF2.

A Basic HTML Template

In this tutorial, I am using jinja to create tables. My tables will not have much in the way of styling but it is also possible to add styles with jinja or by using a tool such as Django-Tables2. Both tools are incredibly similar to the Django platform.

A template is needed in order to generate HTML pages for conversion to pdf format. Jinja follows a basic format with double curly braces used to mark where items are entered encapsulating the title of the property.


<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="UTF-8">
<title>{{ title }}</title>
</head>
<body>
<div>
<h1>Weekly Summary Report</h1>
{{ summary_pivot_table }}
</div>

<div>
<h1>Frequency Report</h1>
{{ frequency_table }}
</div>

<div>
<h1>Weekly Source Reports</h1>
{{ source_pivot_table }}
</div>
</body>
</html>

In this case, there is a title and three reports. It would be easy to add CSS tags and generate different styles using the division tags. These will be converted by weasyprint later.

Writing to the Template

Writing to a template with Jinja requires using the dictionary data structure.

adminVars={"title":"Weekly Statistics","frequency_table":freqFrame.to_html(),"summary_pivot_table":sframe.to_html(),"source_pivot_table":tframe.to_html()}

Generating Data

Generating data is simple with Pandas. This is especially true with databases. One only needs to connect to a database using a SQLAlchemy engine and perform any necessary query. It is also possible to concatenate as many queries as necessary to generate a table.

import sqlalchemy
import pandas

#create alchemy engine      

#cfp is the Config object from crawleraids, created as in the full example further down
dsn=('postgresql+psycopg2://'+cfp.getVar("db","user","string")+":"+cfp.getVar("db","passw","string")+"@"+cfp.getVar("db","host","string")
     +":"+cfp.getVar("db","port","string")+"/"+cfp.getVar("db","dbname","string"))

engine=sqlalchemy.create_engine(dsn)
        
#get totals table
query=cfp.getVar("sql","weekly_totals_complete","string")
tframe=pandas.read_sql_query(query,engine)

Concatenation is not difficult either using the concat function.

pandas.concat([pandas.read_sql_query(query,engine) for query in tables])

New columns will be generated with NaN values.

Performing Basic Operations on Dataframes

Performing operations on dataframes is easy with numpy or scipy.

import numpy

#operate on tframe from above
tframe.apply(numpy.average,axis=0)

Dataframes themselves have operations that can be performed on them, and these use numpy.

#tframe from above
tframe.mean()

A list of operations is provided in the Pandas documentation.

More complicated operations may require unpacking the values or using generator functions.

Using WeasyPrint

Once the resources and template are prepared, simply call on WeasyPrint to convert the html resulting from the template to a PDF.

An extra import is needed to fetch resources such as images from links embedded within the url.

Otherwise, generate a pandas data frame, convert the frame to html, place it as the value attached to the appropriate template key in your dictionary, and then convert. The example code uses SQLAlchemy to fetch resources from a PostgreSQL database.

from crawleraids.ConfigVars import Config
from jinja2 import Environment,FileSystemLoader
import pandas
from weasyprint import HTML,default_url_fetcher
import sqlalchemy

def fetchURL(url):
    '''
    Provide a resource obtainer for getting urls to WeasyPrint
    '''
    return default_url_fetcher(url)


def generatePDF(fpath):
    '''
    Generate the pdf.
    '''
    #create alchemy engine
    cfp=Config(fpath)
    dsn='postgresql+psycopg2://'+cfp.getVar("db","user","string")+":"+cfp.getVar("db", "passw","string")+"@"+cfp.getVar("db","host","string")+":"+cfp.getVar("db","port","string")+"/"+cfp.getVar("db","dbname","string")
    engine=sqlalchemy.create_engine(dsn)

    #get totals table
    query=cfp.getVar("sql","weekly_totals_complete","string")
    tframe=pandas.read_sql_query(query,engine)

    #get summary stats table
    query=cfp.getVar("sql","weekly_summary_complete","string")
    sframe=pandas.read_sql_query(query,engine)

    #get frequencies
    query=cfp.getVar("sql","weekly_frequency","string")
    freqFrame=pandas.read_sql_query(query,engine)

    #get the resource loader
    env=Environment(loader=FileSystemLoader('.'))
    template=env.get_template("aboveTemplate.html")

    #fill the template
    templateVars={"title":"Weekly Statistics","frequency_table":freqFrame.to_html(),"summary_pivot_table":sframe.to_html(),"source_pivot_table":tframe.to_html()}
    html=template.render(templateVars)

    #use WeasyPrint to convert to pdf; stylesheets expects a list
    HTML(string=html,url_fetcher=fetchURL).write_pdf(target="/reports/report.pdf",stylesheets=[cfp.getVar("style","aboveTemplate","string")])

All of the power of Pandas is now at the disposal of the programmer along with anything that can be embedded in a url.

Generating and Saving Graphs with Pandas and Matplotlib or PyPlot

It is possible to embed graphs into a pdf by saving them as images.

Obviously, the folks behind pdf allow most things made of bytes to be placed in objects in a PDF (a pdf is a series of pdf objects much like xml with byte strings in base 64 as the text). See my magic numbers post and try to parse or write your own image to a pdf if you really want to dive into the subject.

Generating graphs is simple with Pandas. Just make sure to match the template graph with the image url.

import matplotlib.pyplot as plt
from pandas import DataFrame

#perform operations on the data to transform the graph. Each column is a new plot line.
data={'PlotA':[1,3,5,2],'PlotB':[1,3,4,None]} #None pads the shorter series and becomes NaN
df=DataFrame(data)
ax=df.plot()
fig=ax.get_figure()
fig.savefig('graph.png') #also can save as own pdf to be merged as described in an earlier post

It is possible to do this directly with pyplot as well.

Using Flask to Create a PDF Web Server

It appears from comments and questions that PDF servers are often requested. The clunkiness of Spring can now be replaced easily with the combination of the mentioned tools and the Flask web framework. These tools allow for the quick and easy creation of a pdf web server. However, asyncore with socket, Spring with Java based tools, or other tools will need to be run if the plan is to use something akin to the proxy pattern, a sad state of affairs.

To create the server, simply create a function with a decorator specifying the path, much as would happen with an annotation in Spring.

from io import BytesIO

from flask import Flask
from flask import send_file
app = Flask(__name__)

def otherFunc():
    pass

@app.route("/")
def generatePDF():
    #code to generate PDF........
    #pdf is assumed to hold the generated PDF as bytes, e.g. the return value of write_pdf()
    buf = BytesIO(pdf)
    #attachment_filename was renamed download_name in Flask 2.0
    return send_file(buf, mimetype='application/pdf', as_attachment=True, download_name='file.pdf')

if __name__ == "__main__":
    app.run()

Weasyprint also includes a way to incorporate pre-generated pdfs from within the same application.