Deferred Result: Make Spring Boot Non-Blocking


Non-blocking I/O is achievable in Spring through the DeferredResult. This article examines how to improve throughput on a heavily trafficked website using Spring and how to test a deferred result.

Spring is fast but, in the traditional servlet model, dedicates a single thread to each request. The DeferredResult helps eradicate this issue.

Why is My Spring Boot Website Slow?

Spring is written in Java, which supports non-blocking I/O. However, Spring dedicates a single thread to each request, and no other work is performed on that thread for the life of the request.

This means that performance tuning often revolves around dedicating more RAM or CPU to the website. The JVM must schedule and track every one of these threads, and that overhead becomes cumbersome no matter how well built a system is.

Executing a single request per thread is particularly problematic when calls to other services are made or heavy computation is performed. For instance, a handler might issue a query to a database system. The handling thread then waits until the response is received before continuing. A bottleneck occurs when thousands or millions of requests are processed at once.

When Can Non-Blocking I/O Help?

Non-blocking I/O as outlined above performs a request's work on a thread pool. Computation- and I/O-laden requests are completed in the background while the main thread works on other tasks. By not blocking, the computer is free to perform more work, which includes handling other requests in Spring.

Java contains a variety of mechanisms to avoid blocking. A fork-join pool and a standard thread pool are both available to the developer.

Fork-join pools are particularly useful: idle threads in the pool steal work from the queues of busy threads.

How Can I Perform Non-Blocking I/O in Spring Boot?

Spring Boot allows non-blocking I/O to be performed through the DeferredResult. An error or response object is set on the deferred result, which serves as the return value:

// Assumes an SLF4J logger field named "log" on the controller
@PostMapping("/test_path")
public DeferredResult<String> testDeferred() {
    DeferredResult<String> deferred = new DeferredResult<>();
    // Log if no result is set before the async request times out
    deferred.onTimeout(() -> log.info("timeout"));
    Runnable task = () -> {
        deferred.setResult("Success");        // set the response body
        // deferred.setErrorResult("error");  // or set an error instead
    };
    // Run the work on the common fork-join pool so the servlet thread is freed
    ForkJoinPool.commonPool().submit(task);
    return deferred;
}

A DeferredResult is thread-safe. The return value is set with setResult, and an error result and a callback executed on timeout may be configured as well.

How Can I Test Non-Blocking I/O in Spring?

Testing non-blocking I/O is possible through MockMvc or REST Assured. MockMvc handles asynchronous requests differently from synchronous calls:

ResultActions resultActions = this.mockMvc.perform(post("/test_path"));
MvcResult result = resultActions.andExpect(request().asyncStarted()).andReturn();
result = this.mockMvc.perform(asyncDispatch(result)).andReturn();


After asserting that asynchronous processing started and dispatching the completed result, testing continues as before.

Conclusion

Spring and Spring Boot can perform non-blocking I/O calls. This allows the framework to wait for computation-heavy processes and asynchronous network requests to complete while other network calls are handled by the original thread.

A fork-join pool or thread pool ensures that only a certain number of threads are created to handle background tasks.


2019 Trend: Data Engineering Becomes a Household Name


There will be many 2019 trends that last well beyond the year. With Tableau now a household name, Salesforce a workhorse for analytics, SAS continuing to grow through JMP, and small players such as Panoply acquiring funding, one hot 2019 trend in technology will be data engineering.

This underscores a massive problem in the field of data: data management tools and frameworks are severely deficient. Many merely perform materialization.

That is changing this year, and it means that data engineering will be an important term over the next few years. Automation will become a reality.

What is a data engineer?

Data engineers create pipelines. This means automating the handling of data from aggregation and ingestion to the modeling and reporting process. These professionals handle big data or even small data loads with streaming playing an important role in their work.

As they cover the entire pipeline for your data and often implement analytics in a repeatable manner, data engineering is a broad task. Terms such as ETL, ELT, verification, testing, reporting, materialization, standardization, normalization, distributed programming, crontab, Kubernetes, microservices, Docker, Akka, Spark, AWS, REST, Postgres, Kafka, and statistics are commonly slung with ease by data engineers.

Until 2019, integrating systems often meant combining a variety of tools into a cluttered wreck. A company might deploy Python scripts for visualization, use Vantara (formerly Pentaho) for ETL, combine a variety of aggregation tools through Kafka, keep data warehouses in PostgreSQL, and may even still use Microsoft Excel to store data.

The typical company spends $4,000 to $8,000 per employee providing these pipelines. This cost is unacceptable and can be avoided in the coming years.

Why won’t ELT kill data engineers?

ELT applications promise to get rid of data engineers. That is pure nonsense meant to attract ignorant investors' money:

  • ELT is horrible for big data with continuous ETL proving more effective
  • ELT is often performed on data sources that already underwent ETL at the companies they were purchased from, such as Acxiom, Nasdaq, and TransUnion
  • ELT eats resources in a significant way and often limits its use to small data sets
  • ELT ignores issues related to streaming from surveys and other sources which greatly benefit from the requirements analysis and transformations of ETL
  • ELT is horrible for integration tasks where data standards differ or are non-existent
  • You cannot run good AI or build models on poorly or non-standardized data

This means that ETL will continue to be a major part of a data engineer's job.

Of course, since data engineers translate business analyst requirements into reality, the job will continue to be secure. Coding may become less important as new products are released but will never go away in the most efficient organizations.

Why is Python likely to become less popular?

Many people point to Python as a means for making data engineers redundant. This is simply false.

Python is limited. This means that the JVM will rise in popularity with data scientists and even analysts as companies want to make money on the backs of their algorithms. This benefits data engineers, who are typically proficient in at least Java, Go, or Scala.

Python works for developers, analysts, and data scientists who want to control tools written in a more powerful language such as C++ or Java. Pentaho, since acquired by Hitachi, experimented with the language. However, being 60 times slower than the JVM and often requiring three times the resources, it is not an enterprise-grade language.

Python does not provide power. It is weak at parallelism and effectively single threaded, and while any language can achieve parallelism somehow, Python falls back on heavy OS threads and processes to perform anything asynchronously. This is horrendous.

Consider the case of using Python’s Celery versus Akka, a Scala and Java-based tool. Celery and Akka perform the same tasks across a distributed system.

Parsing millions of records in Celery can quickly eat up more than fifty percent of typical server resources with a mere ten processes. RabbitMQ, the messaging framework behind Celery, can only parse 50 million messages per second on a cluster. Depending on the use case, Celery may also require Redis to run effectively. This means that an 18-logical-core server with 31 gigabytes of RAM can be severely bogged down processing tasks.

Akka, on the other hand, is the basis for Apache Spark. It is lightweight and all-inclusive. 50 million messages per second are attainable with 10 million actors running concurrently at much less than fifty percent of a typical server's resources. With not every use case requiring Spark, even in data engineering, this is an outstanding difference. Not needing a message-routing and results backend means that less skill is required for deployment as well.

Will Scala become popular again?

When I started programming in Scala, the language was fairly unheard of. Many co-workers merely looked at this potent language as a curiosity. Eventually, Scala's popularity started to wane as Java developers remained focused on websites and ignored creating the frameworks for Scala that exist in Python.

That is changing. With the rise of R, whose syntax is incredibly similar to Scala, mathematicians and analysts are gaining skill in increasingly complex languages.

Perhaps due to this, Scala is making it back into the lexicon of developers. Python's advantage was greatly reduced in 2017 as previously non-existent or non-production-level tools were released for the JVM.

Consider what is now at least version 1.0 in Scala:

  • Nd4j and Nd4s: A Scala and Java-based n-dimensional array framework that boasts speeds faster than NumPy
  • Dl4J: Skymind is a terrific company producing tools comparable to torch
  • TensorFlow: Contains APIs for both Java and Scala
  • Neanderthal: A Clojure based linear algebra system that is blazing fast
  • OpenNLP: A new framework that, unlike the Stanford NLP tools, is actively developed and includes named entity recognition and other powerful transformative tools
  • Bytedeco: This project is filled with angels (I actually think they came from heaven) whose innovative and nearly automated JNI creator has created links to everything from Python code to Torch, libpostal, and OpenCV
  • Akka: Lightbend continues to produce distributed tools for Scala with now open sourced split brain resolvers that go well beyond my majority resolver
  • MongoDB connectors: Python’s MongoDB connectors are resource intensive due to the rather terrible nature of Python byte code
  • Spring Boot: Scala and Java are interoperable, but benchmarks of Spring Boot show at least a 10000 request per second improvement over Django
  • Apereo CAS: A single sign-on system that adds terrific security to disparate applications

Many of these frameworks are available in Java. Since Scala runs on the JVM and interoperates with any Java library, the languages work together seamlessly. Scala is cleaner, functional, highly modular, and requires much less code than Java, which puts these tools within reach of analysts.

What do new tools mean for a data engineer?

The new Java, Scala, and Go tools and frameworks mean that attaining 1000 times the speed on a single machine with significant cost reduction over Python is possible. It also means chaining millions of moving parts to a solid microservices architecture system instead of a cluttered monolithic wreck.

The result is clear. My own company is switching off of Python everywhere except for our responsive and front end heavy web application for a fifty percent cost reduction in hardware.

How will new tools help data engineers?

With everything that Scala and the JVM offer, data engineers now have a potent tool for automation. These valuable employees may not be creating the algorithms, but they will be transforming data in smart ways that produce real value.

Companies no longer have to rely on archaic languages to produce messy systems, and this will translate directly into value. Data engineers will be behind this increase in value as they can more easily combine tools into a coherent and flexible whole.

Conclusion

The continued rise of JVM backed tools starting in 2018 will make data pipeline automation a significant part of a company’s IT cost. Data engineers will be behind the evolution of data pipelines from disparate systems to a streamlined whole backed by custom code and new products.

Data engineering will be a hot 2019 trend. After this year, we may just be seeing the creation of Skynet.

Running Django Tests Without Migrations


Django is a strong framework for building reliable websites. However, it is showing signs of age. Particularly, it is failing where tools such as Spring Boot excel.

This article examines how to overcome yet another issue encountered in Django, running tests without migrating databases.

Use Case

There are many times when a database migration is inappropriate for testing:

  • we have separated test, development, and production databases with custom features requiring testing
  • there are fixtures pre-loaded into a database
  • we have legwork to complete through scripts alongside database setup
  • dropping and re-creating a database eliminates data that is absolutely necessary to another function in our test environment
  • nothing we do can be included in a TestRunner, at least not reliably

Sadly, Django developers, among other issues they refuse to fix, were not willing to create a workaround until recently. The workaround does not work in the latest version.

Django Unit Test Framework

The Django unit test framework comes with many features typically reserved for tools such as ghost.py or selenium.

For instance, Django provides:

from django.test import TestCase
from django.test import Client
from django.test import RequestFactory

A RequestFactory is used to perform specific tests while the Client acts as a browser would.

While there is an option to use pytest, the changing nature of Django manages to continually break this tool. Importing models from different applications using multiple databases is a broken process at the time of publication.

Unittest Format

Django provides an extensive tutorial for setting up unit tests. The setup is straightforward:

from django.test import TestCase
from django.test import Client
from django.test import RequestFactory


class AdminTestCases(TestCase):
    multi_db = True

    def setUp(self):
        self.client = Client()
        self.rf = RequestFactory()

    def test_create_superuser(self):
        pass

    def test_remove_superuser(self):
        pass

    def test_fail_create_superuser(self):
        pass

    def test_fail_remove_superuser(self):
        pass

    def tearDown(self):
        pass

This skeleton prepares a client and request factory, stubs out tests, and provides a tear-down function. setUp and tearDown run before and after each test.

Multiple Databases

For security and other purposes, it is often necessary to separate databases. This offers a degree of protection for sensitive information.

Unfortunately, this also presents an issue with the tests in which a single database is used. To avoid this, each test case must include the following:

multi_db = True

This line informs the test runner to search across the different databases.

Ignoring Migrated Databases

Now, for the meat and potatoes. We want to ignore database migration altogether at times. It is still wise to have a script and drop any unused data.

Django allows for the generation of custom TestRunner classes. As of Django 2.0 and later, the following runner avoids migrating a database:

from django.test.runner import DiscoverRunner


class NoMigrationTestRunner(DiscoverRunner):
  """ A test runner to test without database creation """

  def setup_databases(self, **kwargs):
    """ Override the database creation defined in parent class """
    pass

  def teardown_databases(self, old_config, **kwargs):
    """ Override the database teardown defined in parent class """
    pass

The NoMigrationTestRunner extends the DiscoverRunner and directly overrides the setup_databases and teardown_databases methods. These methods are normally used to set up certain connections, create database tables, and perform cleanup.

In this example, the setup and teardown methods are left blank to avoid creating new tables as well as to allow for the smooth usage of multiple databases.

Django is informed of the new test runner through a variable in the relevant settings configuration:

TEST_RUNNER = 'path_to.NoMigrationTestRunner'

Execution

Tests are executed using manage.py:

python manage.py test
python<version> manage.py test path.to.tests.TestModule --settings=<settings_file>

The first example is generic and collects tests found through our DiscoverRunner. The second command runs tests using a specific Python version and a dotted path to the relevant test module, which is imported using this string. A settings file is also provided in the second command.

Conclusion

Django has extensive documentation. However, as the framework changes, certain aspects will break. This article covered how to set up tests for multiple databases, skip database migration in tests, and execute our tests.

Fluff Stuff: Better Governments, Better Processes, Simplr Insites

Cities are heading towards bankruptcy. The credit rating of Stockton, CA was downgraded. Harrisburg, PA is actually bankrupt.  It is only a matter of time before Chicago implodes. Since 1995, city debt rose by between $1.3 and $1.8 trillion. While a large chunk of this cost is from waste, more is the result of using intuition over data when tackling infrastructure and new projects. Think of your city government as the boss who likes football more than his job so he builds a football stadium despite your company being in the submarine industry.

This is not an unsolvable nightmare.

Take an effective use case where technologies and government processes were outsourced. As costs rose in Sandy Springs, GA, the city outsourced and achieved more streamlined processes, better technology, and lower costs. Today, without raising taxes, the city is in the green. While Sandy Springs residents are wealthy, even poorer cities can learn from this experience.

Cities must run projects in an extremely scientific manner and require an immense amount of clean, quality, well-managed data isolated into individual projects to run appropriately. With an average of $8,000 spent per employee on technology each year and an immense effort spent acquiring analysts and modernizing infrastructure, cities are struggling to modernize.

It is my opinion, one I am basing a company on, that the provision of quality data management, sharing and collaboration tools, IT infrastructure, and powerful project and statistical management systems in a single SaaS tool can greatly reduce the $8,000-per-employee cost and improve budgets. These systems can even reduce the amount of administrative staff, allowing money to flow to where it is needed most.

How can a collaborative tool tackle the cost problem? Through:

  • collaborative knowledge sharing of working, ongoing, and even failed solutions
  • public facing project blogs and information on organizations, projects, statistical models, results, and solutions that allow even non-mathematical members of an organization to learn about a project
  • a security-minded institutional resource manager (IRM, best thought of as a large, securable, shared file storage system) capable of expanding into the petabytes while maintaining FERPA, HIPAA, and state and local regulations
  • the possibility to share data externally, keep it internal, or keep the information completely private while obfuscating names and other protected information
  • complexity analysis (graph based analysis) systems for people, projects, and organizations clustered for comparison
  • strong comparison tools
  • potent and learned aggregation systems with validity in mind ranging from streamed data from sensors and the internet to surveys to uploads
  • powerful drag and drop integration and ETL with mostly automated standardization
  • deep diving upload, data set, project, and model exploration using natural language searching
  • integration with everything from a phone to a tablet to a powerful desktop
  • access controls for sharing the bare minimum amount of information
  • outsourced IT infrastructure including infrastructure for large model building
  • validation using proven algorithms and increased awareness of what that actually means
  • anomaly detection
  • organization of models, data sets, people, and statistical elements into a single space for learning
  • connectors to popular visualizers such as Tableau and Vantara with a customizable dashboard for general reporting
  • downloadable sets with two entity verification if required that can be streamed or used in Python and R

Tools such as ours significantly reduce the cost of IT by as much as 65%. We eliminate much of the waste in the data science pipeline while trying to be as simple as possible.

We should consider empowering and streamlining the companies, non-profits, and government entities such as schools and planning departments that perform vital services before our own lives are greatly affected. Debt and credit are not solutions to complex problems.

Take a look, or don’t. This is a fluff piece on something I am passionately building. Contact us if you are interested in a beta test.

Automating Django Database Re-Creation on PostgreSQL


The Django database migration system is a mess. This mess is made more difficult when using PostgreSQL.

Schemas go unrecognized without a workaround, indices can break when using the workaround, and processes are very difficult to automate with the existing code base.

The authors of Django, or someone such as myself who occasionally finds the time to dip in and unclutter a mess or two, will eventually resolve this. However, the migration system today is a train wreck.

My current solution to migrating a Postgres based Django application is presented in this article.

Benefits of Getting Down and Dirty

The benefits of taking the time to master this process for your own projects are not limited to the ability to automate Django. They include the ability to easily manage migrations and even automate builds.

Imagine having a DDL that is defined in your application and changed with one line of code without ever touching your database.

Django is a powerful and concurrent web framework with an enormous amount of add-ons and features. Mastering this task can open your code base to amazing amounts of abstraction and potential.

Well Defined Problems

While Django is powerful, integration with Postgres is horrible. A schema detection problem can cause a multitude of problems as can issues related to using multiple databases. When combined, these problems sap hours of valuable time.

The most major issues are:

  • lack of migration support for multiple databases
  • lack of schema support (an issue recognized over 13 years ago)
  • some indices (PostGIS polygon ids here) break when using the schema workaround

Luckily, the solutions for these problems are simple and require hacking only a few lines of your configuration.

Solving the Multiple Database Issue

This issue is more annoying than a major problem. Simply obtain a list of your applications and use your database names as defined by your database router in migration.

If there are migrations you want to reset, use the following article instead of the migrate command to delete your migrations and make sure to drop tables from your database without deleting any required schemas.

Using python[version] manage.py migrate [app] zero did not actually drop and recreate my tables for some reason.

For thoroughness, discover your applications using:

python[version] manage.py showmigrations

With your applications defined, make the migrations and run the second line of code in an appropriate order for each application:

python[version] manage.py makemigrations [--settings=my_settings.py]
python[version] manage.py migrate [app] --database=[database_name] [--settings=my_settings.py]
....

As promised, the solution is simple. The --database switch matches a database name in your configuration and must be present, as Django only uses your default configuration when it is omitted; in that case migrations can appear to complete without actually being applied to the intended database.

Solving the Schema Problem without Breaking PostGIS Indices

Django offers terrific Postgres support, including support for PostGIS. However, Postgres schemas are not supported. Tickets were opened over 13 years ago but were merged and forgotten.

To ensure that schemas are read properly, setup a database router and add a database configuration utilizing the following template:

"default": {
        "ENGINE": "django.contrib.gis.db.backends.postgis",
        "NAME": "test",
        "USER": "test",
        "PASSWORD": "test",
        "HOST": "127.0.0.1",
        "PORT": "5432",
        "OPTIONS": {
            "options": "-csearch_path=my_schema,public"
        }
 }

A set of options is appended to your DSN via this configuration, including a search path with a primary schema and a fallback schema used when the first schema is not present. Note the lack of spacing around the option.
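The settings above assume a database router is in place. A minimal sketch of one, routing a hypothetical my_app application to the default database and registered through the DATABASE_ROUTERS setting, might look like the following:

class MyAppRouter:
    """Route the hypothetical my_app application to the default database."""

    def db_for_read(self, model, **hints):
        return 'default' if model._meta.app_label == 'my_app' else None

    def db_for_write(self, model, **hints):
        return 'default' if model._meta.app_label == 'my_app' else None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if app_label == 'my_app':
            return db == 'default'
        return None

# settings.py
# DATABASE_ROUTERS = ['path_to.MyAppRouter']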

Now that settings.py is configured and a database router is established, add the following Meta class to each Model:

    class Meta:
        db_table = 'schema\".\"table'

Notice the awkward value for db_table. This is due to the way that tables are specified in Django. It is possible to leave managed as True, allowing Django to perform migrations, as long as the database is cleaned up a bit.
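For context, a minimal model using this hack might look like the following sketch; the filestorage schema and file_record table are hypothetical:

from django.db import models


class FileRecord(models.Model):
    name = models.CharField(max_length=255)
    created = models.DateTimeField(auto_now_add=True)

    class Meta:
        # Quoted so the emitted SQL references "filestorage"."file_record"
        db_table = 'filestorage\".\"file_record'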

If any indices break on migration at this point, simply drop the custom table definition and use whichever schema the table ends up in. There is no apparent workaround for this.

Now Your Migration Can Be Run In a Script, Even in a CI Tool

After quite a bit of fiddling, I found that it is possible to script and thus automate database builds. This is incredibly useful for testing.

My script, written to recreate a test environment, included a few direct SQL statements as well as calls to manage.py:

echo "DROP SCHEMA IF EXISTS auth CASCADE" | python3 ./app_repo/simplred/manage.py dbshell --settings=test_settings
echo "DROP SCHEMA IF EXISTS filestorage CASCADE" | python3 ./app_repo/simplred/manage.py dbshell --settings=test_settings
echo "DROP SCHEMA IF EXISTS licensing CASCADE" | python3 ./app_repo/simplred/manage.py dbshell --settings=test_settings
echo "DROP SCHEMA IF EXISTS simplred CASCADE" | python3 ./app_repo/simplred/manage.py dbshell --settings=test_settings
echo "DROP SCHEMA IF EXISTS fileauth CASCADE" | python3 ./app_repo/simplred/manage.py dbshell --settings=test_settings
echo "DROP SCHEMA IF EXISTS licenseauth CASCADE" | python3 ./app_repo/simplred/manage.py dbshell --settings=test_settings
echo "DROP OWNED BY simplrdev" | python3 ./app_repo/simplred/manage.py dbshell --settings=test_settings
echo "CREATE SCHEMA IF NOT EXISTS auth" | python3 ./app_repo/simplred/manage.py dbshell --settings=test_settings
echo "CREATE SCHEMA IF NOT EXISTS filestorage" | python3 ./app_repo/simplred/manage.py dbshell --settings=test_settings
echo "CREATE SCHEMA IF NOT EXISTS licensing" | python3 ./app_repo/simplred/manage.py dbshell --settings=test_settings
echo "CREATE SCHEMA IF NOT EXISTS simplred" | python3 ./app_repo/simplred/manage.py dbshell --settings=test_settings
echo "CREATE SCHEMA IF NOT EXISTS licenseauth" | python3 ./app_repo/simplred/manage.py dbshell --settings=test_settings
echo "CREATE SCHEMA IF NOT EXISTS fileauth" | python3 ./app_repo/simplred/manage.py dbshell --settings=test_settings
find ./app_repo -path "*/migrations/*.py" -not -name "__init__.py" -delete
find ./app_repo -path "*/migrations/*.pyc"  -delete
python3 ./app_repo/simplred/manage.py makemigrations --settings=test_settings
python3 ./app_repo/simplred/manage.py migrate auth --settings=test_settings --database=auth
python3 ./app_repo/simplred/manage.py migrate admin --settings=test_settings --database=auth
python3 ./app_repo/simplred/manage.py migrate registration --settings=test_settings --database=auth
python3 ./app_repo/simplred/manage.py migrate sessions --settings=test_settings --database=auth
python3 ./app_repo/simplred/manage.py migrate oauth2_provider --settings=test_settings --database=auth
python3 ./app_repo/simplred/manage.py migrate auth_middleware --settings=test_settings --database=auth
python3 ./app_repo/simplred/manage.py migrate simplredapp --settings=test_settings --database=data
python3 ./app_repo/simplred/manage.py loaddata --settings=test_settings --database=data fixture

Assuming the appropriate security measures are followed, this script works well in a Bamboo job. It drops and recreates any necessary database components, clears old migrations, and then creates and performs new migrations. Remember, this script recreates a test environment.

The running Django application, which is updated via a webhook, doesn’t actually break when I do this. I now have a completely automated test environment predicated on merely merging pull requests into my test branch and ignoring any migrations folders through .gitignore.

Conclusion

PostgreSQL is powerful, as is Django, but combining the two requires some finagling to achieve maximum efficiency. Here, we examined a few issues encountered when using Django and Postgres together and discussed how to solve them.

We discovered that it is possible to automate database migration without too much effort if you already know the hacks that make this process work.


Bamboo and Webhooks: Automating CI and Deployment


The name of the game in technology is doing more with less. It is the core element in the drive for profitability. In today's market, that means finding ways to both reduce the number of employees in an organization and improve the processes that allow for such a reduction.

The days of paying $80,000 for an employee with limited knowledge, only ETL or only XML, are waning or dead. Even more importantly, the cost of software bugs in 2016 reached $1.1 trillion.

Reducing labor costs and reducing the cost of software bugs means producing efficient, clean, easy to follow, and effective systems backed by quality processes (word play with quality assurance intended). This article dives into how to use Bamboo, Flask, and Bitbucket to achieve continuous integration and continuous deployment to a test environment.

The concepts in this article can be ported to a distributed environment using ngrok but hopefully a more secure option is available. The ngrok solution is explained well on the Bitbucket website but promotes non-free software and external providers. This article intends to explain how to do this for free without linking to an external provider.

Environment

An automated test environment can be set up with:

  • Bitbucket
  • Flask
  • Bamboo
  • An Ubuntu or Linux Server for the purposes of this article

Bitbucket is a powerful repository service based on git. Part of the repository's power is the ability to add webhooks. These are essentially triggers that perform POST requests to a web server on specified events.

Webhooks require a web server. Flask is a lightweight, easy-to-maintain, single-threaded framework. This is perfect for most repositories that do not see a significant number of pull requests.

Other hooks, including to Jenkins, are available as well.

Another powerful feature of Bitbucket worth noting is the ability to organize repositories into projects. This suits companies with well defined architectures extremely well.

Bamboo is a continuous integration (CI) tool, a tool that allows code to be continually built and tested from a Bitbucket repository with minimal human interaction. This tool integrates directly with Jira, Bitbucket, and nearly every other Atlassian product.

Automating a System

Most companies with a forward facing website have production and test environments. Integrating and automating the environment is made possible with the use of a continuous integration (CI) environment and webhooks.

Bamboo can be set up to automatically synchronize code when a commit or merge is made in the remote repository. However, jobs in this CI run in an isolated way with no access to write to the local machine beyond SCP.

While a CI ensures that code runs correctly and can be set to notify users of failed builds, the synchronization issue outside of the isolated build environment is solved with webhooks. Calls to web servers specified in these hooks are triggered whenever a certain event is performed in the repository, allowing for more automated and instantaneous regression and integration testing.

Step 1: Setting Up SSH in Bitbucket

Any automated pull requires using SSH instead of HTTP as the means to obtain code from the remote repository. This requires setting up an SSH key on Bitbucket and on the local machine where your test environment is configured.

Setting up an rsa key on Ubuntu is simple:

  1. run ssh-keygen -t rsa
  2. enter only the name of the key
  3. copy the public key (.pub) to your clipboard
  4. create or open the config file in your home's .ssh directory
  5. add your Bitbucket host on a new line as Host <host ip address>
  6. on a new tab-indented line, point to the private key generated in steps 1-2 with IdentityFile <private key file name>
  7. log in to Bitbucket
  8. click on your avatar
  9. click on Manage Account
  10. navigate to SSH Keys on the left hand menu
  11. click on add key towards the center of the page and simply paste your key

Step 2: Creating a Webhook in Flask

A webhook can be set up simply in Flask. James Innes of Ogma Development wrote a terrific article on how to set up a webhook in Flask. After creating the project as specified, run the code using:

flask run

It is possible to change the application host by simply modifying app.run() in your hook as follows:

app.run(host=<host>, port=[unreserved port])

Note the host, port, and token if applicable when your application is running and perform a GET request with the appropriate token from the machine or container running your Bitbucket repository.
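For reference, a minimal sketch of such a hook is shown below; the verify_token parameter and the pull_repository() helper are assumptions for illustration rather than part of the referenced article:

from flask import Flask, abort, request
import subprocess

app = Flask(__name__)
VERIFY_TOKEN = "change-me"  # assumed shared secret checked on every call


def pull_repository():
    # Hypothetical helper: update the test checkout previously cloned over SSH
    subprocess.run(["git", "-C", "/srv/test_site", "pull"], check=True)


@app.route("/webhook", methods=["GET", "POST"])
def webhook():
    if request.args.get("verify_token") != VERIFY_TOKEN:
        abort(403)
    if request.method == "POST":
        pull_repository()  # fired on Bitbucket push and merge events
    return "OK"


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5007)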

Step 3: Setting up Webhooks in Bitbucket

Webhooks in Bitbucket can occur at the project or repository level. For the best control, this article explains how to create a hook at the repository level. Imagine triggering a multitude of unnecessary webhooks when modifying a single repository.

Setting up a webhook on a Bitbucket repository is simple:

  1. Log in to Bitbucket
  2. navigate to your target project
  3. Navigate to your repository
  4. Click on the settings gear in the left hand menu
  5. Click on webhooks in the left hand menu
  6. Enter a name for your hook
  7. Enter the url of your webhook server (of the form http[s]://<host>:<port>/webhook[?verify_token=<token>] in this context)
  8. In the options menus, ensure that push and merged are clicked
  9. Click save at the bottom of the screen

Currently, Bitbucket tests these connections with an empty POST request. This is not handled by our code and empty POST requests are generally discouraged anyway.

It is now possible to execute a git pull from a module under the webhook function in your Flask application on every merge into or push to the repository. This can be used to synchronize code and even keep a test site constantly ready for testing if you are using Django. As Flask is single threaded, it is well suited to the modest task of serving hooks.

Create a Build in Bamboo

With our repository thoroughly connected to the remote server, we can create build plans using Bamboo. Bamboo organizes the build flow into plans, which contain jobs, which in turn maintain tasks.

For an annual fee of just ten dollars, it is possible to obtain ten jobs with an unlimited number of tasks.

Future articles will explain how to integrate applications built with Django, Celery, and PyTorch. Still, even databases created using Django are recreatable using only the models and shell commands from the Script task.

As stated before, Bamboo can be set up to notify an appropriate developer on a build failure.

Conclusion

With these elements in place, it is possible to keep running test code up to date and continue to run builds against this environment with minimal human interaction. This greatly reduces labor and thus cost which can be passed to the end user in the form of savings.

Event Chaining Framework in JavaScript


Asynchronous code flow is notoriously difficult to follow. With the deprecation of synchronous Ajax calls and increasing use of concurrent code in JavaScript, we often need to have functions execute in a step by step process via chaining.

This article presents a short solution to chaining asynchronous code in JavaScript. I wrote a small event manager in JavaScript using the principles defined in this article.

Deferred

The JQuery Deferred object acts much like promises and futures in more powerful object-oriented languages such as Java and Scala. A deferred object is created and used to call a function or chain of functions as needed.

var dfrd = $.Deferred();
dfrd.then(func1).then(function(result){
    console.log(result);
});
dfrd.resolve('result');

A deferred object works in exactly the same way as a JQuery Ajax call and uses the same method to handle an asynchronous method call. The created deferred can be chained using .then which allows for a result to be passed via .resolve.

A deferred object must be resolved to execute a chain. The .then method allows for not only a callback to be executed on success, but failure and progress callbacks to be executed as well.

Nesting Problem With JQuery Solutions

Nesting is messy. Unfortunately, with the JQuery solution, it is impossible to chain deferred functions without nesting them.

$.when(func1()).then(function(result1) { 
    $.when(func2(result1)).then(function(result2) {
        alert(result2);
    }) 
});

With this solution, the deferred method waits on another result before executing the next function. The call to .when allows for a promise to be utilized.

Custom Chaining Solution

While nesting is useful, it is not always the cleanest solution. Another option is to create a stack and then pop from the stack while generating a new chain.

function get_chain(step, previous_chain){
    var to = step.timeout;
    var func = step.func;
    var new_chain =  $.Deferred();
    if(to > 0){
        new_chain.then(function(result){
            return execute_step_timeout(to, result);
        })
    }
    new_chain.then(function(result){
           return func(result);
    });
    if(previous_chain){
        new_chain.then(function(result){
            previous_chain.resolve(result);
        });
    }
    return new_chain;
}

Here, a function takes a step configuration containing any deferred timeout and a previous chain and constructs a new chain.

Such a process is akin to writing the following code:

var nchain = $.Deferred();
nchain.then(function(result){
	return execute_step_timeout(1, result);
}).then(function(result){
	return func2(result);
}).then(function(result2) {
    alert(result2);
});

$.when(func1()).then(function(result1) {
    nchain.resolve(result1);
});

Conclusion

Asynchronous JavaScript, while not parallel, often requires chaining. This article reviewed how this can be done cleanly using the JQuery Deferred object.


Red Hat, Is IBM Making a Desperate Grab?

CentOS, Fedora, Ubuntu: which comes to mind first? Even in the world of servers, the likely answer is Ubuntu, which is Debian based and not Red Hat based.

There are a few enormous reasons why Ubuntu and Debian are winning the battle for Linux supremacy.

  1. Ease of Use
  2. Widely supported
  3. Extremely configurable
  4. Open source all the way through and thus FREE
  5. Popular, both an upside and a downside
  6. Ubuntu is a leader in developing distributed computing systems
  7. Ubuntu works with Amazon and others and will continue to do so unless it is sold

These are only a few of the reasons Ubuntu and Debian constantly top the list of popular Linux distributions.

The same cannot easily be said for Fedora or CentOS. The lag in support alone costs time and money while frustrating users. While most aspects of the operating system are the same, Ubuntu has some serious advantages in a rapidly changing world.

The main advantage given for using CentOS is security. Put another way, however, CentOS often lags behind in package support in order to allow a testing phase. This might mean that a small, undetected security hole leaks all of your files to a hacker while being quietly patched by a developer. It also means installing some cutting-edge technology from source. CentOS may not be more secure at all.

Furthermore, when developing security applications or building extremely small packages, Arch Linux is typically the operating system of choice.

Fedora is definitely an unstable release as it tends to be where testing occurs for Red Hat.

Now, consider some of the other downsides which start to resemble a loop:

  • less popular operating systems tend to receive less support
  • operating systems that were once popular may be out of reach of many when starting to sell commercially (a concern for startups)
  • IBM is notoriously behind the times and seeing revenue decline
  • IBM has failed to capitalize even when it has a market advantage by years
  • commercial products restrict usage, further limiting the user base

With an increased likelihood that Red Hat distributions will suddenly stop being made available by Amazon and used by Oracle, Red Hat's competition could be seeing purchase offers and a huge boost. However, if Ubuntu remains free, its already considerable popularity could spike a little.

There are trends which might break this loop. Commercial products always receive support. Commercial products are also peddled in schools like drugs are to doctors. The last point is largely moot, as degree-seeking students in non-business technical classes (where databases, operating systems, and more technical math are taught) tend to shun anything that costs them money, and Oracle and Android have an advantage everywhere else.

Basically, in the specialist, educational, and enthusiast markets, consumers and professionals may well start to shy away from Red Hat in fear of tying themselves, their products, and their wallets to IBM.

All things given, why then would IBM buy Red Hat for over $30 billion? Red Hat brings in only around $3 billion in revenue per year.

  • Improve and modernize their own systems – at $30 billion?
  • Acquire talent that might improve their future product offerings to compete with Microsoft, Amazon, and the open source community – at $30 billion?
  • Spend even more money to build a business operating system to compete with Windows to try to attract Microsoft customers (Linux users don’t use Windows for the reasons they won’t use Red Hat now)

I am sure that the last point is fairly moot given Microsoft's decades-long technological leap over IBM.

Perhaps I am not seeing something, but the conclusion seems to be that IBM's lag in the AI space after an early lead has left it desperate and that, with Red Hat's popularity and stock price declining over the last year while other systems grew over the last ten, two dying animals are trying to combine their strengths.

The Case for Using an IRM to Scale Data Intake

Among many, there are three major problems faced by an analyst before data is useful:

  • data aggregation and storage
  • data security and access
  • data wrangling (ETL/ELT)

This article deals with data security and access using an information resource management system, IRM. My own company, Simplr Insites LLC, is writing such a system alongside a file storage solution in an effort to modernize the research process.

Problem

One significant problem faced in research and cooperation is the attainment of clean and useful data. Obtaining this data often means gaining access to systems, forming legal agreements, obfuscating certain data, and embarking on the painful process of data wrangling.

While ETL and ELT are critical steps, just obtaining sensitive data, even from within an organization, is tricky. Consider the following cases related directly to access:

  • data sets include confidential information
  • data sets are ensnared in legal agreements regarding who can access data
  • users want to control access to data to ensure it is not misused
  • external users are allowed varying degrees of access

IRM as a Solution

Oracle generated a solution that attempts to tackle the data security issue. The Oracle IRM documentation provides a rather informative graphical overview of their tool:

[Diagram: Oracle IRM architecture overview]

In this system, an external user accesses a load-balanced IRM server application which controls rights and access to different resources and files. Several firewalls help to improve security along with authentication, access grants, and encryption. Web services and internal users utilize the IRM server as well.

Beyond the visible components, tokens can be used to instantly manage resources and propagate access changes.

Most file systems also offer the capability to pull the date when a resource was created or modified and various permissions information. This is useful for logging purposes.
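As a small illustration, this metadata can be pulled in Python with os.stat; the path below is hypothetical:

import os
import stat
from datetime import datetime, timezone

info = os.stat("/srv/irm/files/report.pdf")  # hypothetical stored resource
print("modified:", datetime.fromtimestamp(info.st_mtime, tz=timezone.utc))
print("permissions:", stat.filemode(info.st_mode))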

Setting Up an IRM

It is not necessary to rely on Oracle for an IRM solution. In fact, the Oracle IRM only works with Microsoft Windows.

Each component can be paired with a reliable tool, most of which I have blogged about. A set of pairings might include:

  • Base application and resource management: Django with secure login
  • REST API resource access: Django OAuth Toolkit
  • Access management: Django OAuth Toolkit and a database system
  • Individual resource tokens: randomly generated and hashed keys
  • File storage: GlusterFS or an encryptable file system
  • Encryption of resources: PyCrypto or a similar tool
  • Firewalls: iptables or another firewall
  • Two-step verification through SMS: Twilio
  • Key storage: Stack Exchange Blackbox
  • VPN access: Firefox
  • Logging and anomaly detection: Elastic APM and the ELK Stack

Logging

Logging is critical to security. Logs allow administrators to spot harmful activity, generate statistical models based on usage, and aid in auditing the system.

Tokens

Tokens are a perfect solution for controlling document access in the system. They allow a user to gain access to a document and often carry scopes that grant different levels of access to a resource.

A user should be required to log in to the application to retrieve a token which refreshes on a regular schedule. These tokens can be revoked and changed by a resource owner or administrator much like using a file system.
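As a rough sketch, and not a description of any particular product, a resource token can be generated randomly and persisted only as a hash so that a database compromise does not expose usable tokens:

import hashlib
import secrets


def issue_token():
    # Random, URL-safe token handed to the user
    token = secrets.token_urlsafe(32)
    # Only the hash is stored server side; the storage layer is assumed elsewhere
    token_hash = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return token, token_hash


def verify_token(presented, stored_hash):
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    # Constant-time comparison to avoid timing attacks
    return secrets.compare_digest(candidate, stored_hash)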

Fernet Encryption

While RSA encryption is useful for asymmetric, two-way exchange, symmetric Fernet encryption is better suited to storing files. If a system does not offer encryption itself, the cryptography package provides a Fernet implementation.
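A minimal sketch of encrypting a stored file with Fernet follows; the file path and in-line key generation are illustrative only, since a real system would load the key from secure storage:

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # illustrative; normally loaded from a key store
f = Fernet(key)

with open("report.pdf", "rb") as source:  # hypothetical resource
    ciphertext = f.encrypt(source.read())

with open("report.pdf.enc", "wb") as target:
    target.write(ciphertext)

# Decryption simply reverses the process
plaintext = f.decrypt(ciphertext)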

Storing Keys

Keys should not be stored in the open. If compromised, it is extremely easy to gain access to a key stored in plain text. Instead, tools such as Stack Exchange’s Blackbox store keys in a system backed by a GPG key ring.

Two Step Downloading for Extra Security

Downloading a file in a secure manner might require extra protection, particularly when an external but trusted user desires access to a resource. To avoid spoofing and avoid a compromised computer from gaining access to a resource, two step verification is a recommended step.

In this process the external user provides an access token to obtain a document which is verified. On verification, a text message containing an access code is sent to the external user and the internal user is notified of the access. The external user enters the code and, if required, the resource owner or admin approves the download.

This type of process is not difficult to implement through desktop or web applications using push notifications or persistent storage.
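A rough sketch of that verification step, assuming Django's cache for short-lived codes and a hypothetical send_sms() helper (for example, a thin wrapper around Twilio's client), might look like this:

import secrets

from django.core.cache import cache


def start_two_step_download(user_id, phone_number, send_sms):
    # Six-digit, single-use code that expires after five minutes
    code = f"{secrets.randbelow(1_000_000):06d}"
    cache.set(f"download_code:{user_id}", code, timeout=300)
    send_sms(phone_number, f"Your download code is {code}")


def confirm_two_step_download(user_id, submitted_code):
    expected = cache.get(f"download_code:{user_id}")
    return expected is not None and secrets.compare_digest(expected, submitted_code)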

Conclusion

Secured yet accessible storage is a critical problem for any data analyst or scientist. Using an established IRM or implementing a similar tool helps secure access and empower analytics.

Better Key Storage With Blackbox, RSA, Redis, and the Fernet Algorithm in Django


Python lacks a proper key store. This is an unnerving issue when trying to build a secure application. More troubling is the plain-text storage of RSA keys. This article examines a process for storing keys in an encrypted manner with Blackbox, as well as caching keys using the Fernet algorithm and RSA encryption in Django, with Redis for speed.

Problem

Unlike Java, which has a key store, Python lacks the ability to store keys for data encryption. Python developers are left with only basic methods for storing keys, and this often means doing so in plain text.

That approach is inexplicably terrible when working with FERPA/HIPAA and especially the increasingly strict state guidelines for storing sensitive information.

Solution

One solution, of many, is to use Stack Exchange's Blackbox to store keys and the Fernet algorithm to encrypt the keys in a cache. In this way keys are stored in an encrypted format in a hidden file as well as in a secure format in memory.

Black Box

Stack Exchange's Blackbox offers a perfect storage solution for keys, using a GPG keyring to encrypt data. The tool was made to store secrets in a git repository.

Check out my Python API for reading files from Blackbox. It is possible to add a user to the administrator file in order to avoid entering a password each time.
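Absent that API, a small sketch of reading a Blackbox-encrypted secret follows; it assumes the blackbox scripts (including blackbox_cat) are installed and on the PATH, and the paths shown are placeholders:

import subprocess


def read_blackbox_secret(relative_path, repo_dir):
    # blackbox_cat decrypts a registered file to stdout using the GPG keyring
    result = subprocess.run(
        ["blackbox_cat", relative_path],
        cwd=repo_dir,
        capture_output=True,
        check=True,
        text=True,
    )
    return result.stdout


rsa_key = read_blackbox_secret("keys/rsa_private.pem", "/srv/secrets_repo")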

Storing Encrypted Keys in Django

Once the keys are encrypted and accessible, a large application needs to ensure speed. To help alleviate sluggishness, it is possible to store keys, encrypted with the Fernet algorithm, in any cache back end that Django provides (such as Redis).

It is possible to use the cryptography package for this task.

from cryptography.fernet import Fernet
from django.core.cache import cache

key = Fernet.generate_key()
f = Fernet(key)
token = f.encrypt(b"my deep dark secret")
cache.set('my_token', token)
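Continuing the snippet above, the encrypted token can later be read back from the cache and decrypted with the same Fernet key, which should itself live in secure storage rather than in code:

token = cache.get('my_token')
if token is not None:
    secret = f.decrypt(token)  # recovers b"my deep dark secret"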

Conclusion

It is possible to recreate a secure keystore using a mix of Stack Exchange Black Box and the Fernet algorithm when creating a Django application. The implementation above may not be production ready but is a proof of concept.