Bamboo and Webhooks: Automating CI and Deployment

process_auto

The name of the game in technology is doing more with less. It is the core element in the drive for profitability. In today’s market, that means finding ways to reduce both the number of employees in an organization and improve the processes that allow for such a reduction.

The days of paying $80,000 for an employee with limited knowledge, only ETL or only XML, are waning or dead. Even more importantly, the cost of software bugs in 2016 reached $1.1 trillion.

Reducing labor costs and reducing the cost of software bugs means producing efficient, clean, easy to follow, and effective systems backed by quality processes (word play with quality assurance intended). This article dives into how to use Bamboo, Flask, and Bitbucket to achieve continuous integration and continuous deployment to a test environment.

The concepts in this article can be ported to a distributed environment using ngrok but hopefully a more secure option is available. The ngrok solution is explained well on the Bitbucket website but promotes non-free software and external providers. This article intends to explain how to do this for free without linking to an external provider.

Environment

An automated test environment can be set up with:

  • Bitbucket
  • Flask
  • Bamboo
  • An Ubuntu or Linux Server for the purposes of this article

Bitbucket is a powerful repository based on git. Part of the repositories power is the ability to add webhooks. These are essentially triggers that perform POST requests to a webserver on specified events.

Webhooks require a web server. Flask is a lightweight and easy to maintain, single threaded framework. This is perfect for most repositories that do not see a significant number of pull requests.

Other hooks, including to Jenkins, are available as well.

Another powerful feature of Bamboo worth noting is the ability to organize repositories into projects. This suits companies with well defined architectures extremely well.

Bamboo is a continuous integration (CI) tool, a tool that allows code to be continually built and tested from a Bitbucket repository with minimal human interaction. This tool integrates directly with Jira, Bitbucket, and nearly every other Atlassian product.

Automating a System

Most companies with a forward facing website have production and test environments. Integrating and automating the environment is made possible with the use of a continuous integration (CI) environment and webhooks.

Bamboo can be set up to automatically synchronize code when a commit or merge is made in the remote repository. However, jobs in this CI run in an isolated way with no access to write to the local machine beyond SCP.

While a CI ensures that code runs correctly and can be set to notify users of failed builds, the synchronization issue outside of the isolated build environment is solved with webhooks. Calls to web servers specified in these hooks are triggered whenever a certain event is performed in the repository, allowing for more automated and instantaneous regression and integration testing.

Step 1: Setting Up SSH in Bitbucket

Any automated pull requires using ssh instead of http protocol as the means to obtain code from the remote repository. This requires setting up an ssh key on Bitbucket and on the local machine where your test environment is configured.

Setting up an rsa key on Ubuntu is simple:

  1. run ssh-keygen -t rsa
  2. enter only the name of the key
  3. copy the public key (.pub) to your clipboard
  4. create  or open the config file in your home’s .ssh directory
  5. add your bitbucket host on a new line Host <host ip address>
  6. on a new tab spaced line enter the name of your private key file generated in steps 1-2 as Identity File <private key file name>
  7. log in to Bitbucket
  8. click on your avatar
  9. click on Manage Account
  10. navigate to SSH Keys on the left hand menu
  11. click on add key towards the center of the page and simply paste your key

Step 2: Creating a Webhook in Flask

A webook can be setup simply in Flask. James Innes of Ogma Development wrote a terrific article on how to setup a webhook in Flask. After creating the project as specified, run the code using:

flask run

It is possible to change the application host by simply modifying app.run() in your hook as follows:

app.run(host=<host>, port=[unreserved port])

Note the host, port, and token if applicable when your application is running and perform a GET request with the appropriate token from the machine or container running your Bitbucket repository.

Step 3: Setting up Webhooks in Bitbucket

Webhooks in Bitbucket can occur at the project or repository level. For the best control, this article explains how to create a hook at the repository level. Imagine triggering a multitude of unnecessary webhooks when modifying a single repository.

Setting up a webhook on a Bitbucket repository is simple:

  1. Log in to Bitbucket
  2. navigate to your target project
  3. Navigate to your repository
  4. Click on the settings gear in the left hand menu
  5. Click on webhooks in the left hand menu
  6. Enter a name for your hook
  7. Enter the url of your webhook server (of the form http[s]://webhook[?verify_token=<token>] in this context)
  8. In the options menus, ensure that push and merged are clicked
  9. Click save at the bottom of the screen

Currently, Bitbucket tests these connections with an empty POST request. This is not handled by our code and empty POST requests are generally discouraged anyway.

It is now possible to simply execute a pull request from a module under the webhook function in your Flask application on every merge in or push to the repository. This can be used to synchronize code and even keep a test site constantly ready for testing if you are using Django. As Flask is single threaded, it is better suited for the task of creating hooks.

Create a Build in Bamboo

With our repository thoroughly connected to the remote server, we can create build plans using Bamboo. Bamboo organizes the build flow  underneath a plan into jobs which maintain tasks.

For an annual fee of just ten dollars, it is possible to obtain ten free jobs with an unlimited number of tasks.

Future articles will explain how to integrate applications built with Django, Celery, and PyTorch. Still, even databases created using Djang are recreatable using the only the models and shell commands from the Script task.

As stated before, Bamboo can be set up to notify an appropriate developer on a build failure.

Conclusion

With these elements in place, it is possible to keep running test code up to date and continue to run builds against this environment with minimal human interaction. This greatly reduces labor and thus cost which can be passed to the end user in the form of savings.

ETL 1 Billion Rows in 2.5 Hours Without Paying on 4 cores and 7gb of RAM

There are a ton of ETL tools in the world. Alteryx, Tableau, Pentaho. This list goes on. Out of each, only Pentaho offers a quality free version. Alteryx prices can reach as high as $100,000 per year for a six person company and it is awful and awfully slow. Pentaho is not the greatest solution for streaming ETL either as it is not reactive but is a solid choice over the competitors.

How then, is it possible to ETL large datasets, stream on the same system from a TCP socket, or run flexible computations at speed. Surprisingly, this article will describe how to do just that using Celery and a tool which I am currently working on, CeleryETL.

Celery

Python is clearly an easy language to learn over others such as Scala, Java, and, of course, C++. These languages handle the vast majority of tasks for data science, AI, and mathematics outside of specialized languages such as R. They are likely the front runners in building production grade systems.

In place of the actor model popular with other languages, Python, being more arcane and outdated than any of the popular languages, requires task queues. My own foray into actor systems in Python led to a design which was, in fact, Celery backed by Python’s Thespian.

Celery handles tasks through RabbitMQ or other brokers claiming that the former can achieve up to 50 million messages per second. That is beyond the scope of this article but would theoretically cause my test case to outstrip the capacity of my database to write records. I only hazard to guess at what that would do to my file system.

Task queues are clunky, just like Python. Still, especially with modern hardware, they get the job done fast, blazingly fast. A task is queued with a module name specified as modules are loaded into a registry at run time. The queues, processed by a distributed set of workers running much like an actor in Akka, can be managed externally.

Celery allows for task streaming through chains and chords. The technical documentation is quite extensive and requires a decent chunk of time to get through.

Processing at Speed

Processing in Python at speed requires little more than properly chunking operations, batching record processing appropriately to remove latency, and performing other simple tasks as described in the Akka streams documentation. In fact, I wrote my layer on Celery using the Akka streams play book.

The only truly important operation, chunk your records. When streaming over TCP, this may not be necessary unless TCP connections happen extremely rapidly. Thresholding in this case may be an appropriate solution. If there are more connection attempts than can be completed at once, buffer requests and empty the buffer appropriately upon completion of each chain. I personally found that a maximum bucket size of 1000 for typical records was appropriate and 100 for large records including those containing text blobs was appropriate.

Take a look at my tool for implementation. However, I was able to remap,  split fields to rows, perform string operations, and write to my Neo4J graph database at anywhere from 80,000 to 120,000 records per second.

Conclusion

While this article is shorter than my others, it is something I felt necessary to write in the short time I have to write it. This discovery allows me to write a single language system through Celery, Neo4J, Django, PyQt, and PyTorch for an entire company. That, is phenomenal and only rivaled by Scala which is, sadly, dying despite being a far superior, faster, and less arcane language. By all measures, Scala should have won over the data science community but people detest the JVM. Until this changes, there is Celery.