ETL 1 Billion Rows in 2.5 Hours Without Paying on 4 cores and 7gb of RAM

There are a ton of ETL tools in the world. Alteryx, Tableau, Pentaho. This list goes on. Out of each, only Pentaho offers a quality free version. Alteryx prices can reach as high as $100,000 per year for a six person company and it is awful and awfully slow. Pentaho is not the greatest solution for streaming ETL either as it is not reactive but is a solid choice over the competitors.

How then, is it possible to ETL large datasets, stream on the same system from a TCP socket, or run flexible computations at speed. Surprisingly, this article will describe how to do just that using Celery and a tool which I am currently working on, CeleryETL.

Celery

Python is clearly an easy language to learn over others such as Scala, Java, and, of course, C++. These languages handle the vast majority of tasks for data science, AI, and mathematics outside of specialized languages such as R. They are likely the front runners in building production grade systems.

In place of the actor model popular with other languages, Python, being more arcane and outdated than any of the popular languages, requires task queues. My own foray into actor systems in Python led to a design which was, in fact, Celery backed by Python’s Thespian.

Celery handles tasks through RabbitMQ or other brokers claiming that the former can achieve up to 50 million messages per second. That is beyond the scope of this article but would theoretically cause my test case to outstrip the capacity of my database to write records. I only hazard to guess at what that would do to my file system.

Task queues are clunky, just like Python. Still, especially with modern hardware, they get the job done fast, blazingly fast. A task is queued with a module name specified as modules are loaded into a registry at run time. The queues, processed by a distributed set of workers running much like an actor in Akka, can be managed externally.

Celery allows for task streaming through chains and chords. The technical documentation is quite extensive and requires a decent chunk of time to get through.

Processing at Speed

Processing in Python at speed requires little more than properly chunking operations, batching record processing appropriately to remove latency, and performing other simple tasks as described in the Akka streams documentation. In fact, I wrote my layer on Celery using the Akka streams play book.

The only truly important operation, chunk your records. When streaming over TCP, this may not be necessary unless TCP connections happen extremely rapidly. Thresholding in this case may be an appropriate solution. If there are more connection attempts than can be completed at once, buffer requests and empty the buffer appropriately upon completion of each chain. I personally found that a maximum bucket size of 1000 for typical records was appropriate and 100 for large records including those containing text blobs was appropriate.

Take a look at my tool for implementation. However, I was able to remap,  split fields to rows, perform string operations, and write to my Neo4J graph database at anywhere from 80,000 to 120,000 records per second.

Conclusion

While this article is shorter than my others, it is something I felt necessary to write in the short time I have to write it. This discovery allows me to write a single language system through Celery, Neo4J, Django, PyQt, and PyTorch for an entire company. That, is phenomenal and only rivaled by Scala which is, sadly, dying despite being a far superior, faster, and less arcane language. By all measures, Scala should have won over the data science community but people detest the JVM. Until this changes, there is Celery.

 

Opinion: Does the End of Net Neutrality Signal the Start of Public Internet

Companies are already selling overpriced Internet services with terrible performance and awful customer satisfaction ratings. The end of net neutrality as the rule of the land may exacerbate the problems that plague companies like Comcast. A lack of stable systems, a quilt of poorly deployed technologies, and the hell on earth it is to deal with the company’s technical nightmare (think MySQL database being used for all equipment monitoring) can be easily glossed over by simply forcing certain customers to pay more.

The end of net neutrality could become the equivalent of ObamaCare in the Trump era. It is a chance for another industry to gauge the private sector and the individual under the guise of open markets. The idea of an entirely open market itself was brushed aside by Adam Smith by the way.

In the midst of this nightmare, Longmont is providing a model for public broadband service at low prices. Despite the FCC ruling, the city’s NextLight broadband utility promises no tiered payment plans. Businesses, organizations, and even your local school will rejoice in this type of service. That is, of course, if you are lucky enough to live in Longmont.

NextLight is the type of service that provides impact and reduces the pressure Adam Smith referred to when disavowing companies such as the East India Company, claiming they could never really be a private venture. That is something that should be noticed in today’s world where services are increasingly produced by fewer and fewer companies whose increasing size makes Karl Marx look like a prophet.

Is this experiment profitable? The customer gets 1 gigabyte of download speeds for $45 per month. How, when a similar service costs over $100 per month from a competitor is this profitable? How can a $40.3 million dollar project ever pay off for poorer communities? How can the service be of quality? How can cities without the infrastructure Longmont had afford such a project?

The current project is funding at only 17 percent over the estimated cost. While small municipalities like Longmont do not post cost estimates, there is also the fact that in three to four years of operation, there have been no cost increases and no extra funding granted by taxation. The cost overrun is also relatively small for a public works project. In the same period, I personally have seen my internet services for 10mbs download speeds that rarely deliver their promised performance from Comcast go from $80 to $116 per month. Comcast does not even have the add-on features that $45 per month NextLight serves. Even Century Link offers incredibly slower services for $10 more dollars per month. Only Charter compares.

When it comes to poor communities, voting is more powerful at a local level. However, dealing with the misfortunes of rural and poor communities is an enormous problem in this country. Still, another source for funding may be technology companies themselves. Google and Facebook have already shown interest in divorcing the middle man from the process of getting their services to the consumer. Instead of balloons and airplanes, why not invest in public projects prior to the development of space delivered quantum communications.

It is also important to remember that, like many commonplace public projects, the middle class, Longmont at the beginning of NextLight, and rich, Longmont now due to the Californication of the Colorado Front Range, are often the jumping off point for all communities. Construction costs come down with demand, neighboring communities benefit, and knowledge is shared regarding operations.

As for the quality end. Longmont’s NextLight just beat out Google Fiber. The city is ranked third for customer service and Internet service. While a terrible argument, if you dislike your provider, you the taxpayer could literally fire them by voting.

Of more interesting note, municipalities are more likely to spur innovation as well in the search for better service at less cost. Denver based LayerTV just partnered with NextLight. Apparently, public Internet can actually increase competition.

So, will the end of net neutrality spur a public movement to reopen the Internet as an endless frontier of experimentation and cost savings compared to unregulated Internet? The future will tell.

Do you know the best part about NextLight? No unscheduled maintenance. It almost seems like Comcast violates other FCC rules regarding throttling and content provision already.