2019/2020 – The Years of Data Engineering (Opinion)

Photo by Flickr user Stuck in Customs/Creative Commons

The new year brings new hopes for all of the hottest technologies built over the past few years. The last few years were filled with visualization tools and frameworks. Tableau is now a household name, Salesforce is a workhorse for analytics, SAS continues to grow through JMP, and small players such as Panoply are well funded.

This underscores an important missing component in the data space: data management tools and frameworks are severely deficient. Many merely perform materialization.

That is changing this year and it means that data engineering will be an important term over the next few years. Automation will become a reality.

The Data Engineering Problem

Data engineers create pipelines. This means automating the handling of data from aggregation and ingestion to the modeling and reporting process.

Because data engineers cover the entire pipeline for your data and often implement analytics in a repeatable manner, data engineering is a broad discipline. Terms such as ETL, ELT, verification, testing, reporting, materialization, standardization, normalization, distributed programming, crontab, Kubernetes, microservices, Docker, Akka, Spark, AWS, REST, Postgres, Kafka, and statistics are commonly slung with ease by data engineers.

Until 2019, integrating systems often meant combining a variety of tools into a cluttered wreck. A company might deploy Python scripts for visualization, use Hitachi Vantara (formerly Pentaho) for ETL, combine a variety of aggregation tools through Kafka, keep data warehouses in PostgreSQL, and may even still use Microsoft Excel to store data.

The typical company spends $4,000 – $8,000 per employee providing these pipelines. This cost is unacceptable and can be avoided in the coming years.

ELT will Not Kill Data Engineers

ELT applications promise to get rid of data engineers, but that is pure nonsense meant to attract an ignorant investor's money:

  • ELT is often performed on data sources that already underwent ETL at the companies they were purchased from, such as Acxiom, Nasdaq, and TransUnion
  • ELT consumes significant resources, often limiting its use to small data sets
  • ELT ignores issues related to streaming from surveys and other sources which greatly benefit from the requirements analysis and transformations of ETL
  • ELT is horrible for integration tasks where data standards differ or are nonexistent
  • You cannot run good AI or build models on poorly or non-standardized data

This means that ETL will continue to be an important part of a data engineer's job.

Of course, since data engineers translate business analyst requirements into reality, the job will continue to be secure. Coding may become less important as new products are released but will never go away in the most efficient organizations.

The Limits of Python and GoLang Benefit the Data Engineering Stack

Many people point to Python as a means for making data engineers redundant. This is simply false.

Python is limited, and GoLang is only about 30 times faster than Python. This means that the JVM will rise in popularity with data scientists and even analysts as companies look to make money on the backs of their algorithms. This benefits data engineers, who are typically proficient in at least Java or Scala.

Python works for developers, analysts, and data scientists who want to control tools written in a more powerful language such as C++ or Java. Pentaho dabbled in this before being bought by Hitachi. However, being 60 times slower than the JVM and often requiring three times the resources, it is not an enterprise grade language.

Python does not provide power. It is poor at parallelism and effectively single threaded due to the global interpreter lock, while nearly any other language can achieve parallelism. Python falls back on heavyweight OS processes or threads to perform anything concurrently. This is horrendous.

Consider the case of using Python’s Celery versus Akka, a Scala and Java based tool. Celery and Akka perform the same tasks across a distributed system.

Parsing millions of records in Celery can quickly eat up more than fifty percent of a typical server's resources with a mere ten processes. RabbitMQ, the messaging framework behind Celery, can only parse 50 million messages per second on a cluster. Depending on the use case, Celery may also require Redis to run effectively. This means that an 18 logical core server with 31 gigabytes of RAM can be severely bogged down simply processing tasks.

Akka, on the other hand, is the basis for Apache Spark. It is lightweight and all inclusive. 50 million messages per second is attainable with 10 million actors running concurrently at much less than fifty percent of a typical server's resources. With not every use case requiring Spark, even in data engineering, this is an outstanding difference. Not requiring a message routing and results backend means that less skill is required for deployment as well.

The Rise, Fall, and Rise of Scala and Streamlining

When I started programming in Scala, the language was fairly unheard of. Many co-workers merely looked at this potent language as a curiosity. Eventually, Scala's popularity started to wane as Java developers remained focused on websites and ignored creating the same frameworks for Scala that exist in Python.

That is changing. With the rise of R, whose syntax is not far removed from Scala's, mathematicians and analysts are growing used to programming in increasingly complex languages.

Perhaps due to this, Scala is making it back into the lexicon of developers. Python's advantage shrank in 2017 as previously non-existent or non-production level tools were released for the JVM.

Consider what is now at or beyond version 1.0 in the Scala and wider JVM ecosystem:

  • ND4J and ND4S: Java and Scala based n-dimensional array frameworks that boast speeds faster than NumPy
  • DL4J: Skymind is a terrific company producing deep learning tools comparable to Torch
  • TensorFlow: provides a Java API that can be used from both Java and Scala
  • Neanderthal: A Clojure based linear algebra system that is blazing fast
  • OpenNLP: A framework that, unlike the Stanford NLP tools, is actively developed and includes named entity recognition and other powerful transformation tools
  • Bytedeco: This project is filled with angels (I actually think they came from heaven) whose innovative, nearly automated JNI generator, JavaCPP, has created links to everything from Python code to Torch, libpostal, and OpenCV
  • Akka: Lightbend continues to produce distributed tools for Scala, now with open sourced split-brain resolvers that go well beyond my majority resolver
  • MongoDB connectors: Python's MongoDB connectors are resource intensive due to the rather terrible nature of Python bytecode
  • Spring Boot: Scala and Java are interoperable, and benchmarks of Spring Boot show at least a 10,000 request per second improvement over Django
  • Apereo CAS: A single sign-on system that adds terrific security to disparate applications

Many of these frameworks are available in Java. Since Scala runs on the JVM and can call any Java library, the languages are interoperable. Scala is cleaner, functional, highly modular, and requires much less code than Java, which puts these tools within reach of analysts.

What does this mean for a data engineer?

It means attaining 1000 times the speed on a single machine with significant cost reduction and up to a 33 percent code reduction over Java. Some developers report a 20 percent speed boost over Java but that is likely due to poor coding practices.

It also means moving from millions of moving parts to a steadfast system.

The result is clear. My own company is switching away from Python everywhere except for our responsive, front end heavy web application, for a fifty percent reduction in hardware costs.

Putting it all Together to Unclutter a Mess

With everything that Scala and the JVM offer, data engineers now have a potent tool for automation. These valuable employees may not be creating the algorithms, but they will be transforming data in smart ways that produce real value.

Companies no longer have to rely on archaic languages to produce messy systems and this will translate directly into value. Data engineers will be behind this increase in value as they can more easily combine tools into a coherent and flexible whole.

Conclusion

The continued rise of JVM backed tools starting in 2018 will make data pipeline automation a significant part of a company's IT budget. Data engineers will be behind the evolution of data pipelines from disparate systems to a streamlined whole backed by custom code and new products.

2019 and 2020 will be the years of data engineering. After this, we may just be seeing the creation of Skynet.

Fluff Stuff: Better Governments, Better Processes, Simplr Insites

Cities are heading towards bankruptcy. The credit rating of Stockton, CA was downgraded. Harrisburg, PA is actually bankrupt. It is only a matter of time before Chicago implodes. Since 1995, city debt rose by between $1.3 and $1.8 trillion. While a large chunk of this cost is from waste, more is the result of using intuition instead of data when tackling infrastructure and new projects. Think of your city government as the boss who likes football more than his job, so he builds a football stadium even though your company is in the submarine industry.

This is not an unsolvable nightmare.

Take an effective use case where technologies and government processes were outsourced. As costs rose in Sandy Springs, GA, the city outsourced and achieved more streamlined processes, better technology, and lower costs. Today, without raising taxes, the city is in the green. While Sandy Springs residents are wealthy, even poorer cities can learn from this experience.

Running city projects in a properly scientific manner requires an immense amount of clean, quality, well-managed data organized into individual projects. With an average of $8,000 spent per employee on technology each year, and an immense effort required to hire analysts and modernize infrastructure, cities are struggling to modernize.

It is my opinion, one I am basing a company on, that providing quality data management, sharing and collaboration tools, IT infrastructure, and powerful project and statistical management systems in a single SaaS tool can greatly reduce the $8,000 per employee cost and improve budgets. These systems can even reduce the amount of administrative staff required, allowing money to flow to where it is needed most.

How can a collaborative tool tackle the cost problem? Through:

  • collaborative knowledge sharing of working, ongoing, and even failed solutions
  • public facing project blogs and information on organizations, projects, statistical models, results, and solutions that allow even non-mathematical members of an organization to learn about a project
  • a security minded institutional resource manager (IRM, better thought of as a large, securable, shared file storage system) capable of expanding into the petabytes while maintaining FERPA, HIPAA, and state and local regulations
  • the possibility to share data externally, keep it internal, or keep the information completely private while obfuscating names and other protected information
  • complexity analysis (graph based analysis) systems for people, projects, and organizations clustered for comparison
  • strong comparison tools
  • potent and learned aggregation systems with validity in mind ranging from streamed data from sensors and the internet to surveys to uploads
  • powerful drag and drop integration and ETL with mostly automated standardization
  • deep diving upload, data set, project, and model exploration using natural language searching
  • integration with everything from a phone to a tablet to a powerful desktop
  • access controls for sharing the bare minimum amount of information
  • outsourced IT infrastructure including infrastructure for large model building
  • validation using proven algorithms and increased awareness of what that actually means
  • anomaly detection
  • organization of models, data sets, people, and statistical elements into a single space for learning
  • connectors to popular visualizers such as Tableau and Vantara with a customizable dashboard for general reporting
  • downloadable data sets, with two-entity verification if required, that can be streamed or used in Python and R

Tools such as ours can reduce the cost of IT by as much as 65%. We aim to eliminate much of the waste in the data science pipeline while staying as simple as possible.

We should consider empowering and streamlining the companies, non-profits, and government entities such as schools and planning departments that perform vital services before our own lives are greatly affected. Debt and credit are not solutions to complex problems.

Take a look, or don’t. This is a fluff piece on something I am passionately building. Contact us if you are interested in a beta test.

The Case for Using an IRM to Scale Data Intake

Among many, there are three major problems faced by an analyst before data is useful:

  • data aggregation and storage
  • data security and access
  • data wrangling (ETL/ELT)

This article deals with data security and access using an information resource management (IRM) system. My own company, Simplr Insites LLC, is writing such a system alongside a file storage solution in an effort to modernize the research process.

Problem

One significant problem faced in research and cooperation is the attainment of clean and useful data. Obtaining this data often means gaining access to systems, forming legal agreements, obfuscating certain data, and embarking on the painful process of data wrangling.

While ETL and ELT are critical steps, just obtaining sensitive data, even from within an organization, is tricky. Consider the following cases related directly to access:

  • data sets include confidential information
  • data sets are ensnared in legal agreements regarding who can access data
  • users want to control access to data to ensure it is not misused
  • external users are allowed varying degrees of access

IRM as a Solution

Oracle built a solution that attempts to tackle the data security issue. The Oracle IRM documentation provides a rather informative graphical overview of the tool:

[Oracle IRM architecture diagram]

In this system, an external user accesses a load balanced IRM server application which controls rights and access to different resources and files. Several firewalls help to improve security along with authentication, access grants, and encryption. Web services and internal users utilize the IRM server as well.

Beyond the visible components, tokens can be used to instantly manage resources and propagate access changes.

Most file systems also offer the capability to pull the date when a resource was created or modified and various permissions information. This is useful for logging purposes.
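
As a small illustration, the standard library alone can surface that metadata for an audit trail; a minimal sketch in Python, with a placeholder path:

    # Minimal sketch: pull modification time and permission bits for logging.
    # Note: st_ctime is inode change time on Unix, creation time on Windows.
    import os
    import stat
    from datetime import datetime, timezone

    def resource_metadata(path):
        st = os.stat(path)
        return {
            "path": path,
            "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(),
            "changed": datetime.fromtimestamp(st.st_ctime, tz=timezone.utc).isoformat(),
            "permissions": stat.filemode(st.st_mode),  # e.g. '-rw-r--r--'
            "owner_uid": st.st_uid,
        }

    print(resource_metadata("/srv/irm/reports/example.csv"))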

Setting Up an IRM

It is not necessary to rely on Oracle for an IRM solution. In fact, the Oracle IRM only works with Microsoft Windows.

Each component can be paired with a reliable tool, most of which I have blogged about. A set of pairings might include:

  • Base application and resource management: Django with secure login
  • REST API resource access: Django OAuth Toolkit
  • Access management: Django OAuth Toolkit and a database system
  • Individual resource tokens: randomly generated and hashed keys
  • File storage: GlusterFS or an encryptable file system
  • Encryption of resources: PyCrypto or a similar tool
  • Firewalls: iptables or another firewall
  • Two step verification through SMS: Twilio
  • Key storage: Stack Exchange Blackbox
  • VPN access: Firefox
  • Logging and anomaly detection: Elastic APM and the ELK Stack

Logging

Logging is critical to security. Logs allow administrators to spot harmful activity, generate statistical models based on usage, and aid in auditing the system.
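
A minimal sketch of what that can look like, assuming JSON-line access logs that a stack like ELK could later ingest; the logger name, file name, and fields are illustrative:

    # Minimal sketch: structured (JSON line) access logging for auditing and
    # later anomaly detection. Field names are illustrative.
    import json
    import logging

    access_log = logging.getLogger("irm.access")
    handler = logging.FileHandler("access.log")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    access_log.addHandler(handler)
    access_log.setLevel(logging.INFO)

    def log_access(user, resource, action, allowed):
        access_log.info(json.dumps({
            "user": user,
            "resource": resource,
            "action": action,
            "allowed": allowed,
        }))

    log_access("jdoe", "reports/example.csv", "download", True)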

Tokens

Tokens are a perfect solution for controlling document access in the system. They allow a user to gain access to a document and carry scopes that grant specific levels of access to a resource.

A user should be required to log in to the application to retrieve a token, which refreshes on a regular schedule. These tokens can be revoked and changed by a resource owner or administrator, much like permissions on a file system.
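
A minimal sketch of the idea, in the spirit of the randomly generated and hashed key pairing above; the in-memory store, scopes, and lifetime are hypothetical, and a real system would persist the hashes in a database:

    # Minimal sketch: issue a random resource token, store only its hash, and
    # check scope and expiry on each use. Names and lifetimes are hypothetical.
    import hashlib
    import secrets
    from datetime import datetime, timedelta, timezone

    TOKEN_TTL = timedelta(hours=1)
    _token_store = {}  # sha256(token) -> (user, resource, scopes, expiry)

    def issue_token(user, resource, scopes):
        raw = secrets.token_urlsafe(32)  # shown to the user once, never stored
        hashed = hashlib.sha256(raw.encode()).hexdigest()
        _token_store[hashed] = (user, resource, set(scopes), datetime.now(timezone.utc) + TOKEN_TTL)
        return raw

    def check_token(raw, resource, scope):
        entry = _token_store.get(hashlib.sha256(raw.encode()).hexdigest())
        if entry is None:
            return False
        _user, res, scopes, expiry = entry
        return res == resource and scope in scopes and datetime.now(timezone.utc) < expiry

    def revoke_token(raw):
        _token_store.pop(hashlib.sha256(raw.encode()).hexdigest(), None)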

Fernet Encryption

While RSA is useful for asymmetric, two way encryption, symmetric Fernet encryption (AES based, with HMAC authentication) is faster and better suited to encrypting stored files. If a file system does not offer encryption, Python's cryptography package provides a Fernet implementation.
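
A minimal sketch using the cryptography package; the file names are placeholders, and key handling is covered in the next section:

    # Minimal sketch: encrypt a stored file with Fernet (symmetric, authenticated).
    # File names are placeholders; keep the key out of plain text (see below).
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()
    fernet = Fernet(key)

    with open("dataset.csv", "rb") as f:
        ciphertext = fernet.encrypt(f.read())

    with open("dataset.csv.enc", "wb") as f:
        f.write(ciphertext)

    # Later, with the same key:
    plaintext = Fernet(key).decrypt(ciphertext)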

Storing Keys

Keys should not be stored in the open; a key stored in plain text is trivially compromised. Instead, tools such as Stack Exchange's Blackbox store keys in a system backed by a GPG keyring.

Two Step Downloading for Extra Security

Downloading a file in a secure manner might require extra protection, particularly when an external but trusted user desires access to a resource. To avoid spoofing and to prevent a compromised computer from gaining access to a resource, two step verification is recommended.

In this process, the external user presents an access token for the document, which is verified. On verification, a text message containing an access code is sent to the external user and the internal user is notified of the access. The external user enters the code and, if required, the resource owner or an admin approves the download.

This type of process is not difficult to implement through desktop or web applications using push notifications or persistent storage.
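
A minimal sketch of that flow, assuming Twilio for SMS as in the pairing list earlier; the credentials, phone numbers, and in-memory store of pending downloads are placeholders, and the owner approval step is left as a comment:

    # Minimal sketch of the two step download flow. Credentials, numbers, and
    # the pending-request store are placeholders.
    import secrets
    from twilio.rest import Client

    sms = Client("ACCOUNT_SID", "AUTH_TOKEN")
    _pending = {}  # download_id -> one-time code

    def start_download(download_id, external_phone):
        code = f"{secrets.randbelow(10**6):06d}"  # six digit one-time code
        _pending[download_id] = code
        sms.messages.create(
            body=f"Your download code is {code}",
            from_="+15550000000",
            to=external_phone,
        )
        # A full system would also notify the internal resource owner here and,
        # if required, wait for the owner or an admin to approve the download.

    def confirm_download(download_id, submitted_code):
        return _pending.pop(download_id, None) == submitted_code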

Conclusion

Secured yet accessible storage is a critical problem for any data analyst or scientist. Using an established IRM or implementing a similar tool helps secure access and empower analytics.

The Case for Microservices: Where to Segment

There is a growing need for microservices and shared services in the increasingly complex and vibrant set of technologies a true IT firm runs. Licensing, authentication, database services, ETL, reporting, analytics, information management, and the plethora of tasks being completed on the backend are impossible to accomplish in only a single application.

This article examines boundaries discovered through my own company’s experience in building microservice related applications.

Related Articles:

Discovering Sharable Resources in a Microservices Environment

Security in a Microservices Environment

Segment On Need and Resource Usage

To be fair, where segmentation of systems occurs depends on the need for each service. Clients may need a core set of tasks to be completed in one area or another. Where those needs diverge is a perfect boundary for establishing a service.

For instance, our clients need ETL, secured cloud file storage, data sharing, text management, FERPA/HIPAA and legally compliant storage of data, analytics, data streaming, surveying, and reporting. Each of these areas is covered by one company or another, but it is cheaper to do them under a single roof, to the tune of $7,000 in savings per employee per year at a small to medium sized company.

Our boundaries are specified directly around needs, security, and resource costs. ETL forms one boundary due to computation costs, cloud storage another for security reasons, legally compliant logging another, analytics yet another due to computational costs, stream and survey intake with initial analysis a more vulnerable piece of its own, and reporting the last. Overlapping everything is a service for authorization and authentication of access rights through OAuth2.

The different services were chosen for one of the following factors:

  • resource cost
  • shared tasks and resources
  • legal compliance and security

Segmenting for Security

The modern world is growing increasingly security and privacy conscious. Including authentication systems and the storage of information on the same system as a web server is not recommended.

Microservices allow for individual applications to be separated and controlled. Access can be granted to specific clusters based on a firewall and authentication. Even user access control is easier to maintain. Hardware boundaries can be easily established between vulnerable pieces of a system.

Essentially, never stick a vulnerable frontend, streaming, or survey application on the same hardware as your potentially identifying initial file storage and always have some sort of authentication and access rights mechanism.
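
As an illustration of that last point, each service can verify a caller's bearer token against the central authorization service using standard OAuth2 token introspection; the endpoint, client credentials, and scope name below are hypothetical:

    # Minimal sketch: OAuth2 token introspection (RFC 7662) from inside a
    # microservice. The endpoint, client credentials, and scope are hypothetical.
    import requests

    def request_allowed(bearer_token, required_scope):
        resp = requests.post(
            "https://auth.internal/oauth2/introspect",
            data={"token": bearer_token},
            auth=("reporting-service", "service-secret"),
            timeout=5,
        )
        info = resp.json()
        return info.get("active", False) and required_scope in info.get("scope", "").split()

    # e.g. reject a report download unless the token carries the right scope:
    # if not request_allowed(token, "reports:read"): deny the request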

Results

Our boundaries are helping us scale. Simplr Insites, LLC dedicates individual resources as needed to each service. It also allows the company to more easily offer a pricing scheme with variable levels of service tailored to a customer's needs.

Some clients do not need an ETL system and only want case note management. That is possible. At the same time, granting GPU resources to the analytics cluster while giving our reporting cluster more RAM is as well.

In essence, Simplr Insites was able to reduce the cost of running systems in a 42U shared space, possibly by as much as $5,000 per month for our small company, while remaining more secure and delivering faster, tailored results based on the needs of clients through a single web frontend based SaaS application.

Conclusion

Discovering where to place microservice boundaries is critical to the success of an application. It relies on many factors ranging from resource cost, to the ability to share resources, and even legal compliance and security. Appropriate splitting of services can reduce cost and increase speed.

ETL 1 Billion Rows in 2.5 Hours Without Paying, on 4 Cores and 7 GB of RAM

There are a ton of ETL tools in the world: Alteryx, Tableau, Pentaho; the list goes on. Of these, only Pentaho offers a quality free version. Alteryx prices can reach as high as $100,000 per year for a six person company, and it is awful and awfully slow. Pentaho is not the greatest solution for streaming ETL either, as it is not reactive, but it is a solid choice over the competitors.

How, then, is it possible to ETL large datasets, stream on the same system from a TCP socket, or run flexible computations at speed? Surprisingly, this article will describe how to do just that using Celery and a tool I am currently working on, CeleryETL.

Celery

Python is clearly an easier language to learn than others such as Scala, Java, and, of course, C++. Those heavier languages handle the vast majority of tasks for data science, AI, and mathematics outside of specialized languages such as R, and they are likely the front runners for building production grade systems.

In place of the actor model popular with other languages, Python, being more arcane and outdated than any of the popular languages, requires task queues. My own foray into actor systems in Python led to a design which was, in fact, Celery backed by Python’s Thespian.

Celery handles tasks through RabbitMQ or other brokers, with claims that the former can achieve up to 50 million messages per second on a cluster. That is beyond the scope of this article but would theoretically cause my test case to outstrip the capacity of my database to write records. I only hazard to guess at what that would do to my file system.

Task queues are clunky, just like Python. Still, especially with modern hardware, they get the job done fast, blazingly fast. A task is queued with a module name specified, as modules are loaded into a registry at run time. The queues, processed by a distributed set of workers running much like actors in Akka, can be managed externally.

Celery allows for task streaming through chains and chords. The technical documentation is quite extensive and requires a decent chunk of time to get through.
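
A minimal sketch of what chains and chords look like in practice; the broker and backend URLs, file names, and task bodies are placeholders rather than CeleryETL itself:

    # Minimal sketch: a Celery chain pushes a batch through extract -> transform
    # -> load, and a chord fans out batches then reduces with a callback.
    from celery import Celery, chain, chord

    app = Celery("etl", broker="amqp://guest@localhost//", backend="redis://localhost:6379/0")

    @app.task
    def extract(path):
        with open(path) as f:
            return [{"raw": line.rstrip("\n")} for line in f]

    @app.task
    def transform(records):
        return [{"clean": r["raw"].strip().lower()} for r in records]

    @app.task
    def load(records):
        return len(records)  # a real task would write to the database here

    @app.task
    def summarize(counts):
        return sum(counts)

    # One batch through the pipeline:
    chain(extract.s("batch_001.csv"), transform.s(), load.s()).delay()

    # Many batches, reduced by a callback once all of them finish:
    chord(chain(extract.s(p), transform.s(), load.s()) for p in ["a.csv", "b.csv"])(summarize.s())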

Processing at Speed

Processing in Python at speed requires little more than properly chunking operations, batching record processing appropriately to remove latency, and performing other simple tasks as described in the Akka Streams documentation. In fact, I wrote my layer on Celery using the Akka Streams playbook.

The only truly important operation is chunking your records. When streaming over TCP, this may not be necessary unless connections arrive extremely rapidly; thresholding may be the appropriate solution in that case. If there are more connection attempts than can be completed at once, buffer the requests and empty the buffer upon completion of each chain. I personally found that a maximum bucket size of 1000 was appropriate for typical records and 100 for large records, including those containing text blobs.
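
A minimal sketch of that bucketing step; the batch generator below would feed tasks such as the transform task in the previous sketch:

    # Minimal sketch: bucket records before queuing so each task processes a
    # full batch (1000 typical records, 100 for large text-heavy records).
    from itertools import islice

    def buckets(records, size=1000):
        it = iter(records)
        while True:
            batch = list(islice(it, size))
            if not batch:
                break
            yield batch

    # Each bucket becomes a single queued task, e.g. transform.delay(batch)
    # with the Celery task from the previous sketch.
    for batch in buckets(({"id": i} for i in range(10_000)), size=1000):
        pass  # transform.delay(batch)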

Take a look at my tool for the implementation details. With it, I was able to remap, split fields to rows, perform string operations, and write to my Neo4j graph database at anywhere from 80,000 to 120,000 records per second.
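
For the write side, batching with UNWIND is the usual way to keep Neo4j round trips cheap; a minimal sketch with placeholder connection details and record shape, not the exact queries my tool runs:

    # Minimal sketch: batched Neo4j writes with UNWIND. URI, credentials, and
    # record shape are placeholders.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def write_batch(rows):
        with driver.session() as session:
            session.run(
                "UNWIND $rows AS row "
                "MERGE (r:Record {id: row.id}) "
                "SET r.value = row.value",
                rows=rows,
            )

    write_batch([{"id": 1, "value": "a"}, {"id": 2, "value": "b"}])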

Conclusion

While this article is shorter than my others, it is something I felt necessary to write in the short time I have to write it. This discovery allows me to build a single language system for an entire company with Celery, Neo4j, Django, PyQt, and PyTorch. That is phenomenal and only rivaled by Scala, which is, sadly, dying despite being a far superior, faster, and less arcane language. By all measures, Scala should have won over the data science community, but people detest the JVM. Until this changes, there is Celery.


Messaging, ETL, and an Akka Proposal

Data sources are multiplying. NoSQL can help aggregate multiple sources into a more coherent whole. Akka, which can split work across multiple sources, serves as a perfect way of writing distributed systems. The combination of messaging via queues or topics with the master-slave pattern could provide a significant boost to ETL. Using databases as messaging systems, it is easy to see how processes can kick-start. My goal is to create a highly concurrent system that takes data from a scraper (from any source, as can be done with my Python crawl modules), writes the data to a NoSQL style JSONB store in PostgreSQL, and notifies a set of parsers which then look at patterns in the data to determine how to ETL it. This is not really revolutionary, but it is a good test of concurrency and automation.

Results will be reported.

Collection with NoSQL and Storage with SQL

There are four well known forms of NoSQL databases: key-value, document, column-family, and graph. In the case of ETL, a key-value store is a good way to expand data without worrying about what, if anything, is present. However, even in denormalized form, this is not the best storage solution for customer facing use. Therefore, data will be placed into a client facing database configured with relational PostgreSQL tables.
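
As a sketch of the landing step, scraped documents can go straight into a JSONB column and be reshaped into relational tables later; the table, connection string, and payload below are placeholders:

    # Minimal sketch: land raw scraped documents in a PostgreSQL JSONB staging
    # table before ETL. Connection details and the payload are placeholders.
    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=staging user=etl password=secret host=localhost")

    with conn, conn.cursor() as cur:
        cur.execute(
            """
            CREATE TABLE IF NOT EXISTS raw_documents (
                id BIGSERIAL PRIMARY KEY,
                source TEXT,
                payload JSONB,
                ingested_at TIMESTAMPTZ DEFAULT now()
            )
            """
        )
        cur.execute(
            "INSERT INTO raw_documents (source, payload) VALUES (%s, %s)",
            ("crawler", Json({"url": "https://example.com", "title": "Example"})),
        )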

Messaging and Building Patterns for Akka and Scala

With messaging and state machines, the actual uses for an actor do not need to be known ahead of time. At runtime, interactions or patterns force the actor to take on a different role. This can be accomplished with a simple pattern match (case) statement. From there, a message with the data to be transformed can be passed to an actor. This data, tagged with a row ID, can then be parsed after an actor reads a message from a queue. The queue specifies conditions, such as which parser-combinator to use, and the actor completes an activity based on this. This is not incredibly different from the message slip pattern, except that no re-routing occurs.

The data would be aggregated using the available row IDs in batches of a certain size. Perhaps batch iterators would best determine the size of each batch to process.

Returning Data back to the original Actor

Returning the data requires messaging as well. The message returns to the initial actor, where it needs to be matched with the appropriate row.

Recap

To recap, the question is: can Akka perform more generic ETL than the currently available open source tools?

To test this question, I am developing Akka ETL. The tool will take in scraped data (from processes that can be managed with the same messaging technique but are not easily distributed due to statefulness and security). The design includes taking in completed sources from a database, acquiring the data, messaging an actor with the appropriate parsing information, receiving the transformed data from these actors, and posting it to a relational database.

The real tests will be maintaining data-deduplication, non-mixed data, and unique identifiers.