2019 Trend: Data Engineering Becomes a Household Name

There will be many 2019 trends that last well beyond the year. With Tableau now a household name, Salesforce a workhorse for analytics, SAS continuing to grow through JMP, and small players such as Panoply acquiring funding, one hot 2019 trend in technology will be data engineering.

This points to a massive problem in the field of data: data management tools and frameworks are severely deficient. Many merely perform materialization.

That is changing this year, and it means that data engineering will be an important term over the next few years. Automation will become a reality.

What is a data engineer?

Data engineers create pipelines. That means automating the handling of data all the way from aggregation and ingestion through modeling and reporting. These professionals handle big data as well as small data loads, with streaming playing an important role in their work.

Because data engineers cover the entire pipeline for your data and often implement analytics in a repeatable manner, the job is broad. Terms such as ETL, ELT, verification, testing, reporting, materialization, standardization, normalization, distributed programming, crontab, Kubernetes, microservices, Docker, Akka, Spark, AWS, REST, Postgres, Kafka, and statistics are slung with ease by data engineers.

Until 2019, integrating systems often meant combining a variety of tools into a cluttered wreck. A company might deploy Python scripts for visualization, Hitachi Vantara (formerly Pentaho) for ETL, feed a variety of aggregation tools into Kafka, keep data warehouses in PostgreSQL, and may even still use Microsoft Excel to store data.

The typical company spends $4000 – $8000 per employee maintaining these pipelines. This cost is unacceptable and can be avoided in the coming years.

Why won’t ELT kill data engineers?

ELT applications promise to get rid of data engineers. That is pure nonsense meant to attract ignorant investors’ money:

  • ELT is horrible for big data with continuous ETL proving more effective
  • ELT is often performed on data sources that already underwent ETL at the companies they were purchased from, such as Acxiom, Nasdaq, and TransUnion
  • ELT eats resources in a significant way and often limits its use to small data sets
  • ELT ignores issues related to streaming from surveys and other sources which greatly benefit from the requirements analysis and transformations of ETL
  • ELT is horrible for integration tasks where data standards differ or are non-existent
  • You cannot run good AI or build models on poorly or non-standardized data

This means that ETL will continue to be a major part of a data engineer’s job.

Of course, since data engineers translate business analyst requirements into reality, the job will continue to be secure. Coding may become less important as new products are released but will never go away in the most efficient organizations.

Why is Python likely to become less popular?

Many people point to Python as a means for making data engineers redundant. This is simply false.

Python is limited. This means that the JVM will rise in popularity with data scientists and even analysts as companies look to make money on the backs of their algorithms. This benefits data engineers, who are typically proficient in at least Java, Go, or Scala.

Python works for developers, analysts, and data scientists who want to control tools written in a more powerful language such as C++ or Java. Pentaho experimented with the language around the time it was bought by Hitachi. However, at times running 60 times slower than the JVM and often requiring three times the resources, it is not an enterprise-grade language.

Python does not provide power. CPython’s global interpreter lock (GIL) means that only one thread executes bytecode at a time, so true parallelism for CPU-bound work requires spawning heavyweight OS processes. For a language positioned as a data workhorse, this is horrendous.
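
As a rough sketch of the point (timings are machine-dependent and purely illustrative), the CPU-bound function below gains almost nothing from a thread pool but does scale with a process pool:

    import time
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


    def cpu_bound(n: int) -> int:
        # Pure-Python arithmetic; the GIL is held for the entire loop.
        return sum(i * i for i in range(n))


    def timed(executor_cls, workers: int = 4, n: int = 2_000_000) -> float:
        start = time.perf_counter()
        with executor_cls(max_workers=workers) as pool:
            list(pool.map(cpu_bound, [n] * workers))
        return time.perf_counter() - start


    if __name__ == "__main__":
        # Threads share one interpreter, so the GIL serializes the work;
        # processes sidestep the GIL at the cost of heavier startup and memory.
        print("threads:  ", timed(ThreadPoolExecutor))
        print("processes:", timed(ProcessPoolExecutor))

On a typical multi-core machine the process pool finishes several times faster, and that extra memory and startup cost is exactly the overhead being criticized here.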

Consider the case of Python’s Celery versus Akka, a Scala and Java-based toolkit. Both distribute the same kinds of tasks across a system.

Parsing millions of records in Celery can quickly eat up more than fifty percent of a typical server’s resources with a mere ten worker processes. RabbitMQ, the message broker usually sitting behind Celery, peaks at roughly one million messages per second, and that on a sizeable cluster. Depending on the use case, Celery may also require Redis as a results backend to run effectively. This means that an 18-logical-core server with 31 gigabytes of RAM can be severely bogged down processing tasks.
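
As a minimal sketch of the kind of setup described here (the broker URL, backend, and task body are illustrative assumptions), even a trivial transformation in Celery drags in a broker, a results backend, and a pool of OS-level worker processes:

    # tasks.py: a minimal Celery app of the kind described above.
    from celery import Celery

    app = Celery(
        "pipeline",
        broker="amqp://guest@localhost//",        # RabbitMQ broker
        backend="redis://localhost:6379/0",       # Redis results backend
    )


    @app.task
    def parse_record(raw: str) -> dict:
        # Even this tiny transformation pays for broker round-trips,
        # serialization, and a pool of worker processes.
        fields = raw.split(",")
        return {"id": fields[0], "value": float(fields[1])}

    # Run workers with:   celery -A tasks worker --concurrency=10
    # Enqueue work with:  parse_record.delay("42,3.14")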

Akka, on the other hand, formed the backbone of Apache Spark’s internal messaging in its early versions. It is lightweight and all-inclusive: 50 million messages per second are attainable on a single machine, with 10 million actors running concurrently, at much less than fifty percent of a typical server’s resources. With not every use case requiring Spark, even in data engineering, this is an outstanding difference. Not needing a separate message broker and results backend also means that less skill is required for deployment.

Will Scala become popular again?

When I started programming in Scala, the language was fairly unheard of. Many co-workers merely looked at this potent language as a curiosity. Eventually, Scala’s popularity started to wane as Java developers stayed focused on websites and never built the same frameworks for Scala that exist in Python.

That is changing. With the rise of R, whose syntax is incredibly similar to Scala, mathematicians and analysts are gaining skill in increasingly complex languages.

Perhaps due to this, Scala is making its way back into the lexicon of developers. Python’s advantage shrank in 2017 as previously non-existent or non-production-ready tools were released for the JVM.

Consider what has now reached at least version 1.0 on the JVM:

  • ND4J and ND4S: a Java and Scala n-dimensional array framework that boasts speeds faster than NumPy
  • DL4J: Skymind is a terrific company producing deep learning tools comparable to Torch
  • TensorFlow: ships a Java API that can be used from Scala
  • Neanderthal: a Clojure-based linear algebra library that is blazing fast
  • OpenNLP: an Apache framework that, unlike the GPL-licensed Stanford NLP tools, is permissively licensed and actively developed, and includes named entity recognition and other powerful transformation tools
  • Bytedeco: this project is filled with angels (I actually think they came from heaven) whose innovative and nearly automated JNI generator has produced bindings to everything from Python itself to Torch, libpostal, and OpenCV
  • Akka: Lightbend continues to produce distributed tools for Scala, with now open-sourced split-brain resolvers that go well beyond my majority resolver
  • MongoDB connectors: Python’s MongoDB connectors are resource intensive due to the rather inefficient nature of Python bytecode; the Java and Scala drivers avoid that overhead
  • Spring Boot: Scala and Java are interoperable, and benchmarks of Spring Boot show at least a 10,000 request-per-second improvement over Django
  • Apereo CAS: A single sign-on system that adds terrific security to disparate applications

Many of these frameworks are available in Java. Since Scala runs on the JVM and can call any Java library, the languages are interoperable. Scala is cleaner, functional, highly modular, and requires much less code than Java, which puts these tools within the reach of analysts.

What do new tools mean for a data engineer?

The new Java, Scala, and Go tools and frameworks mean that attaining as much as 1,000 times the speed of Python on a single machine, with a significant cost reduction, is possible. It also means chaining millions of moving parts into a solid microservices architecture instead of a cluttered monolithic wreck.

The result is clear. My own company is switching away from Python everywhere except our responsive, front-end-heavy web application, for a fifty percent reduction in hardware costs.

How will new tools help data engineers?

With everything that Scala and the JVM offer, data engineers now have a potent toolkit for automation. These valuable employees may not be creating the algorithms, but they will be transforming data in smart ways that produce real value.

Companies no longer have to rely on archaic languages to produce messy systems, and this will translate directly into value. Data engineers will be behind this increase in value as they can more easily combine tools into a coherent and flexible whole.

Conclusion

The continued rise of JVM-backed tools that began in 2018 will make data pipeline automation a significant part of a company’s IT spend. Data engineers will be behind the evolution of data pipelines from disparate systems to a streamlined whole backed by custom code and new products.

Data engineering will be a hot 2019 trend. After this year, we may just be seeing the creation of Skynet.


The Case for Using an IRM to Scale Data Intake

Among the many problems an analyst faces before data is useful, three stand out:

  • data aggregation and storage
  • data security and access
  • data wrangling (ETL/ELT)

This article deals with data security and access using an information resource management (IRM) system. My own company, Simplr Insites LLC, is writing such a system alongside a file storage solution in an effort to modernize the research process.

Problem

One significant problem faced in research and collaboration is obtaining clean and useful data. Doing so often means gaining access to systems, forming legal agreements, obfuscating certain data, and embarking on the painful process of data wrangling.

While ETL and ELT are critical steps, just obtaining sensitive data, even from within an organization, is tricky. Consider the following cases related directly to access:

  • data sets include confidential information
  • data sets are ensnared in legal agreements regarding who can access data
  • users want to control access to data to ensure it is not misused
  • external users are allowed varying degrees of access

IRM as a Solution

Oracle developed a solution that attempts to tackle the data security issue. The Oracle IRM documentation provides a rather informative graphical overview of the tool:

[Figure: Oracle IRM architecture overview]

In this system, an external user accesses a load-balanced IRM server application, which controls rights and access to different resources and files. Several firewalls improve security, along with authentication, access grants, and encryption. Web services and internal users utilize the IRM server as well.

Beyond the visible components, tokens can be used to instantly manage resources and propagate access changes.

Most file systems also make it possible to pull the date when a resource was created or modified, along with permission information. This is useful for logging purposes.
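
As a small illustration (the path is hypothetical), Python’s standard library exposes this metadata directly:

    from pathlib import Path
    import stat
    import time

    path = Path("reports/q3_summary.csv")   # hypothetical resource path
    info = path.stat()

    # Modification time and permission bits are enough for a basic audit entry.
    print("modified:", time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(info.st_mtime)))
    print("mode:    ", stat.filemode(info.st_mode))
    print("owner:   ", info.st_uid)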

Setting Up an IRM

It is not necessary to rely on Oracle for an IRM solution. In fact, the Oracle IRM only works with Microsoft Windows.

Each component can be paired with a reliable tool, most of which I have blogged about. A set of pairings might include the following (a small sketch of the REST access layer follows the list):

  • Base Application and Resource Management: Django with secure login
  • REST API Resource Access: Django OAuth Toolkit
  • Access Management: Django OAuth Toolkit and a database system
  • Individual Resource Tokens: randomly generated and hashed keys
  • File Storage: GlusterFS or an encryptable file system
  • Encryption of Resources: PyCrypto or a similar tool
  • Firewalls: iptables or another firewall
  • Two-Step Verification through SMS: Twilio
  • Key Storage: Stack Exchange Blackbox
  • VPN Access: Firefox
  • Logging and Anomaly Detection: Elastic APM and the ELK Stack
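
A minimal sketch of the “REST API Resource Access” pairing with Django OAuth Toolkit might look like the following; the scope name, URL wiring, and lookup_resource helper are illustrative assumptions rather than part of any particular system:

    # views.py: protect a download endpoint with an OAuth2 token scope.
    from django.http import FileResponse, Http404
    from oauth2_provider.views.generic import ScopedProtectedResourceView


    class ResourceDownloadView(ScopedProtectedResourceView):
        # The token presented by the client must carry this scope.
        required_scopes = ["resource:read"]

        def get(self, request, resource_id, *args, **kwargs):
            record = lookup_resource(resource_id, user=request.user)  # hypothetical helper
            if record is None:
                raise Http404("Unknown or inaccessible resource")
            return FileResponse(open(record.path, "rb"), as_attachment=True)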

Logging

Logging is critical to security. Logs allow administrators to spot harmful activity, generate statistical models based on usage, and aid in auditing the system.

Tokens

Tokens are a perfect solution for controlling document access in the system. They allow a user to gain access to a document and often carry scopes that grant different levels of access to a resource.

A user should be required to log in to the application to retrieve a token, which refreshes on a regular schedule. These tokens can be revoked or changed by a resource owner or administrator, much like permissions on a file system.
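
A minimal sketch of the “randomly generated and hashed key” idea, with an in-memory dictionary standing in for a real access-management database:

    import hashlib
    import secrets
    from datetime import datetime, timedelta

    # In-memory store purely for illustration; a real system would persist
    # hashed tokens in the access-management database from the list above.
    token_store = {}  # sha256(token) -> (resource_id, scopes, expiry)


    def issue_token(resource_id, scopes, ttl_hours=24):
        raw = secrets.token_urlsafe(32)                    # shown to the user exactly once
        digest = hashlib.sha256(raw.encode()).hexdigest()  # only the hash is stored
        expiry = datetime.utcnow() + timedelta(hours=ttl_hours)
        token_store[digest] = (resource_id, set(scopes), expiry)
        return raw


    def check_token(raw, resource_id, scope):
        record = token_store.get(hashlib.sha256(raw.encode()).hexdigest())
        if record is None:
            return False
        stored_id, scopes, expiry = record
        return stored_id == resource_id and scope in scopes and datetime.utcnow() < expiry


    def revoke_token(raw):
        # An owner or admin can revoke at any time, much like changing file permissions.
        token_store.pop(hashlib.sha256(raw.encode()).hexdigest(), None)

Storing only the hash means a leaked token table does not hand out usable credentials.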

Fernet Encryption

While RSA encryption is useful for asymmetric, two-way exchange, Fernet, a symmetric scheme built on AES with HMAC signing, is better suited to encrypting files at rest. If the file system does not offer encryption, Python’s cryptography package provides a Fernet implementation.
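
A minimal sketch using the cryptography package (the file names are hypothetical):

    from cryptography.fernet import Fernet

    # Generate a key once and keep it out of the repository (e.g., in Blackbox).
    key = Fernet.generate_key()
    fernet = Fernet(key)

    # Encrypt a file's bytes before writing them to shared storage.
    with open("report.csv", "rb") as source:
        ciphertext = fernet.encrypt(source.read())
    with open("report.csv.enc", "wb") as target:
        target.write(ciphertext)

    # Decrypt later with the same key; tampered ciphertext raises InvalidToken.
    plaintext = fernet.decrypt(ciphertext)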

Storing Keys

Keys should not be stored in the open. A key kept in plain text is extremely easy to steal once a system is compromised. Instead, tools such as Stack Exchange’s Blackbox store keys encrypted under a GPG key ring.

Two Step Downloading for Extra Security

Downloading a file in a secure manner might require extra protection, particularly when an external but trusted user requests access to a resource. To prevent spoofing and keep a compromised computer from gaining access to a resource, two-step verification is recommended.

In this process, the external user presents an access token for the document, which is verified. On verification, a text message containing an access code is sent to the external user, and the internal user is notified of the request. The external user enters the code and, if required, the resource owner or an admin approves the download.
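
A minimal sketch of the SMS leg using Twilio; the credentials, phone numbers, and notification helper are placeholders rather than a real configuration:

    import secrets
    from twilio.rest import Client

    client = Client("ACCOUNT_SID", "AUTH_TOKEN")  # placeholder credentials

    pending_codes = {}  # download request id -> one-time code


    def start_download(request_id, external_phone):
        # Generate a six-digit code and deliver it out of band via SMS.
        code = f"{secrets.randbelow(1_000_000):06d}"
        pending_codes[request_id] = code
        client.messages.create(
            to=external_phone,
            from_="+15550000000",            # placeholder Twilio number
            body=f"Your download code is {code}",
        )
        notify_internal_user(request_id)     # hypothetical helper alerting the resource owner


    def confirm_download(request_id, submitted_code):
        # The resource owner or an admin may still need to approve before the file is released.
        return pending_codes.pop(request_id, None) == submitted_code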

This type of process is not difficult to implement through desktop or web applications using push notifications or persistent storage.

Conclusion

Secured yet accessible storage is a critical problem for any data analyst or scientist. Using an established IRM or implementing a similar tool helps secure access and empower analytics.