2019 Trend: Data Engineering Becomes a Household Name


There will be many 2019 trends that last well beyond the year. With Tableau now a household name, Salesforce a workhorse for analytics, SAS continuing to grow through JMP, and smaller players such as Panoply acquiring funding, one hot technology trend in 2019 will be data engineering.

This trend highlights a massive problem in the field of data: data management tools and frameworks are severely deficient. Many merely perform materialization.

That is changing this year, and it means that data engineering will be an important term over the next few years. Automation will become a reality.

What is a data engineer?

Data engineers create pipelines. This means automating the handling of data from aggregation and ingestion to the modeling and reporting process. These professionals handle big data or even small data loads with streaming playing an important role in their work.

Because data engineers cover the entire pipeline for your data and often implement analytics in a repeatable manner, the role is broad. Terms such as ETL, ELT, verification, testing, reporting, materialization, standardization, normalization, distributed programming, crontab, Kubernetes, microservices, Docker, Akka, Spark, AWS, REST, Postgres, Kafka, and statistics are slung with ease by data engineers.

Until 2019, integrating systems often meant combining a variety of tools into a cluttered wreck. A company might deploy Python scripts for visualization, Hitachi Vantara (formerly Pentaho) for ETL, a variety of aggregation tools glued together with Kafka, data warehouses in PostgreSQL, and may even still use Microsoft Excel to store data.

The typical company spends $4,000 to $8,000 per employee providing these pipelines. This cost is unacceptable and can be avoided in the coming years.

Why won’t ELT kill data engineers?

ELT applications promise to get rid of data engineers. That is pure nonsense meant to attract ignorant investors' money:

  • ELT is horrible for big data with continuous ETL proving more effective
  • ELT is often performed on data sources that already underwent ETL at the companies from which they were purchased, such as Acxiom, Nasdaq, and TransUnion
  • ELT eats resources in a significant way and often limits its use to small data sets
  • ELT ignores issues related to streaming from surveys and other sources which greatly benefit from the requirements analysis and transformations of ETL
  • ELT is horrible for integration tasks where data standards differ or are non-existent
  • You cannot run good AI or build good models on poorly standardized or non-standardized data

This means that ETL will continue to be a major part of a data engineer's job.

Of course, since data engineers translate business analyst requirements into reality, the job will continue to be secure. Coding may become less important as new products are released but will never go away in the most efficient organizations.

Why is Python likely to become less popular?

Many people point to Python as a means for making data engineers redundant. This is simply false.

Python is limited. This means that the JVM will rise in popularity with data scientists and even analysts as companies look to make money on the backs of their algorithms. This benefits data engineers, who are typically proficient in at least one of Java, Go, or Scala.

Python works for developers, analysts, and data scientists who want to control tools written in a more powerful language such as C++ or Java. Pentaho, since acquired by Hitachi, experimented with the language. However, often benchmarking as much as 60 times slower than the JVM and frequently requiring three times the resources, Python is not an enterprise-grade language.

Python does not provide power. It is poor at parallelism: any language can achieve concurrency, but CPython's GIL keeps the interpreter effectively single-threaded, its threads are heavyweight OS threads, and real parallelism requires spawning whole processes. This is horrendous.
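For contrast, here is a minimal sketch of multi-core parallelism on the JVM using Scala futures. The workload is purely illustrative, but the pattern is the everyday one:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSum extends App {
  // Eight CPU-bound chunks scheduled on a shared thread pool; the JVM runs
  // them on multiple cores with no global interpreter lock in the way.
  val chunks = (0 until 8).map { i =>
    Future((i * 1000000L until (i + 1) * 1000000L).sum)
  }
  val total = Await.result(Future.sequence(chunks), 30.seconds).sum
  println(s"Sum: $total")
}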

Consider the case of using Python’s Celery versus Akka, a Scala and Java-based tool. Celery and Akka perform the same tasks across a distributed system.

Parsing millions of records in Celery can quickly eat up more than fifty percent of a typical server's resources with a mere ten processes. RabbitMQ, the message broker behind Celery, can only handle 50 million messages per second on a cluster. Depending on the use case, Celery may also require Redis to run effectively. This means that an 18-logical-core server with 31 gigabytes of RAM can be severely bogged down processing tasks.

Akka, on the other hand, is the basis for Apache Spark. It is lightweight and all-inclusive. Fifty million messages per second are attainable with 10 million actors running concurrently at much less than fifty percent of a typical server's resources. Since not every use case requires Spark, even in data engineering, this is an outstanding difference. Not needing a separate message broker and results backend also means that less skill is required for deployment.
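To give a feel for how cheap actors are, here is a minimal sketch using the Akka classic API. The actor, its workload, and the counts are illustrative rather than a benchmark, and the akka-actor dependency is assumed:

import akka.actor.{Actor, ActorSystem, Props}

// A trivially small actor; each instance costs a few hundred bytes of heap,
// which is why millions can coexist inside a single JVM.
class RecordParser extends Actor {
  def receive: Receive = {
    case record: String => sender() ! record.split(",").length
  }
}

object LightweightActors extends App {
  val system = ActorSystem("demo")
  // Spawn a large pool of parsers and fan records out to them.
  val parsers = (1 to 100000).map(i => system.actorOf(Props[RecordParser], s"parser-$i"))
  parsers.foreach(_ ! "a,b,c,d")
  // Replies sent from outside an actor land in dead letters; a real pipeline
  // would collect them with ask or route them to a sink actor.
}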

Will Scala become popular again?

When I started programming in Scala, the language was fairly unheard of. Many co-workers merely looked at this potent language as a curiosity. Eventually, Scala's popularity started to wane as Java developers remained focused on websites and neglected to create the same frameworks for Scala that exist in Python.

That is changing. With the rise of R, whose syntax is incredibly similar to Scala's, mathematicians and analysts are gaining skill in increasingly complex languages.

Perhaps due to this, Scala is making its way back into the lexicon of developers. Python's edge was greatly reduced in 2017 as previously non-existent or non-production-level tools were released for the JVM.

Consider what is now at least at version 1.0 in the Scala and JVM ecosystem:

  • ND4J and ND4S: a Java and Scala n-dimensional array framework that boasts speeds faster than NumPy
  • DL4J: Skymind is a terrific company producing tools comparable to Torch
  • TensorFlow: contains APIs for both Java and Scala
  • Neanderthal: a Clojure-based linear algebra system that is blazing fast
  • OpenNLP: A new framework that, unlike the Stanford NLP tools, is actively developed and includes named entity recognition and other powerful transformative tools
  • Bytedeco: This project is filled with angels (I actually think they came from heaven) whose innovative and nearly automated JNI creator has created links to everything from Python code to Torch, libpostal, and OpenCV
  • Akka: Lightbend continues to produce distributed tools for Scala with now open sourced split brain resolvers that go well beyond my majority resolver
  • MongoDB connectors: Python's MongoDB connectors are resource intensive due to the rather terrible nature of Python bytecode
  • Spring Boot: Scala and Java are interoperable, but benchmarks of Spring Boot show at least a 10,000-request-per-second improvement over Django
  • Apereo CAS: A single sign-on system that adds terrific security to disparate applications

Many of these frameworks are written in Java. Since Scala runs on the JVM and can call any Java library, the languages are interoperable. Scala is cleaner, functional, highly modular, and requires much less code than Java, which puts these tools within the reach of analysts.
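To make the interoperability concrete, here is a minimal sketch of Scala calling ND4J, a Java library from the list above. It assumes an ND4J artifact such as nd4j-native-platform is on the classpath, and exact factory signatures vary between ND4J versions:

import org.nd4j.linalg.factory.Nd4j

object Nd4jFromScala extends App {
  // ND4J is written in Java, but Scala calls it directly with no glue code.
  val matrix  = Nd4j.create(Array(1.0, 2.0, 3.0, 4.0), Array(2, 2))
  val shifted = matrix.add(1.0)                   // element-wise addition
  val product = matrix.mmul(matrix.transpose())   // matrix multiplication
  println(product)
}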

What do new tools mean for a data engineer?

The new Java, Scala, and Go tools and frameworks mean that attaining 1,000 times the speed on a single machine, with a significant cost reduction over Python, is possible. It also means chaining millions of moving parts into a solid microservices architecture instead of a cluttered monolithic wreck.

The result is clear. My own company is switching off of Python everywhere except our responsive, front-end-heavy web application, for a fifty percent reduction in hardware costs.

How will new tools help data engineers?

With everything that Scala and the JVM offer, data engineers now have a potent toolset for automation. These valuable employees may not be creating the algorithms, but they will be transforming data in smart ways that produce real value.

Companies no longer have to rely on archaic languages to produce messy systems, and this will translate directly into value. Data engineers will be behind this increase in value as they can more easily combine tools into a coherent and flexible whole.

Conclusion

The continued rise of JVM-backed tools that started in 2018 will make data pipeline automation a significant part of companies' IT budgets. Data engineers will be behind the evolution of data pipelines from disparate systems into a streamlined whole backed by custom code and new products.

Data engineering will be a hot 2019 trend. After this year, we may just be seeing the creation of Skynet.


Fluff Stuff: Better Governments, Better Processes, Simplr Insites

Cities are heading towards bankruptcy. The credit rating of Stockton, CA was downgraded. Harrisburg, PA is actually bankrupt. It is only a matter of time before Chicago implodes. Since 1995, city debt rose by between $1.3 trillion and $1.8 trillion. While a large chunk of this cost is from waste, more is the result of using intuition over data when tackling infrastructure and new projects. Think of your city government as the boss who likes football more than his job, so he builds a football stadium despite your company being in the submarine industry.

This is not an unsolvable nightmare.

Take an effective use case where technologies and government processes were outsourced. As costs rose in Sandy Springs, GA, the city outsourced and achieved more streamlined processes, better technology, and lower costs. Today, without raising taxes, the city is in the green. While Sandy Springs residents are wealthy, even poorer cities can learn from this experience.

Running city projects in a scientific manner requires an immense amount of clean, quality, well-managed data isolated into individual projects. With an average of $8,000 spent per employee on technology each year and with an immense effort spent acquiring analysts and modernizing infrastructure, cities are struggling to modernize.

It is my opinion, one I am basing a company on, that the provision of quality data management, sharing and collaboration tools, IT infrastructure, and powerful project and statistical management systems in a single SaaS tool can greatly reduce the $8,000-per-employee cost and improve budgets. Such systems can even reduce the amount of administrative staff needed, allowing money to flow to where it is needed most.

How can a collaborative tool tackle the cost problem? Through:

  • collaborative knowledge sharing of working, ongoing, and even failed solutions
  • public facing project blogs and information on organizations, projects, statistical models, results, and solutions that allow even non-mathematical members of an organization to learn about a project
  • a security-minded institutional resource manager (IRM, better thought of as a large, securable, shared file storage system) capable of expanding into the petabytes while maintaining FERPA, HIPAA, and state and local regulations
  • the possibility to share data externally, keep it internal, or keep the information completely private while obfuscating names and other protected information
  • complexity analysis (graph based analysis) systems for people, projects, and organizations clustered for comparison
  • strong comparison tools
  • potent and learned aggregation systems with validity in mind ranging from streamed data from sensors and the internet to surveys to uploads
  • powerful drag and drop integration and ETL with mostly automated standardization
  • deep diving upload, data set, project, and model exploration using natural language searching
  • integration with everything from a phone to a tablet to a powerful desktop
  • access controls for sharing the bare minimum amount of information
  • outsourced IT infrastructure including infrastructure for large model building
  • validation using proven algorithms and increased awareness of what that actually means
  • anomaly detection
  • organization of models, data sets, people, and statistical elements into a single space for learning
  • connectors to popular visualizers such as Tableau and Vantara with a customizable dashboard for general reporting
  • downloadable sets with two entity verification if required that can be streamed or used in Python and R

Tools such as ours significantly reduce the cost of IT by as much as 65%. We eliminate much of the waste in the data science pipeline while trying to be as simple as possible.

We should consider empowering and streamlining the companies, non-profits, and government entities such as schools and planning departments that perform vital services before our own lives are greatly affected. Debt and credit are not solutions to complex problems.

Take a look, or don’t. This is a fluff piece on something I am passionately building. Contact us if you are interested in a beta test.

Avoid Race Conditions with a Python3 Manager

 


In this short article, we examine the potential for race conditions in Python’s multiprocessing queue and how to potentially avoid this problem.

Race Condition

At times, the standard multiprocessing queue can block indefinitely even though similar code succeeds. It appears that a reader-writer lock issue plagues the standard multiprocessing Queue. There is a bad egg at the table.

The following code runs without issue:

import random
from multiprocessing import Manager, Process, Queue
from time import sleep


# Writer mix-in: places a response onto the outbound queue.
class OutMe(object):

    def __init__(self, q):
        self.q = q

    def do_put(self):
        self.q.put("Received")

    def rec(self):
        pass


class QMe(OutMe):

    def __init__(self, q, iq):
        self.q = q
        self.iq = iq
        OutMe.__init__(self, q)

    def rec(self):
        msg = self.iq.get(timeout=20)
        return msg

    def run(self):
        pass


# Worker process: reads a message from one queue and responds on the other.
class MProc(Process, QMe):

    def __init__(self, queue, oqueue):
        self.queue = queue
        self.oqueue = oqueue.get('oq')
        print(self.queue)
        Process.__init__(self)
        QMe.__init__(self, self.oqueue, self.queue)

    def run(self):
        while True:
            msg = self.rec()
            print(msg)
            self.do_put()


if __name__ == "__main__":
    mq = Queue()
    oq = {'oq': Queue()}
    mp = MProc(mq, oq)
    mp.start()
    while True:
        mq.put('Put')
        sleep(random.randint(0,3))
        print(oq.get('oq').get())

However, there are similar circumstances where the Queue blocks despite having a size larger than 0. This issue often appears when inheriting from other objects and was submitted as a bug to the Python project back in 2013.

Python Manager

A Python Manager coordinates data between different processes. The Manager runs a server process that is better able to handle race conditions.

Where the above approach fails, it is more likely that the following will succeed:

if __name__ == "__main__":
    mgr = Manager()
    mgr2 = Manager()
    mq = mgr.Queue()
    oq = {'oq': mgr2.Queue()}
    mp = MProc(mq, oq)
    mp.start()
    while True:
        mq.put('Put')
        sleep(random.randint(0,3))
        print(oq.get('oq').get())

In this example, the queues from the previous code are created using managers. Each manager hands back a proxy object in place of the queue, and the manager's server process handles access to the underlying queue.

Of note, since Python 3.6 the objects held by a manager may themselves contain other shared proxy objects, so nested shared structures are possible.

Conclusion

Python is decades old, and this creates some problems. The standard multiprocessing queue can be plagued by race conditions. A Python Manager helps resolve this issue.

Running a Gevent StreamServer in a Thread for Maximum Control

There are times when serving requests needs to be controlled without forcibly killing a running server instance. It is possible to do this with Python’s gevent library. In this article, we will examine how to control the gevent StreamServer through an event.

All code is available on my GitHub through an actor system project which I intend to use in production (i.e., it will be completed).

Greenlet Threads

Gevent utilizes green threads through the gevent.greenlet package. A greenlet, a green thread in gevent, serves a similar purpose to Python's asyncio coroutines but with more control over scheduling.

A greenlet is scheduled by the program instead of the OS and feels more akin to threading on the JVM. Greenlet threads run code concurrently but not in parallel, although parallelism could be achieved by starting greenlets in separate processes.

In this manner, it is possible to schedule greenlet threads on a new operating-system thread or, more likely due to the GIL, in a new process to achieve a level of concurrency and even parallelism similar to Java. Of course, the cost of that parallelism must be considered.

StreamServer

Gevent maintains a server through gevent.server.StreamServer. This server utilizes greenlet threads to concurrently handle requests.

A StreamServer takes an address and port and can optionally be given a pool for controlling the number of connections created:

pool = Pool(MAX_THREADS)
server = StreamServer((self.host, self.port), handle_connect, spawn=pool)

This concurrency allows for fast, bi-directional communication of the sort that is awkward to achieve with asyncio.

Gracefully Exiting a StreamServer

In instances where we want to gracefully exit a server while still running code, it is possible to use the gevent.event.Event class to communicate between threads and force the server to stop.

As a preliminary note, it is necessary to call the gevent.monkey.patch_all method in the native thread to allow for cross-thread communications.

from gevent import monkey

monkey.patch_all()

Instead of using the serve_forever method, we must utilize the start and stop methods. This must also be accompanied by the use of the Event class:

import atexit
import signal

import gevent
from gevent.event import Event
from gevent.pool import Pool
from gevent.server import StreamServer

evt = Event()
pool = Pool(MAX_THREADS)
server = StreamServer((self.host, self.port), handle_connect, spawn=pool)
server.start()
gevent.signal(signal.SIGQUIT, evt.set)
gevent.signal(signal.SIGTERM, evt.set)
gevent.signal(signal.SIGINT, evt.set)
self.signal_queue.put(ServerStarted())  # project-specific: notify the actor system
atexit.register(server.stop)
evt.wait()
server.stop(10)

For good measure, termination signals are handled through gevent.signal to allow the server to close and kill the thread in case of user-initiated termination. This hopefully leaves no loose ends.

The event can be set externally to force the server to stop with the preset 10 second timeout:

evt.set()

Conclusion

In this article, we examined how a greenlet thread or server can be controlled using a separate thread for graceful termination without exiting a running program. An Event was used to achieve our goal.

Akka: An Introduction

Akka's documentation is immense. This series helps tackle the many components by providing a working example of the master-slave design pattern built with this powerful tool. The following article reviews the higher-level concepts behind Akka and its usage.

Links are provided to different parts of the Akka documentation throughout the article.


Akka

Akka is a software toolkit used to build multi-threaded and distributed systems based on the actor model. It takes care of lower-level systems building by providing high-level APIs for node and actor generation.

Actors are the primitives behind Akka. They are useful for performing repeated tasks concurrently.  Actors run until terminated, receiving work through message passing.


Resource Usage

Akka is extremely lightweight. The creators boast that the tool can handle millions of actors on a single machine.

Message passing occurs through mailboxes. The default mailbox is unbounded, but a bounded mailbox with a configurable capacity (1,000 messages by default) can be used instead, and remote messages are additionally limited by a configurable maximum frame size.
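As a sketch of how the mailbox can be bounded, the defaults can be overridden through configuration; the capacity and timeout values here are illustrative:

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object BoundedMailboxSystem extends App {
  // Replace the default unbounded mailbox with a bounded one; senders block
  // for up to the push timeout once the capacity is reached.
  val conf = ConfigFactory.parseString(
    """
      |akka.actor.default-mailbox {
      |  mailbox-type = "akka.dispatch.BoundedMailbox"
      |  mailbox-capacity = 1000
      |  mailbox-push-timeout-time = 10s
      |}
    """.stripMargin).withFallback(ConfigFactory.load())

  val system = ActorSystem("MySystem", conf)
}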

The Actor

The actor is the universal primitive used in Akka. Unlike a thread in a programming language, this primitive runs like a daemon server. As such, it should be shut down gracefully.

Actors are user-created:

import akka.actor.{Actor, ActorLogging, ActorRef, ActorSystem, Props}
import com.typesafe.config.ConfigFactory

class MyActor extends Actor with ActorLogging {

  // Lifecycle hooks for setup, teardown, and restart handling
  override def preStart(): Unit = log.debug("Starting")
  override def postStop(): Unit = log.debug("Stopping")
  override def preRestart(reason: Throwable, message: Option[Any]): Unit =
    log.error(s"Restarting because of ${reason.getMessage}. $message")
  override def postRestart(reason: Throwable): Unit = log.debug("Restarted")

  override def receive: Receive = {
    case _ => sender() ! "Hello from Actor"
  }
}

object MyActor {
  def setupMyActor(): ActorRef = {
    val conf = ConfigFactory.load()
    val system = ActorSystem("MySystem", conf)
    system.actorOf(Props[MyActor], name = "myactor")
  }
}

 

The example above creates an actor and a Scala companion object for instantiation.

Actors must extend Actor. ActorLogging provides the log field. The optional preRestart and postRestart hooks handle restarts after failures, while the optional preStart and postStop methods handle setup and tear-down tasks. The basic actor above incorporates logging and error processing.

An actor can:

  • Create and supervise other actors
  • Maintain state
  • Control the flow of work in a system
  • Perform a unit of work on request or repeatedly
  • Send and receive messages
  • Return the results of a computation

Akka's serialization is extremely powerful. Anything available on the classpath across a cluster that implements Serializable can be sent to and from an actor. Instances of classes are deserialized without the programmer having to recast them.
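A minimal sketch of this: the case classes below are plain Serializable messages and arrive at the receiving actor already typed. The message shapes are hypothetical:

import akka.actor.{Actor, ActorLogging, ActorSystem, Props}

// Case classes are Serializable by default, so they can cross the wire to a
// remote actor and be pattern matched on arrival without casting.
case class ParseJob(rowId: Long, payload: String)
case class ParseResult(rowId: Long, tokens: Seq[String])

class ParseActor extends Actor with ActorLogging {
  def receive: Receive = {
    case ParseJob(id, payload) =>
      sender() ! ParseResult(id, payload.split(",").toSeq)
  }
}

object SerializationDemo extends App {
  val system = ActorSystem("MySystem")
  val parser = system.actorOf(Props[ParseActor], "parser")
  parser ! ParseJob(42L, "a,b,c") // the reply goes to dead letters outside an actor
}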

When to Use Akka

Actor systems are not a universal solution. When not performing repeated tasks and not benefiting from high levels of concurrency, they are a hindrance.

State persistence also weighs heavily in the use of an actor system. Take the example of cookies in network requests. Maintaining different network sessions across different remote actors can be highly beneficial in ingestion.

Any task provided to an actor should contain very little new code and a limited number of configuration variables.

Questions that should be asked based on these concepts include:

  • Can I break up tasks into sufficiently large quantities of work to benefit from concurrency?
  • How often will  tasks be repeated in the system?
  • How minimal can I make the configuration for the actor if necessary?
  • Is there a need for state persistence?
  • Is there a significant need for concurrency or is it a nice thought?
  • Is there a resource constraint that distribution can solve or that will limit threading?

State is definitely a large reason to use Akka. This could be in the form of actually  maintaining variables or in the actor itself.

In some distributed use cases involving the processing of enormous numbers of short-lived requests, the actor's own state and Akka's mailbox capabilities are what matter most. This is the reasoning behind tools built on Akka, such as Spark.

As is always the case when deciding to create a new system, the following should be asked as well:

  • Can I use an existing tool such as Spark or TensorFlow?
  • Does the time required to build the system outweigh the overall benefit it will provide?

Clustering

Clustering is available in Akka. Akka provides high-level APIs for generating distributed clusters. Specified seed nodes handle initial communications, serving as the cluster's entry points.

Network design is provided entirely by the developer. Since node generation, logging, fault tolerance, and basic communication are the only pieces of a distributed system Akka handles, any distribution model will suffice. Two common models are the master-slave and graph-based models.

Akka clusters are resilient, with well-developed fault tolerance.
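A minimal sketch of joining a cluster through seed nodes, assuming the akka-cluster dependency and Akka 2.5-style classic remoting; the system name, hosts, and ports are placeholders:

import akka.actor.ActorSystem
import akka.cluster.Cluster
import com.typesafe.config.ConfigFactory

object ClusterNode extends App {
  val conf = ConfigFactory.parseString(
    """
      |akka.actor.provider = cluster
      |akka.remote.netty.tcp.hostname = "127.0.0.1"
      |akka.remote.netty.tcp.port = 2551
      |akka.cluster.seed-nodes = [
      |  "akka.tcp://MySystem@127.0.0.1:2551",
      |  "akka.tcp://MySystem@127.0.0.1:2552"
      |]
    """.stripMargin).withFallback(ConfigFactory.load())

  val system = ActorSystem("MySystem", conf)
  // The node contacts the seed nodes above and joins once membership is gossiped.
  Cluster(system).registerOnMemberUp {
    println("Joined the cluster")
  }
}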

Configuration

Configuration is performed either through a file or in code. Many components can be configured, including logging levels, cluster components, metrics collection, and routers.
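The same settings can live in application.conf on the classpath or be supplied in code. A small sketch of the programmatic route, with illustrative log level and router settings:

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object ConfiguredSystem extends App {
  // Programmatic overrides take precedence over application.conf, which is
  // pulled in through the withFallback call.
  val overrides = ConfigFactory.parseString(
    """
      |akka.loglevel = "DEBUG"
      |akka.actor.deployment {
      |  /parserRouter {
      |    router = round-robin-pool
      |    nr-of-instances = 5
      |  }
      |}
    """.stripMargin)

  val system = ActorSystem("MySystem", overrides.withFallback(ConfigFactory.load()))
  // The /parserRouter block applies to an actor created at that path with
  // akka.routing.FromConfig.props(...).
}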

Conclusion

This article is the entry point for the Akka series, providing the basic understanding needed as we begin to build a cluster.

Messaging, ETL, and an Akka Proposal

Data sources are multiplying. NoSQL can help aggregate multiple sources into a more coherent whole, and Akka, which can split work across multiple nodes, serves as a perfect way of writing distributed systems. Combining it with messaging via queues or topics and the master-slave pattern could provide a significant boost to ETL. Using databases as messaging systems, it is easy to see how processes can kick-start. My goal is to create a highly concurrent system that takes data from a scraper (from any source, as can be done with my Python crawl modules), writes the data to a NoSQL-style JSONB store in PostgreSQL, and notifies a set of parsers which then look at patterns in the data to determine how to ETL it. This is not really revolutionary, but it is a good test of concurrency and automation.

Results will be reported.

Collection with NoSQL and Storage with SQL

There are four well-known forms of NoSQL database: key-value, document, column-family, and graph. In the case of ETL, key-value storage is a good way to expand data without worrying about what, if anything, is present. However, even in denormalized form, this is not the best storage solution for customer-facing products. Therefore, data will be placed into a client-facing database configured with relational PostgreSQL tables.
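A minimal sketch of the staging step using plain JDBC from Scala; the connection details and table name are hypothetical, and the PostgreSQL JDBC driver is assumed to be on the classpath:

import java.sql.DriverManager

object JsonbStaging extends App {
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://localhost:5432/etl", "etl_user", "etl_pass")

  // Schemaless staging area: raw scraped documents land in a JSONB column.
  conn.createStatement().execute(
    "CREATE TABLE IF NOT EXISTS raw_staging (id BIGSERIAL PRIMARY KEY, payload JSONB)")

  // Insert a raw document; parsers later read, transform, and write the result
  // to the normalized, client-facing relational tables.
  val insert = conn.prepareStatement("INSERT INTO raw_staging (payload) VALUES (?::jsonb)")
  insert.setString(1, """{"source": "crawler", "title": "example", "price": 9.99}""")
  insert.executeUpdate()
  conn.close()
}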

Messaging and Building Patterns for Akka and Scala

With messaging and state machines, the actual use for an actor does not need to be fixed ahead of time. At runtime, interactions or patterns force the actor to take on a different role. This can be accomplished with a simple pattern match on the incoming message. From there, a message with the data to be transformed can be passed to an actor. This data, tagged with a row ID, can then be parsed after an actor reads a message from a queue. The message specifies conditions such as which parser combinator to use, and the actor completes an activity based on this. This is not incredibly different from the Routing Slip pattern, except that no re-routing occurs.
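A minimal sketch of that role switching using Akka's context.become; the message types and the parsing itself are hypothetical:

import akka.actor.Actor

// Messages that assign a role and deliver work; shapes are illustrative.
case object UseCsvParser
case object UseJsonParser
case class Parse(rowId: Long, payload: String)

class RoleSwitchingParser extends Actor {
  // No role assigned yet; wait for an instruction from the queue reader.
  def receive: Receive = {
    case UseCsvParser  => context.become(csv)
    case UseJsonParser => context.become(json)
  }

  def csv: Receive = {
    case Parse(id, payload) => sender() ! (id, payload.split(",").toSeq)
    case UseJsonParser      => context.become(json)
  }

  def json: Receive = {
    case Parse(id, payload) => sender() ! (id, payload) // stand-in for real JSON parsing
    case UseCsvParser       => context.become(csv)
  }
}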

The data would be aggregated using the available row IDs in batches of a certain size. Perhaps batch iterators would best do the trick in determining the size of the batch to process.

Returning Data back to the original Actor

Returning the data requires messaging as well. The message returns to the initial actor, where it needs to be matched with the appropriate row.

Recap

To recap, the question is: can Akka perform more generic ETL than currently available open source tools?

To test this question, I am developing Akka ETL. The tool will take in scraped data (from processes that can be managed with the same messaging technique but are not easily distributed due to statefulness and security). The design includes taking in completed sources from a database, acquiring data, messaging an actor with the appropriate parsing information, receiving the transformed data from these actors, and posting it to a relational database.

The real tests will be maintaining data de-duplication, non-mixed data, and unique identifiers.