The number of data sources keeps growing, and NoSQL can help aggregate them into a more coherent whole. Akka, which can distribute work across multiple servers, serves as an excellent way of writing distributed systems. Combining it with messaging via queues or topics and the master-slave pattern could provide a significant boost to ETL. When a database doubles as the messaging system, it is easy to see how processes can be kick-started. My goal is to create a highly concurrent system that takes data from a scraper (from any source my Python crawl modules can handle), writes it to a NoSQL-style JSONB store in PostgreSQL, and notifies a set of parsers, which then look at patterns in the data to determine how to ETL it. This is not really revolutionary, but it is a good test of concurrency and automation.
Results will be reported.
Collection with NoSQL and Storage with SQL
There are four well-known forms of NoSQL database: key-value, document, column-family, and graph. For ETL, key-value is a good way to expand data without worrying about which fields, if any, are present. However, even in denormalized form, it is not the best storage for customer-facing solutions. Therefore, the transformed data will be placed into a client-facing database built from relational PostgreSQL tables.
Messaging and Building Patterns for Akka and Scala
With messaging and state machines, the actual use of an actor does not need to be known ahead of time. At runtime, interactions or patterns force the actor to take on a different role, which can be accomplished with a simple match/case expression. From there, a message carrying the data to be transformed can be passed to an actor. The actor reads the message from a queue and parses the data, which is tagged with a row ID. The queued message specifies conditions such as which parser combinator to use, and the actor completes its activity based on them. This is not very different from the Routing Slip pattern, except that no re-routing occurs.
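The role switching described above can be sketched in plain Scala. This is only an illustration under stated assumptions: it does not use the Akka API, the `Worker` class stands in for an actor, and every name here (`Configure`, `Transform`, `Transformed`, the `csv` and `keyValue` parsers) is hypothetical. The two parsers are simple string splitters standing in for real parser combinators.

```scala
object RoleSketch {
  // Messages the worker understands (all names are hypothetical).
  sealed trait Message
  final case class Configure(parserName: String) extends Message
  final case class Transform(rowId: Long, raw: String) extends Message

  // The result keeps the row ID so it can be matched to its row later.
  final case class Transformed(rowId: Long, fields: Map[String, String])

  // Stand-ins for parser combinators: raw text in, named fields out.
  type Parser = String => Map[String, String]

  val csv: Parser = raw =>
    raw.split(",").zipWithIndex.map { case (v, i) => s"col$i" -> v.trim }.toMap

  val keyValue: Parser = raw =>
    raw.split(";").map(_.split("=", 2)).collect {
      case Array(k, v) => k.trim -> v.trim
    }.toMap

  val parsers: Map[String, Parser] = Map("csv" -> csv, "keyValue" -> keyValue)

  // A worker whose role is decided by the messages it receives.
  final class Worker {
    private var parse: Option[Parser] = None

    def receive(msg: Message): Option[Transformed] = msg match {
      case Configure(name) =>
        parse = parsers.get(name) // an unknown name leaves the worker idle
        None
      case Transform(rowId, raw) =>
        parse.map(p => Transformed(rowId, p(raw)))
    }
  }
}
```

A worker that has not been configured returns `None` for a `Transform` message. After `Configure("csv")` it behaves as a CSV parser, and a later `Configure("keyValue")` changes its role again without the caller needing a different worker.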
The data would be aggregated by the available row IDs in batches of a certain size. Batch iterators are probably the best way to control the size of each batch.
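As a minimal sketch of the batch iterator idea, the standard library's `Iterator.grouped` already produces fixed-size batches lazily. The `batches` helper and its parameters are hypothetical, and the row IDs are assumed to be `Long` values read from the staging store.

```scala
object BatchSketch {
  // Lazily split a stream of row IDs into batches of at most batchSize.
  // The final batch may be smaller than batchSize.
  def batches(rowIds: Iterator[Long], batchSize: Int): Iterator[Seq[Long]] =
    rowIds.grouped(batchSize).map(_.toSeq)
}
```

Because the result is itself an iterator, only one batch needs to be held in memory at a time, which fits a design where each batch is handed to an actor as a message.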
Returning Data back to the original Actor
Returning the data requires messaging as well. The reply message goes back to the initial actor, where it needs to be matched with the appropriate row.
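One way to match replies to rows is to have the initial actor remember which row IDs are outstanding. The sketch below is plain Scala rather than Akka code, and `Coordinator`, `dispatch`, and `onReply` are hypothetical names chosen for illustration.

```scala
object ReplySketch {
  final case class Transformed(rowId: Long, fields: Map[String, String])

  // Tracks rows that were sent out and have not yet come back.
  final class Coordinator {
    private val pending = scala.collection.mutable.Map.empty[Long, String]
    private val done =
      scala.collection.mutable.Map.empty[Long, Map[String, String]]

    // Record the raw row before handing it to a parser.
    def dispatch(rowId: Long, raw: String): Unit = pending.update(rowId, raw)

    // Accept a reply only if its row ID is still outstanding.
    def onReply(reply: Transformed): Boolean =
      pending.remove(reply.rowId) match {
        case Some(_) =>
          done.update(reply.rowId, reply.fields)
          true
        case None =>
          false // unknown or duplicate reply is dropped
      }

    def outstanding: Set[Long] = pending.keySet.toSet
    def completed: Map[Long, Map[String, String]] = done.toMap
  }
}
```

Dropping replies whose row ID is not outstanding means a duplicated message cannot overwrite a row that has already been matched.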
To recap, the question is: can Akka perform more generic ETL than currently available open source tools?
To test this question I am developing Akka ETL. The tool will take in scraped data (from processes that can be managed with the same messaging technique but are not easily distributed because of statefulness and security). The design covers taking in completed sources from a database, acquiring the data, messaging an actor with the appropriate parsing information, receiving the transformed data from these actors, and posting it to a relational database.
The real tests will be maintaining data deduplication, keeping data from different sources unmixed, and preserving unique identifiers.
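One possible approach to deduplication, offered here as an assumption rather than the tool's actual design, is to fingerprint each record with a hash of its source name and raw payload. Including the source in the hash keeps identical payloads from different sources distinct. The names `fingerprint` and `dedupe` are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object DedupSketch {
  // Hex-encoded SHA-256 of the source name plus the raw payload.
  def fingerprint(source: String, raw: String): String = {
    val digest = MessageDigest.getInstance("SHA-256")
    digest.update(source.getBytes(StandardCharsets.UTF_8))
    digest.update(0.toByte) // separator so ("ab", "c") and ("a", "bc") differ
    digest.update(raw.getBytes(StandardCharsets.UTF_8))
    digest.digest().map(b => f"${b & 0xff}%02x").mkString
  }

  // Keep the first record per fingerprint, preserving input order.
  def dedupe(records: Seq[(String, String)]): Seq[(String, String)] = {
    val seen = scala.collection.mutable.Set.empty[String]
    records.filter { case (source, raw) => seen.add(fingerprint(source, raw)) }
  }
}
```

In a database-backed version, the same fingerprint could be stored in a column with a unique constraint so that duplicates are rejected at insert time.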