Scala’s Hidden Benefits

Scala should not be a fad, but neither should it be the only language in your toolkit. Java is fast(-ish), Python is dynamic, and Scala is to Java what Python is to C: it is simply easier and faster to write. Some companies are moving away from Scala, but imagine rewriting in Groovy some of the code that is currently written in Scala and a nightmare ensues. Yes, Go exists now, but Go cannot work directly with Java in the same project, is somewhat limited, and is at times slower than Java and Scala.

Scala offers several fairly direct benefits. It works well at the API layer over code such as Java, it is terrific at condensing data manipulation, and it has a much cleaner syntax for concurrency, allowing far more straightforward behaviour than the Java equivalent. One only needs to look at Scala Swing to see the productivity benefits of Scala on these levels.

My Use Case

My use case is seemingly simple but actually quite complex. Scala’s niche is data. It is perfect for ETL and for driving Java networking tools to pull in data. However, creating a large amount of Scala code is a bad idea. Every line of Scala does the work of several lines of Java, and that density becomes hard to manage at scale. Therefore, in creating an acquisition and ETL system for the company Hygenics Data LLC, the decision was made to write the concurrent code, the API, and the ETL in Scala, and most everything else in Java.

This left the majority of the code for the Goat ETL toolset to be written in Java. Goats eat everything and so does the system, so the name is perfect. The Java side includes interaction with Pentaho, streaming, basic classes for an API manager, a basic browser written with Apache HttpComponents and Rhino, and much more. Concurrent tasks and their managers, however, were written in Scala. The result was an incredibly fluent, fast system that runs extremely lightly and is significantly faster than Python. The extra management that can plague Java concurrency and generate multitudes of bugs and bottlenecks disappeared. Java’s CountDownLatch and other recent improvements are terrific but just don’t carry the simplicity and manageability of Scala.
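To give a rough idea of what "simplicity and manageability" means here, the following is a minimal sketch using the standard library's Future. It is not code from the Goat ETL toolset; fetch and the source names are hypothetical stand-ins for a real network pull.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ConcurrentFetchSketch {
  // Hypothetical stand-in for a network pull: returns the payload size.
  def fetch(source: String): Future[Int] = Future {
    source.length
  }

  // Start every fetch at once and collect the results in input order.
  // There is no latch, shared counter, or explicit thread management.
  def fetchAll(sources: Seq[String]): Future[Seq[Int]] =
    Future.sequence(sources.map(fetch))

  def main(args: Array[String]): Unit = {
    // Blocking with Await is only for the demo; real code would keep composing Futures.
    val sizes = Await.result(fetchAll(Seq("alpha", "beta", "gamma")), 5.seconds)
    println(sizes.sum)
  }
}
```

The same fan-out in plain Java would typically need an executor, a latch or a list of futures to join, and explicit exception handling around each join.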

ETL code was written in Scala and achieved an extreme degree of flexibility with minimal code and improved error handling. Where Java only recently gained Optional, its rough counterpart to Option, and still has no standard Try, Scala was practically built on these ideas. Validators, mapping, filtering, reduction, and other tasks are much simpler in Scala.
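As an illustration of how Try and Option shorten this kind of work, here is a small sketch of a parse, validate, and reduce pipeline. The Record type, the field names, and the validation rule are all assumptions made up for the example, not part of the actual ETL code.

```scala
import scala.util.Try

object EtlSketch {
  // Hypothetical target record for the example.
  final case class Record(id: Int, price: Double)

  // Parsing can fail (missing key, bad number); Try captures the exception
  // as a value instead of letting it propagate.
  def parse(row: Map[String, String]): Try[Record] =
    for {
      id    <- Try(row("id").toInt)
      price <- Try(row("price").toDouble)
    } yield Record(id, price)

  // Validation either keeps the record or drops it.
  def validate(r: Record): Option[Record] =
    if (r.price >= 0) Some(r) else None

  // Map, filter, and reduce in one pipeline; bad rows are skipped, not fatal.
  def totalValid(rows: Seq[Map[String, String]]): Double =
    rows
      .flatMap(row => parse(row).toOption)
      .flatMap(r => validate(r))
      .map(_.price)
      .sum

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Map("id" -> "1", "price" -> "9.5"),
      Map("id" -> "x", "price" -> "1.0"),  // unparsable id
      Map("id" -> "3", "price" -> "-2.0"), // fails validation
      Map("id" -> "4")                     // missing price
    )
    println(totalValid(rows))
  }
}
```

In a real pipeline the failures would more likely be logged or routed to a reject table than silently dropped; keeping the Try instead of calling toOption makes that straightforward.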

The API was written entirely in Scala. The result was a set of configurable structures that were easy to write and cut the amount of code by fifty percent or more.

Scala also makes it simple to add behaviour to existing classes via the Enhancement pattern, and it offers many benefits in generics, implicits, and other features that Java just does not have. Ask a Java-only programmer what invariance is.
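A short sketch may make both points concrete. I am assuming the Enhancement pattern refers to what is often called "enrich my library", done here with an implicit class; digitsOnly and the two Box classes are hypothetical names invented for the example. On variance: Java generics are invariant unless wildcards are used at each use site, whereas Scala lets the class author declare variance once.

```scala
object EnrichmentSketch {
  // Enrichment: add a method to java.lang.String without modifying or
  // subclassing it. digitsOnly is a made-up helper, not a library method.
  implicit class RichString(private val s: String) extends AnyVal {
    def digitsOnly: String = s.filter(_.isDigit)
  }

  // Declaration-site variance: the class author decides how subtyping of A
  // carries over to the container.
  class InvariantBox[A](val value: A)  // InvariantBox[String] is unrelated to InvariantBox[Any]
  class CovariantBox[+A](val value: A) // CovariantBox[String] is also a CovariantBox[Any]

  def main(args: Array[String]): Unit = {
    println("tel: 555-0100".digitsOnly)

    val ok: CovariantBox[Any] = new CovariantBox[String]("fine")
    println(ok.value)

    // The next line does not compile, because InvariantBox is invariant in A:
    // val bad: InvariantBox[Any] = new InvariantBox[String]("nope")
  }
}
```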

Regarding the advantages over Python:

  • The Scala/Java program ran much faster
  • The Scala/Java program could make use of threads without GIL interference
  • The Scala/Java program has tools such as Spark, Akka, and Pentaho at its disposal
  • The Apache networking tools allowed a much lower level of interaction
  • Scala’s concurrency tools are much better developed than Python’s

The result is visible on my GitHub page.

Mixing Code

The mix is actually intuitive and simple in both IntelliJ and Eclipse, where projects generate both sets of source folders. While many may try to avoid mixing code, anyone who programs in Java should be able to pick up the Scala code fairly easily. With a quick search and a bit of time, proficiency in both languages is feasible. It has always been my opinion that anyone should be able to extrapolate between languages and tools, or even from languages to tools (e.g. distribution in Carte). The concepts that build a tool or language are not infinite.

Interestingly, it is quite easy to mix Java Spring into Scala, as my ProxyManager shows. Since Scala compiles to Java bytecode, it was possible to use Spring from Scala, reducing reliance on the clunky and relatively low quality framework that is Play.
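The interop that makes this possible can be sketched without Spring, which needs external dependencies. The example below uses only JDK classes and assumes Scala 2.12 or later for the lambda conversion; InteropSketch and record are hypothetical names and have nothing to do with the actual ProxyManager.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

object InteropSketch {
  // A plain Java collection, instantiated and used from Scala with no wrapper.
  private val hits = new ConcurrentHashMap[String, AtomicInteger]()

  // Scala lambdas convert to Java functional interfaces (Scala 2.12+),
  // so computeIfAbsent accepts one directly.
  def record(host: String): Int =
    hits.computeIfAbsent(host, (_: String) => new AtomicInteger(0)).incrementAndGet()

  def main(args: Array[String]): Unit = {
    record("proxy-a")
    record("proxy-a")
    println(record("proxy-b"))
  }
}
```

Any Java library on the classpath, Spring included, is reached the same way: import the class and call it.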

PostgreSQL for Converting NoSQL to SQL

I needed a way to prepare data for ETL, and Drill may not be as comprehensive as what I can build with PL/pgSQL, since it only goes one layer deep. So it was time to find a way to move from dynamically created jsonb in PostgreSQL to PostgreSQL relational tables.

The solution was this little function. It can be extended with jsonb_array_elements and other functions to easily and quickly build up functions that delve deeper than Drill. Add the future master replication and seemingly improving distribution and threading to EnterpriseDB’s growing set of accomplishments with Postgres, and why use Drill?

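breakoutNoSQL(inputTable text,outputTable text,jsonColumn text,otherColumns text[],condition text,splitData boolean)

Only otherColumns and condition can be null. When splitData is true, jsonColumn is treated as a JSON array and each element is broken out separately.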

Code

CREATE OR REPLACE FUNCTION breakoutNoSQL(inputTable text,outputTable text,jsonColumn text,otherColumns text[],condition text,splitData boolean) RETURNS text[] AS
$BODY$
DECLARE
    k text;
    keys text[];
    stmt text;
    insertKeys text;
BEGIN
    IF outputTable IS NULL THEN
        RAISE EXCEPTION 'OUTPUT TABLE CANNOT BE NULL';
    END IF;

    IF inputTable IS NULL THEN
        RAISE EXCEPTION 'INPUT TABLE CANNOT BE NULL';
    END IF;

    IF jsonColumn IS NULL THEN
        RAISE EXCEPTION 'JSON COLUMN CANNOT BE NULL';
    END IF;

    --get the initial keys; stmt is used as scratch space here and rebuilt below
    IF splitData IS TRUE THEN
        stmt = 'jsonb_array_elements('||jsonColumn||'::jsonb)';
    ELSE
        stmt = jsonColumn||'::jsonb';
    END IF;
    stmt = 'SELECT array_agg(key) FROM (SELECT DISTINCT jsonb_object_keys('||stmt||') AS key FROM '||inputTable||') AS q1';

    --the condition filters the discovered keys, so it can only reference the "key" column
    IF condition IS NOT NULL THEN
        stmt = stmt||' WHERE '||condition;
    END IF;
    EXECUTE stmt INTO keys;

    IF keys IS NULL OR array_length(keys,1) = 0 THEN
        RAISE EXCEPTION 'NUMBER OF DISCOVERED KEYS WAS 0';
    END IF;

    --build the statement
    stmt = 'CREATE TABLE '||outputTable||' AS SELECT ';

    --build the select list: one text column per discovered key
    insertKeys = NULL;
    FOREACH k IN ARRAY keys LOOP
        IF insertKeys IS NULL THEN
            insertKeys = '';
        ELSE
            insertKeys = insertKeys||',';
        END IF;
        --quote_literal and quote_ident keep keys with quotes, spaces, or capitals from breaking the statement
        insertKeys = insertKeys||'btrim(cast(j'||jsonColumn||'::jsonb->'||quote_literal(k)||' as text),''"'') as '||quote_ident(k);
    END LOOP;

    --append any pass-through columns from the input table
    IF otherColumns IS NOT NULL THEN
        FOREACH k IN ARRAY otherColumns LOOP
            IF insertKeys IS NULL THEN
                insertKeys = '';
            ELSE
                insertKeys = insertKeys||',';
            END IF;
            insertKeys = insertKeys||k;
        END LOOP;
    END IF;
     	
    --concat to make full statement
    stmt = stmt||' '||insertKeys||' FROM '||' (SELECT *,';
    IF splitData IS TRUE THEN
      stmt = stmt||'jsonb_array_elements('||jsonColumn||'::jsonb) as j'||jsonColumn||' FROM '||inputTable||') as q1';
    ELSE
      stmt = stmt||jsonColumn||' as j'||jsonColumn||' FROM '||inputTable||') as q1';
    END IF;

    --print and execute statement
    RAISE NOTICE 'QUERY: %',stmt;
    EXECUTE stmt;
    
    --return the keys from json
    return keys;
END;
$BODY$
LANGUAGE plpgsql;