Lack of Nested and Implicit Data Support in Drill, PostgreSQL, and Pentaho when Working with JSON Data

JSON is great. It can hold a variety of data types in a predictable format, and it is becoming easier and easier to work with in existing tools, making it a future workhorse for NoSQL-based ETL. However, not least because NoSQL ingestion needs to result in relational tables that follow SQL standards, there is still one problem to work out: ingestion of JSON will not break out nested tables and requires direct knowledge of the data to complete tasks.

This may seem petty, but when millions of records are being read, it clearly is not.

In Drill, this could potentially be overcome by creating a table for every single submap we wish to analyze, but CREATE TABLE from the tool itself will not work. Instead, it is necessary to limit use cases to the data we want to use.

In PostgreSQL, it is possible to reshape JSON data using a query whose individual results can then be saved. It is also possible to 'pop' keys that are unneeded. However, this approach requires many different tables, roughly one per somewhat-normalized form, and it requires recombining the data afterwards. The query below, for example, keeps rows containing mustHaveKey, excludes rows containing notHaveKey, and pops three unneeded keys from a nested object before re-serializing each row to JSON.


SELECT row_to_json(r.*)
FROM (
    SELECT nonJson AS nonJson,
           -- pop the unneeded keys bkey1..bkey3 from the nested level1 object
           ((((data->'level1')::jsonb - 'bkey1')::jsonb - 'bkey2')::jsonb - 'bkey3')::jsonb AS jdata
    FROM table
    -- keep rows containing mustHaveKey; drop rows containing notHaveKey
    WHERE data::jsonb ?| array['mustHaveKey']
      AND (data::jsonb ?| array['notHaveKey']) IS FALSE
) r

Drill is still well ahead of Pentaho and PostgreSQL in terms of conversion, though. PostgreSQL can guess types but has no function that attempts to generate tables automatically. Pentaho requires explicit conversion as well.

Of course, if one already knows every key that will be present, this is not a problem. That, however, means more maintenance, as it is then impossible to write programs that automatically handle changes to the data. Perhaps implicit conversion will arrive soon, but any argument about data type issues should really look at the depth of the standards and conform to them instead of complaining.

Quasi Quotes, Serializability, and Scala

Serialization of functions may seem like a security concern. However, it is also a major need. Tools like Java RMI, Spark, and Akka make at least part of their name on this ability. Where flexibility is a need and systems are less vulnerable, distribution of functions is not necessarily a bad thing. The concepts presented here require Scala 2.11 or higher.

A few use cases to get started:

  • Distributing custom functions with pre-written code
  • Creating localization and centrality in systems that may be spread over multiple Linux containers using tools such as Mesos
  • Executing user code in a trusted, secure, non-networked, and isolated environment
  • Tools such as JSFiddle but with Scala

Follow me through reflection because, well, I may need to explain it to someone soon.

Reflection

Scala lets a programmer manipulate the abstract syntax tree and change elements of code. There is an article on the Scala website about this. The tool here is reflection, the ability of a program to examine and manipulate itself. Programs that use just-in-time compilation are much more easily manipulated than those that do not. Reflection done at run time is called runtime reflection.

Reflection involves the ability to reify (ie. make explicit) otherwise-implicit elements of a program. These elements can be either static program elements like classes, methods, or expressions, or dynamic elements like the current continuation or execution events such as method invocations and field accesses. One usually distinguishes between compile-time and runtime reflection depending on when the reflection process is performed. Compile-time reflection is a powerful way to develop program transformers and generators, while runtime reflection is typically used to adapt the language semantics or to support very late binding between software components. — Heather Miller, Eugene Burmako, Philipp Haller at Scala

Reflection could be useful in building ETL parsing tools, tuning code for operating systems, or changing query builders to fit certain databases.
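As a quick taste, here is a minimal runtime-reflection sketch (the Record case class is a made-up example, not from any particular tool): it lists the fields of a class at run time, the kind of inspection an ETL field mapper might perform.

import scala.reflect.runtime.{universe => ru}

case class Record(id: Int, payload: String)

object ReflectionSketch extends App {
    // Reify the type and walk its declarations at run time
    val tpe = ru.typeOf[Record]
    tpe.decls.collect {
        case m: ru.MethodSymbol if m.isCaseAccessor =>
            println(s"${m.name}: ${m.returnType}")
    }
    // prints: id: Int and payload: String
}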

Now for my specific use case. I need to be able to serialize and pass code to child processes running separate JVMs. Why? To reduce management issues and improve flexibility in software I am writing for work.

Scala Macros

Scala macros look like functions but manipulate the symbol trees behind code. Their uses include tasks such as tuning functions, CSV parsing, and generating data validators. The data validation link provides a nice overview of macros prior to digging into the Scala website. It is possible to build a basic tree from code in this way.
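To see what such a tree looks like, here is a small sketch using the runtime universe: reify lifts an ordinary expression into a tree that can be inspected or transformed.

import scala.reflect.runtime.universe._

// reify turns the expression into an Expr whose tree we can print
val expr = reify { 1 + 1 }
println(showRaw(expr.tree))
// prints roughly: Apply(Select(Literal(Constant(1)), TermName("$plus")), List(Literal(Constant(1))))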

Scala 2.11.0 Macros

In Scala 2.11.0, we create a method and link it to an implementation using the macro keyword, with T coming from the generics of the class enclosing the function. This linking allows us to define elements of the macro.

import scala.language.experimental.macros

class Enclosure[T] {
    // The macro keyword links the declaration to its implementation
    def myFunction[V](f: T => V): Enclosure[V] = macro Func_Impl.myFunction[V, T]
}

Our function takes in a type T and produces a type V. Its implementation is defined in the object Func_Impl by myFunction.

We then define the implementation (a basic and slight deviation from the Scala website). Here the types are only known at run time, so our type tags are weak. The type tags come from the context c's universe. We then define the context parameter, the expression transforming the data, and the return type.

import scala.reflect.macros.blackbox.Context

object Func_Impl {
    // Weak type tags for V and T arrive as context bounds on the context c
    def myFunction[V: c.WeakTypeTag, T: c.WeakTypeTag](c: Context)(f: c.Expr[T => V]): c.Expr[Enclosure[V]] = ...
}

The code here can be reified and used as needed, following the explicit structure where data of type T is transformed to type V.

The code here is meant as an introduction to the Scala website, which goes much more in depth on this subject, including how to manipulate symbol trees and types.
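For a complete, self-contained taste, here is a small blackbox macro sketch of my own (a hypothetical debug macro, not part of the Enclosure example above). It rewrites an expression at compile time to print its own source alongside its value. Remember that macro implementations must be compiled before the code that invokes them.

import scala.language.experimental.macros
import scala.reflect.macros.blackbox.Context

object Debug {
    def debug[T](expr: T): T = macro impl[T]

    def impl[T: c.WeakTypeTag](c: Context)(expr: c.Expr[T]): c.Expr[T] = {
        import c.universe._
        // Capture the source of the expression as a string literal
        val source = Literal(Constant(show(expr.tree)))
        // Rewrite the call site to print "source = value" and return the value
        c.Expr[T](q"""{ val v = $expr; println($source + " = " + v); v }""")
    }
}

// Debug.debug(1 + 1) prints "1.+(1) = 2" and evaluates to 2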

Scala 2.11.8 Macros

Scala 2.11 changed macros quite a bit. There are now whitebox and blackbox macros, with whitebox macros having more flexible signatures than blackbox macros. The transformation completed by a blackbox macro is understood from its input and output types without understanding its inner workings (think of testing code).

Creation of macros changed as well, with contexts divided into whitebox and blackbox packages under scala.reflect.macros. Tags have changed too, with the changes documented on the Scala website.
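The practical difference shows up in which Context an implementation accepts; a quick sketch with trivial identity implementations:

// Blackbox: the declared signature fully describes the macro to its callers
import scala.reflect.macros.blackbox
// Whitebox: the macro may refine its declared return type at expansion time
import scala.reflect.macros.whitebox

object Impls {
    def strict[T: c.WeakTypeTag](c: blackbox.Context)(expr: c.Expr[T]): c.Expr[T] = expr
    def loose[T: c.WeakTypeTag](c: whitebox.Context)(expr: c.Expr[T]): c.Expr[T] = expr
}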

Quasi Quotes

Quasiquotes take the difficulty out of Scala macros for our task. They take harder-to-write code and let us build and use trees from it. We simply write the code, parse it with Scala's reflection tools, and then obtain the function through evaluation. A quasiquote is expanded at compile time.

import scala.tools.reflect.ToolBox
import scala.reflect.runtime.universe.{Quasiquote, runtimeMirror}

// A toolbox backed by the current classloader compiles and evaluates trees
val tb = runtimeMirror(getClass.getClassLoader).mkToolBox()

// Build a tree for the function with a quasiquote
val f = q"def myFunction(i: Int): Int = i + 1"

// Wrap the function in an object so the toolbox can define and reference it
val wrapper = q"object FunctionWrapper { $f }"
val symbol = tb.define(wrapper.asInstanceOf[tb.u.ImplDef])

// Eta-expand the defined method into a function value and evaluate it
val func = tb.eval(q"$symbol.myFunction _").asInstanceOf[Int => Int]
func(1)

// result should be 2

The code here creates a quasiquote, wraps it in an object, defines the symbol through the toolbox, and obtains the function from it by evaluation.

Serialization

With a cursory view of macros behind us, we can now look at serialization. It is actually quite simple: serialization here just means that we take a string of code and serialize it. Classes do not extend the Serializable trait by default, and Any does not extend Serializable either, so it is necessary to find alternate means to serialize the code itself. Some blogs recommend approaches such as shim functions, which may better suit your needs. However, String is serializable and, as long as the functions are defined, quasiquotes are useful here. Just ensure that any other libraries that need linking are already on the class path.

Serialization is simple.

@SerialVersionUID(10L)
class OurClass(val code: String, val v2: Int, val v3: Double) extends Serializable {

    override def toString = s"{code: '$code', v2: '$v2', v3: '$v3'}"
}
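Here is a sketch of the receiving side (my own illustration, assuming the code string is a self-contained function literal and that everything it references is already on the class path): deserialize the wrapper, then compile and evaluate the string with a toolbox.

import java.io._
import scala.tools.reflect.ToolBox
import scala.reflect.runtime.universe.runtimeMirror

object Receiver {
    def roundTrip(sent: OurClass): Int = {
        // Standard Java serialization carries the wrapper across processes
        val buffer = new ByteArrayOutputStream()
        new ObjectOutputStream(buffer).writeObject(sent)
        val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
        val received = in.readObject().asInstanceOf[OurClass]

        // Compile and evaluate the transmitted code string
        val tb = runtimeMirror(getClass.getClassLoader).mkToolBox()
        val f = tb.eval(tb.parse(received.code)).asInstanceOf[Int => Int]
        f(received.v2)
    }
}

// Receiver.roundTrip(new OurClass("(i: Int) => i + 1", 1, 0.0)) returns 2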

Security Mechanisms

This is by no means secure or, in many cases, a bright idea. However, when it is possible and the usefulness outweighs the problems, there are some tools to deploy to ensure a bit of security.

  • Write and pass checksums with a Hamming, Adler, or other function (see the sketch after this list)
  • Encrypt transmissions
  • Ensure security (permissions by user modes, passwords, and user names)
  • Isolate environments that execute this sort of code
  • Only run and generate code passed in this manner on internal networks
  • If absolutely necessary, use a VPN
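As a sketch of the first bullet, an Adler-32 checksum via the JDK's java.util.zip (CRC32 from the same package is a drop-in alternative):

import java.util.zip.Adler32

object Checksums {
    def checksum(code: String): Long = {
        val adler = new Adler32()
        adler.update(code.getBytes("UTF-8"))
        adler.getValue
    }

    // The receiver refuses to evaluate code whose checksum does not match
    def verify(code: String, expected: Long): Boolean = checksum(code) == expected
}

Note that a plain checksum guards against corruption rather than a determined attacker, which is why the encryption and isolation bullets matter as well.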

Conclusion

There are certainly more complex cases, but this is a starter for reflection, macros, quasiquotes, and serialization, and a way to tie them together. The linked resources should prove useful for more depth.

Is there an Ethical Imperative to Own Certain Domains?

Surfing the internet, I came across the site goole.com (visit it at your own risk). It brings up an ethical question: should websites that serve results to a wide range of people, and that can expect an occasional error from everyone, own certain nearby domains? How far should they go?

Please consider the golden rule here. The majority of spelling errors occur within an edit distance of two. Peter Norvig found that as many as 98.9 percent of errors fall within two units, with roughly eighty percent at one distance unit. A quick code-based review of Levenshtein distance can be found here; it is the minimum number of insertions, deletions, and substitutions needed to turn one string into another. Apache also produces a version.
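For reference, here is a standard dynamic-programming implementation of the distance in Scala (my own sketch; the Apache version lives in Commons Text as LevenshteinDistance):

def levenshtein(a: String, b: String): Int = {
    // dist(i)(j) = distance between the first i chars of a and the first j of b
    val dist = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
        if (j == 0) i else if (i == 0) j else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        dist(i)(j) = math.min(
            math.min(dist(i - 1)(j) + 1, dist(i)(j - 1) + 1), // deletion, insertion
            dist(i - 1)(j - 1) + cost)                        // substitution
    }
    dist(a.length)(b.length)
}

// levenshtein("google", "goole") == 1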

Pro Ownership

Let's consider Google, the obvious target of the above link. If someone were looking for a handout and became angry, they could easily try to turn the site into a virus-ridden hell-fest for any unsuspecting victim. People make mistakes. Therefore, it is a decent proposal to at least try to protect the user by owning some of these domains.

Ownership has quite a few pros, some of which are more commercial than ethical.

  • Attain credibility by attempting to protect users
  • Acknowledge humanity
  • Alert ill-doers that you take a stance against ill-will
  • Protect against likeness and image issues

Against Ownership

The issue with ownership is that going so far may create an expectation of going even further. If a company such as Google purchased Goole, would it then need to purchase Gogle and Googls, each at a Levenshtein distance of 1? What about goilele? Perhaps the user then fails to take matters into their own hands and correct their mistakes. Even worse, what if expectations of a payout follow, and failure to pay creates more virus-ridden hell-fests? If this were the case, the acknowledgement may even negate some of the pros.

Cons include:

  • Generating expectations of protection, creating complacency
  • Creating a drive to use such sites to do ill will
  • Generating an opportunity to achieve a payout without effort
  • Cost (especially given the number of available domain names)
  • Errors may not matter for a small, specialized, or obscure website

Conclusion

The pros and cons are actually more numerous than mentioned but still interesting in this case. There should be some attempt to protect a site: not everyone will hit the target page every time, and errors are human. Since most errors fall within two Levenshtein distance units, obtaining the nearby domains most likely to catch errors from most users is helpful. Goole may be so close to Google that it would make a solid purchase. This weighs the need to protect oneself against errors, but why not show an alert or post a blank page instead of a redirect? This approach avoids extreme costs by considering appropriateness, acknowledges human error, and meets most of the criteria laid out here.