2019/2020 – The Years of Data Engineering (Opinion)


Photo by Flickr user Stuck in Customs/Creative Commons

The new year brings new hopes for all of the hottest technologies built over the past few years. The last few years were filled with visualization tools and frameworks. Tableau is now a household name, Salesforce is a workhorse for analytics, SAS continues to grow through JMP, and small players such as Panoply are well funded.

This underscores an important missing component in data: data management tools and frameworks are severely deficient. Many merely perform materialization.

That is changing this year, which means data engineering will be an important term over the next few years. Automation will become a reality.

The Data Engineering Problem

Data engineers create pipelines. This means automating the handling of data from aggregation and ingestion to the modeling and reporting process.

Because data engineers cover the entire pipeline for your data and often implement analytics in a repeatable manner, data engineering is a broad task. Terms such as ETL, ELT, verification, testing, reporting, materialization, standardization, normalization, distributed programming, crontab, Kubernetes, microservices, Docker, Akka, Spark, AWS, REST, Postgres, Kafka, and statistics are commonly slung with ease by data engineers.

Until 2019, integrating systems often meant combining a variety of tools into a cluttered wreck. A company might deploy Python scripts for visualization, use Vantara (formerly Pentaho) for ETL, combine a variety of aggregation tools through Kafka, keep data warehouses in PostgreSQL, and may even still use Microsoft Excel to store data.

The typical company spends $4000 – $8000 per employee providing these pipelines. This cost is unacceptable and can be avoided in the coming years.

ELT will Not Kill Data Engineers

ELT applications promise to get rid of data engineers, but that is pure nonsense meant to attract an ignorant investor's money:

  • ELT is often performed on data sources that already underwent ETL by the companies they were purchased from, such as Acxiom, Nasdaq, and TransUnion
  • ELT eats resources in a significant way, which often limits its use to small data sets
  • ELT ignores issues related to streaming from surveys and other sources, which greatly benefit from the requirements analysis and transformations of ETL
  • ELT is horrible for integration tasks where data standards differ or are nonexistent
  • You cannot run good AI or build models on poorly standardized or non-standardized data

This means that ETL will continue to be an important part of a data engineer's job.

Of course, since data engineers translate business analyst requirements into reality, the job will continue to be secure. Coding may become less important as new products are released but will never go away in the most efficient organizations.

The Limits of Python and GoLang Benefit the Data Engineering Stack

Many people point to Python as a means for making data engineers redundant. This is simply false.

Python is limited, and GoLang is only about 30 times faster than Python. This means that the JVM will rise in popularity with data scientists and even analysts as companies want to make money on the backs of their algorithms. This benefits data engineers, who are typically proficient in at least Java or Scala.

Python works for developers, analysts, and data scientists who want to control tools written in a more powerful language such as C++ or Java. Pentaho dabbled in this before being bought by Hitachi. However, being 60 times slower than the JVM and often requiring three times the resources, Python is not an enterprise grade language.

Python does not provide power. It is not great at parallelism and, because of the global interpreter lock, is effectively single threaded. Any language can achieve parallelism, but Python falls back on heavy OS processes and threads to perform anything asynchronously. This is horrendous.

Consider the case of using Python’s Celery versus Akka, a Scala and Java based tool. Celery and Akka perform the same tasks across a distributed system.

Parsing millions of records in Celery can quickly eat up more than fifty percent of a typical server's resources with a mere ten processes. RabbitMQ, the message broker behind Celery, can only handle 50 million messages per second on a cluster. Depending on the use case, Celery may also require Redis to run effectively. This means that an 18 logical core server with 31 gigabytes of RAM can be severely bogged down simply processing tasks.

Akka, on the other hand, is the basis for Apache Spark. It is lightweight and all-inclusive. 50 million messages per second are attainable with 10 million actors running concurrently at much less than fifty percent of a typical server's resources. With not every use case requiring Spark, even in data engineering, this is an outstanding difference. Not requiring a message routing and results backend also means that less skill is required for deployment.

The Rise, Fall, and Rise of Scala and Streamlining

When I started programming in Scala, the language was fairly unheard of. Many co-workers merely looked at this potent language as a curiosity. Eventually, Scala's popularity started to wane as Java developers remained focused on websites and ignored creating the same frameworks for Scala that exist in Python.

That is changing. With the rise of R, whose syntax is remarkably similar to Scala's, mathematicians and analysts are becoming used to programming in increasingly complex languages.

Perhaps due to this, Scala is making its way back into the lexicon of developers. The power of Python was greatly reduced in 2017 as previously non-existent or non-production level tools were released for the JVM.

Consider what is now at least version 1.0 on the JVM:

  • Nd4j and Nd4s: A Scala and Java based n-dimensional array framework that boasts speeds faster than NumPy
  • Dl4J: Skymind is a terrific company producing tools comparable to Torch
  • TensorFlow: Contains a Java API that is readily usable from Scala
  • Neanderthal: A Clojure based linear algebra system that is blazing fast
  • OpenNLP: A new framework that, unlike the Stanford NLP tools, is actively developed and includes named entity recognition and other powerful transformative tools
  • Bytedeco: This project is filled with angels (I actually think they came from heaven) whose innovative and nearly automated JNI creator has created links to everything from Python code to Torch, libpostal, and OpenCV
  • Akka: Lightbend continues to produce distributed tools for Scala with now open sourced split brain resolvers that go well beyond my majority resolver
  • MongoDB connectors: Python’s MongoDB connectors are resource intensive due to the rather terrible nature of Python byte code
  • Spring Boot: Scala and Java are interoperable but benchmarks of Spring Boot show at least a 10000 request per second improvement over Django
  • Apereo CAS: A single sign on system that adds terrific security to disparate applications

Many of these frameworks are available in Java. Since Scala runs on the JVM and can call any Java library, the languages are interoperable. Scala is cleaner, functional, highly modular, and requires much less code than Java, which puts these tools within the reach of analysts.

What does this mean for a data engineer?

It means attaining 1000 times the speed of Python on a single machine with a significant cost reduction, and up to a 33 percent code reduction compared to Java. Some developers also report a 20 percent speed boost over Java, but that is likely due to poor coding practices.

It also means moving from millions of moving parts to a steadfast system.

The result is clear. My own company is switching away from Python everywhere except our responsive, front end heavy web application, for a fifty percent reduction in hardware costs.

Putting it all Together to Unclutter a Mess

With everything that Scala and the JVM offer, data engineers now have a potent tool for automation. These valuable employees may not be creating the algorithms, but they will be transforming data in smart ways that produce real value.

Companies no longer have to rely on archaic languages to produce messy systems and this will translate directly into value. Data engineers will be behind this increase in value as they can more easily combine tools into a coherent and flexible whole.

Conclusion

The continued rise of JVM backed tools that began in 2018 will make data pipeline automation a significant part of a company's IT budget. Data engineers will be behind the evolution of data pipelines from disparate systems to a streamlined whole backed by custom code and new products.

2019 and 2020 will be the years of data engineering. After that, we may just be seeing the creation of Skynet.

Akka: An Introduction

Akka's documentation is immense. This series helps tackle the many components by providing a working example of the master-slave design pattern built with this powerful tool. The following article reviews the higher level concepts behind Akka and its usage.

Links are provided to different parts of the Akka documentation throughout the article.


Akka

Akka is a software tool used to build multi-threaded and distributed systems based on the actor model. It takes care of lower level systems concerns by providing high level APIs for node and actor generation.

Actors are the primitives behind Akka. They are useful for performing repeated tasks concurrently.  Actors run until terminated, receiving work through message passing.


Resource Usage

Akka is extremely lightweight. The creators boast that the tool can handle millions of actors on a single machine.

Message passing occurs through mailboxes. The maximum number of messages a mailbox holds is configurable, with a default of 1,000 messages, but individual messages must remain under one megabyte.
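As a rough sketch (the keys follow the Akka mailbox documentation; the mailbox name and capacity here are illustrative), a bounded mailbox can be defined in configuration and attached to an actor such as the MyActor class shown in the next section:

import akka.actor.{ActorSystem, Props}
import com.typesafe.config.ConfigFactory

object BoundedMailboxExample {

     //a named mailbox capped at 1000 messages; senders wait up to 10 seconds when it is full
     val conf = ConfigFactory.parseString(
       """
         |bounded-mailbox {
         |  mailbox-type = "akka.dispatch.BoundedMailbox"
         |  mailbox-capacity = 1000
         |  mailbox-push-timeout-time = 10s
         |}
       """.stripMargin).withFallback(ConfigFactory.load())

     def main(args: Array[String]): Unit = {
       val system = ActorSystem("MySystem", conf)
       //point the actor at the named mailbox when it is created
       val actor = system.actorOf(Props[MyActor].withMailbox("bounded-mailbox"), "boundedActor")
     }
}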

The Actor

The actor is the universal primitive used in Akka. Unlike a thread in a programming language, this primitive runs like a daemon server. As such, it should be shut down gracefully.

Actors are user created.

import akka.actor.{Actor, ActorLogging, ActorRef, ActorSystem, Props}
import com.typesafe.config.ConfigFactory

class MyActor extends Actor with ActorLogging{

     override def preStart()= log.debug("Starting")
     override def postStop()= log.debug("Stopping")
     override def preRestart(reason: Throwable, message: Option[Any]) = log.error(s"Restarting because of ${reason.getMessage}. ${message}")
     override def postRestart(reason : Throwable) = log.debug("Restarted")

     override def receive: Receive = {
         case _ => sender() ! "Hello from Actor"
     }
}

object MyActor{
   def setupMyActor()={
        val conf = ConfigFactory.load()
        val system = ActorSystem("MySystem",conf)
        val actor : ActorRef = system.actorOf(Props[MyActor],name = "myactor")
   }
}

 

The example above creates an actor and a Scala companion object for instantiation.

Actors must extend Actor. ActorLogging provides the log library. The optional functions preRestart and postRestart handle exceptions, while the optional preStart and postStop methods handle setup and tear down tasks. The basic actor above incorporates logging and error processing.

An actor can:

  • Create and supervise other actors
  • Maintain a State
  • Control the flow of work in a system
  • Perform a unit of work on request or repeatably
  • Send and receive messages
  • Return the results of a computation

Akka’s serialization is extremely powerful. Anything available across a cluster or on the classpath that implements Serializable can be sent to and from an actor. Instances of classes are de-serialized without having the programmer recast them.
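As an illustrative sketch (RecordBatch and BatchHandler are hypothetical, not part of Akka), a plain case class can be sent to an actor and pattern matched directly, with no explicit casting:

import akka.actor.{Actor, ActorLogging}

//case classes are Serializable by default, so instances can cross the cluster untouched
case class RecordBatch(source: String, rows: Seq[Map[String, String]])

class BatchHandler extends Actor with ActorLogging {

     //the message arrives already de-serialized as a RecordBatch
     override def receive: Receive = {
         case RecordBatch(source, rows) =>
             log.info(s"Received ${rows.size} rows from ${source}")
             sender() ! rows.size
     }
}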

When to Use Akka

Actor systems are not a universal solution. When not performing repeated tasks and not benefiting from high levels of concurrency, they are a hindrance.

State persistence also weighs heavily in the use of an actor system. Take the example of cookies in network requests. Maintaining different network sessions across different remote actors can be highly beneficial in ingestion.
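For instance, a hypothetical ingestion actor might hold its session cookie as internal state so that every request it issues reuses the same network session (SessionIngestor and its messages are illustrative only):

import akka.actor.{Actor, ActorLogging}

case class StartSession(cookie: String)
case class Fetch(url: String)

class SessionIngestor extends Actor with ActorLogging {

     //the cookie lives only inside this actor, so each remote ingestor keeps its own session
     private var sessionCookie: Option[String] = None

     override def receive: Receive = {
         case StartSession(cookie) => sessionCookie = Some(cookie)
         case Fetch(url) =>
             //a real implementation would attach the stored cookie to the outgoing request
             log.info(s"Fetching ${url} with cookie ${sessionCookie.getOrElse("<none>")}")
     }
}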

Any task provided to an actor should contain very little new code and a limited number of configuration variables.

Questions that should be asked based on these concepts include:

  • Can I break up tasks into sufficiently large quantities of work to benefit from concurrency?
  • How often will  tasks be repeated in the system?
  • How minimal can I make the configuration for the actor if necessary?
  • Is there a need for state persistence?
  • Is there a significant need for concurrency or is it a nice thought?
  • Is there a resource constraint that distribution can solve or that will limit threading?

State is definitely a large reason to use Akka. This could be in the form of actually  maintaining variables or in the actor itself.

In some distributed use cases involving the processing of enormous numbers of short lived requests, the actor's own state and Akka's mailbox capabilities are what matter most. This is the reasoning behind tools built on Akka such as Spark.

As is always the case when deciding to create a new system, the following should be asked as well:

  • Can I use an existing tool such as Spark or TensorFlow?
  • Does the time required to build the system outweigh the overall benefit it will provide?

Clustering

Clustering is available in Akka. Akka provides high level APIs for generating distributed clusters. Specified seed nodes handle communications, serving as the system’s entry point.

Network design is provided entirely by the developer. Since node generation, logging, fault tolerance, and basic communication are the only pieces of a distributed system that Akka handles, any distribution model will suffice. Two common models are the master-slave and graph based models.

Akka clusters are resilient with well developed fault tolerance.

Configuration

Configuration is performed either through a file or in a class. Many components can be configured including logging levels, cluster components, metrics collection, and routers.
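As a rough sketch of the in-class approach (the key names follow the classic Akka 2.4/2.5 remoting and cluster settings and may differ in newer versions), the log level and seed nodes can be supplied programmatically and handed to the actor system:

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object ConfiguredSystem {

     //programmatic equivalent of an application.conf entry; falls back to the packaged config
     val conf = ConfigFactory.parseString(
       """
         |akka {
         |  loglevel = "INFO"
         |  actor.provider = "cluster"
         |  remote.netty.tcp {
         |    hostname = "127.0.0.1"
         |    port = 2551
         |  }
         |  cluster.seed-nodes = ["akka.tcp://MySystem@127.0.0.1:2551"]
         |}
       """.stripMargin).withFallback(ConfigFactory.load())

     val system = ActorSystem("MySystem", conf)
}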

Conclusion

This article is the entry point for the Akka series, providing the basic understanding needed as we begin to build a cluster.

Sbt Pack With Xerial

Adding Jars to a classpath should not be a chore. Often, using retrieveManaged in an SBT build is not quite what we want. When dealing with more than a few minor dependencies, having each dependency placed in its own folder is problematic. This article discusses a solution to this issue using Xerial’s sbt-pack plugin.

 

Xerial Pack

Xerial offers a plugin that will package all jars in a single folder and create a bat for executing the configurable main class. This allows for every jar to be placed on the classpath without listing every folder. It also creates a single directory for all dependencies.

Simply place the following in project/plugins.sbt:

addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.8.2")  // for sbt-0.13.x or higher

Then specify the packaging options in build.sbt:

packAutoSettings

In this instance, the main class will be automatically found. More options are discussed at the Xerial Github page.
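When the main class should be set explicitly instead, the plugin's packMain setting maps a launch script name to a fully qualified class. A sketch following the sbt-pack README, with com.example.Main as a placeholder:

//build.sbt: explicit launcher configuration instead of packAutoSettings
packSettings

//generates a launch script named "my-app" that runs the given main class
packMain := Map("my-app" -> "com.example.Main")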

Packaging the jars requires a single command:

sbt pack

Conclusion

Using the classpath with many dependencies does not need to be a chore. Simply import the Xerial plugin and run sbt pack.

Enriching Scala With Implicits

Imagine you are an ETL developer using Spring Cloud Data Flow. Nothing is really available for distributed systems and streaming ETL that is as powerful as this tool. Alteryx and Pentaho are at least a year away from pushing out anything as capable. While Pentaho might work, there are just too many holes to fill.

However, you could do with a more compact code language than Java when programming for Spring. A powerful solution is to combine the Spring ecosystem with Scala, using implicits to eliminate redundant code.

This article focuses on using the enrichment pattern in Scala code through the IterableLike library and the concept of the implicit.

 


Time is Money

Implicits

Implicits allow values that are in scope to be supplied automatically to parameters marked implicit when no argument is given explicitly. Only one implicit of a given type may be in scope at a time:

implicit val myImplicitString : String = "hello there"

def printHello(implicit str : String) = {
   println(str) //prints "hello there" when called as printHello
}

This snippet creates an implicit string that is supplied automatically when printHello is called without an argument. Defining two implicit strings of the same type in scope makes the implicit ambiguous and produces a compile error.

Implicits also make the enrichment pattern, described in the next to last section, possible: an implicit conversion attaches new methods to an existing library type whenever a method that does not exist on that type is called.

Remap An Object

Our enrichment example contains a method that removes an item from an IterableLike object only when a specific condition holds:

import scala.collection.IterableLike
import scala.collection.generic.CanBuildFrom

class IterableScalaFuncs[A,Repr](xs : IterableLike[A,Repr]){

    /**
      * Remove an object by a single property value.
      * @param f              The thunk to use
      * @param cbf            The CanBuildFrom to get a builder which should not be touched
      * @tparam That          The result type
      * @return That which should just be of type A
    */
    def removeObjectMatching[That](f : A => Boolean)(implicit cbf : CanBuildFrom[Repr,A,That]):That={
        val builder = cbf(xs.repr)
        val it = xs.iterator

        while(it.hasNext){
            val o = it.next()
            if(!f(o)){
                builder += o
            }
        }
        builder.result()
    }

}

IterableScalaFuncs contains a removeObjectMatching method that takes the result type as a type parameter, the thunk to match with in the first parameter list, and implicitly receives the existing CanBuildFrom from IterableLike in the next parameter list. It then creates a builder of our type and proceeds to populate it with objects not matching the thunk before returning a new collection with the appropriate items removed.

Enrichment Pattern

The enrichment pattern in Scala embellishes libraries by appending code to them as opposed to creating a wrapper class which must be instantiated. This requires implicitly attaching method definitions to existing libraries but allows Scala to be easily used from the console and in classes.

The class in the previous section can be attached to all IterableLike collections implicitly:

/**
  * The enrichment for Iterables wrapping IterableLike with IterableScalaFuncs
  * @param xs       Our IterableLike object
  * @tparam A       The type of the Iterable
  * @tparam Repr    Traversable Repr
*/
implicit def enrichIterable[A, Repr](xs: IterableLike[A, Repr]) = new IterableScalaFuncs(xs)

The method enrichIterable attaches to the target collections and is used as follows:

import ScalaFuncImplicits._
val list : List[(Int,Int)] = List[(Int,Int)]((1,2),(2,3)).removeObjectMatching(_._1 == 1) //should produce List((2,3))

Conclusion

This article reviewed the power of Scala Implicits to reduce redundant code without accounting for every type. The enrichment pattern can be used to easily integrate methods with existing libraries.

Code is available on GitHub.

A Quick Note on SBT Packaging: Packaging Library Dependencies, Skipping Tests, and Publishing Fat Jars

Sbt is a powerful build tool that simplifies the Maven build system and, with the tilde operator, supports nearly automated recurring builds.

There is still the matter of slow build times with assembly. Avoiding slow builds with sbt assembly and automating the process with sbt ~package is simple. However, running the packaged output still requires its dependencies on the classpath.

Even then, compiling a fat JAR for use across an organization may be necessary in some circumstances such as when using native libraries. This can avoid costly build problems.

This article reviews packaging dependencies with sbt assembly, offers tips for avoiding tests across sbt, and provides an overview of publishing fat JARs with assembly using the sbt commands publish and publish-local.


Packaging Dependencies

Getting library dependencies to package alongside your jar is simple. Add the following code to build.sbt

retrieveManaged := true

The jars will be under root at /lib_managed.

Skipping Tests

If you want to avoid testing as well, add:

test in assembly := {} //skip tests when building the assembly

This overrides the test task in the assembly scope with an empty task, so no tests run when the fat JAR is built.

Publishing An Assembled Fat JAR

There may be a need to publish a final fat JAR locally or to a remote repository such as Nexus. This may create merge conflicts in other assemblies or with other dependencies, so care should be exercised.

The following code, taken directly from the sbt-assembly GitHub repository, pushes an assembled jar to a repository:

artifact in (Compile, assembly) := {
  val art = (artifact in (Compile, assembly)).value
  art.copy(`classifier` = Some("assembly"))
}

addArtifact(artifact in (Compile, assembly), assembly)

The sbt publish and sbt publish-local commands should now push a fat JAR to the target repository.
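Pushing to a remote repository such as Nexus also requires a publishTo destination and credentials in build.sbt. A minimal sketch, with the repository URL and credentials path as placeholders:

//where sbt publish should push the assembled artifact
publishTo := Some("Nexus Releases" at "https://nexus.example.com/repository/maven-releases/")

//credentials for the repository, kept outside of version control
credentials += Credentials(Path.userHome / ".sbt" / ".credentials")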

Conclusion

Avoiding constant assembly runs and publishing fat JARs requires only simple changes to build.sbt. With them, it is possible to obtain dependencies for the classpath, avoid running tests on every assembly, and publish fat JARs.

Faster Compilation of Scala Jars

Compiling Scala code is a pain. For testing and even deployment, there is a much faster way to run Scala code. Consider a recent project I am working on. The addition of three large fat jars and OpenCV caused a compile time of roughly ten minutes. Obviously, that is not production quality. As much fun as it is to check how the president is causing stocks to roller coaster when he speaks, or to yell expletives around co-workers, it is probably not desirable.

This article reviews continuous packaging with sbt. The sbt assembly plugin, covered in other articles in the series, is not discussed here.


Repeated Compiling and Packaging

Sbt offers an extremely easy way to continuously package sources. Use it.

sbt ~package

The tilde works for any command and forces sbt to wait for a source change. For instance, ~compile has the same effect as ~package but runs the compile command instead of package.

This spins up a check for source file changes and packages sources whenever changes are discovered. Pressing enter will exit the process.

Running the Packaged Sources

We were using sbt assembly for standalone jars. For testing, this quickly becomes a bad idea. Instead, place a large dependency jar or multiple smaller dependency jars in a folder and add them to the classpath. The following approach took compilation from ten minutes to under one. Packaging takes mere seconds, but a full update will take longer.

$ sbt clean
$ sbt package
$ java -cp ".:/path/to/my/jars/*" package.path.to.main.Main [arguments]

This set of commands cleans, packages, and then runs the Main class with your arguments.

This works miracles. The classpath separator is system dependent: a colon separates entries on Linux, while a semicolon separates entries on Windows.
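For example, the same run on Windows would look like the following, with the jar directory as a placeholder:

java -cp ".;C:\path\to\my\jars\*" package.path.to.main.Main [arguments]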

It is possible to avoid building fat jars using this method. This greatly reduces packaging times which can grow to minutes. In combination with a continuous build process, it takes a fraction of the time to run Scala code as when using the assembly command.

Make Sure You Use the Right Jars

When moving to classpath-based runs with continual packaging, the merge strategy applied by sbt assembly disappears. To avoid conflicts, try creating an all in one dependency jar and placing only that on the classpath alongside the jar containing your main class.

Conclusion

This article reviewed a way to more efficiently package and run Scala jars without waiting for assembly. The method works well in test or in a well-defined environment.

JavaCV Basics: Basic Image Processing

Here, we analyze some of the basic image processing tools in OpenCV and their use in GoatImage.

All code is available on GitHub under GoatImage. To fully understand this article, read the related articles and look at this code.

Select functions are exemplified here. In GoatImage, JavaDocs can be generated further explaining the functions. The functions explained are:

  • Sharpen
  • Contrast
  • Blur

Dilate, rotate, erode, min thresholding, and max thresholding are left to the code. Thresholding in OpenCV is described in depth with graphs and charts via the documentation.




Basic Processing in Computer Vision

Basic processing is the key to successful recognition. Training sets come in a specific form. Pre-processing is usually required to ensure the accuracy and quality of a program. JavaCV and OpenCV are fast enough to improve algorithmic performance in a variety of circumstances at a relatively small cost in speed. Each transform applied to an image takes time and memory, but pays off handsomely when done correctly.

Kernel Processing

Most of these functions are linear transformations. A linear transformation uses a function to map one matrix to another (Ax = b). In image processing, a kernel, a small weighted matrix, performs this mapping: each output pixel is a weighted combination of a pixel and its neighbors.

For an overview of image processing kernels, see wikipedia.

Kernels may be generated in JavaCV.


  /**
    * Create a kernel from a double array (write large kernels more understandably)
    * @param kernelArray      The two dimensional array of signed integer kernel values
    * @return                 The kernel mat
    */
  def generateKernel(kernelArray: Array[Array[Int]]):Mat={
    val m = if(kernelArray != null) kernelArray.length else 0
    if(m == 0 ){
      throw new IllegalStateException("Your Kernel Array Must be Initialized with values")
    }

    if(kernelArray(0).length != m){
      throw new IllegalStateException("Your Kernel Array Must be Square and not sparse.")
    }

    val kernel = new Mat(m,m,CV_32F,new Scalar(0))
    val ki = kernel.createIndexer().asInstanceOf[FloatIndexer]

    for(i <- 0 until m){
      for(j <- 0 until m){
        ki.put(i,j,kernelArray(i)(j))
      }
    }
    kernel
  }

More reliably, there is a function for generating a Gaussian Kernel.

/**
    * Generate the square gaussian kernel. I think the pattern is a(0,0)=1 a(1,0) = n a(2,0) = n+2i with rows as a(2,1) = a(2,0) * n and adding two to the middle then subtracting.
    * However, there were only two examples on the page I found so do not use that without verification.
    *
    * @param kernelMN    The m and n for our kernel matrix
    * @param sigma       The sigma to multiply by (kernel standard deviation)
    * @return            The resulting kernel matrix
    */
  def generateGaussianKernel(kernelMN : Int, sigma : Double):Mat={
    getGaussianKernel(kernelMN,sigma)
  }

Sharpen with a Custom Kernel

Applying a kernel in OpenCV can be done with the filter2D method.

filter2D(srcMat,outMat,srcMat.depth(),kernel)

Here a sharpening kernel using the function above is applied.

/**
    * Sharpen an image with a standard sharpening kernel.
    * @param image    The image to sharpen
    * @return         A new and sharper image
    */
  def sharpen(image : Image):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())

    val karr : Array[Array[Int]] = Array[Array[Int]](Array(0,-1,0),Array(-1,5,-1),Array(0,-1,0))
    val kernel : Mat = this.generateKernel(karr)
    filter2D(srcMat,outMat,srcMat.depth(),kernel)
    new Image(new IplImage(outMat),image.name,image.itype)
  }

Contrast

Contrast kicks up the color intensity in images by equation, equalization, or based on neighboring pixels.

One form of Contrast applies a direct function to an image:

/**
    * Use an equation applied to the pixels to increase contrast. It appears that
    * the majority of the effect occurs from converting back and forth with a very
    * minor impact for the values. However, the impact is softer than with equalizing
    * histograms. Try sharpen as well. The kernel kicks up contrast around edges.
    *
    * (maxIntensity/phi)*(x/(maxIntensity/theta))**0.5
    *
    * @param image                The image to use
    * @param maxIntensity         The maximum intensity (numerator)
    * @param phi                  Phi value to use
    * @param theta                Theta value to use
    * @return
    */
  def contrast(image : Image, maxIntensity : Double, phi : Double = 0.5, theta : Double = 0.5):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())

    val usrcMat = new Mat()
    val dest = new Mat(srcMat.rows(),srcMat.cols(),usrcMat.`type`())
    srcMat.convertTo(usrcMat,CV_32F,1/255.0,0)

    pow(usrcMat,0.5,dest)
    multiply(dest,(maxIntensity / phi))
    val fm = 1 / Math.pow(maxIntensity / theta,0.5)
    multiply(dest, fm)
    dest.convertTo(outMat,CV_8U,255.0,0)

    new Image(new IplImage(outMat),image.name,image.itype)
  }

Here the image is manipulated using matrix equations to form a new image where pixel intensities are improved for clarity.

Another form of contrast equalizes the image histogram:

/**
    * A form of contrast based around equalizing image histograms.
    *
    * @param image    The image to equalize
    * @return         A new Image
    */
  def equalizeHistogram(image : Image):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())
    equalizeHist(srcMat,outMat)
    new Image(new IplImage(outMat),image.name,image.itype)
  }

The JavaCV method equalizeHist is used here.

Blur

Blurring uses averaging to dull images.

Gaussian blurring uses a Gaussian derived kernel to blur. This kernel weights pixels by their distance from the center rather than weighting neighboring pixels equally.

 /**
    * Perform a Gaussian blur. The larger the kernel the more blurred the image will be.
    *
    * @param image              The image to use
    * @param kernelMN           The kernel height and width should match (for instance 5x5)
    * @param sigma              The sigma to use in generating the matrix
    * @param brightenFactor     A factor to brighten the result by when greater than 0
    * @return                   A new Image
    */
  def gaussianBlur(image : Image, kernelMN : Int = 5, sigma : Double = 60, brightenFactor : Int = 0):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())

    //apply the Gaussian kernel with the built in blur function
    GaussianBlur(srcMat,outMat,new Size(kernelMN,kernelMN),sigma)

    var outImage : Image = new Image(new IplImage(outMat),image.name,image.itype)

    if(brightenFactor > 0){
      outImage = this.brighten(outImage,brightenFactor)
    }
    outImage
  }

A box blur uses a straight kernel to blur, often weighting pixels equally.

 /**
    * Perform a box blur and return a new Image. Increasing the factor has a significant impact.
    * This algorithm tends to be overly powerful. It wiped the lines out of my test image.
    *
    * @param image   The Image object
    * @param factor  The box weight applied to each cell of the kernel (default 1)
    * @param depth   The depth to use with -1 as default corresponding to image.depth
    * @return        A new Image
    */
  def boxBlur(image : Image,factor: Int = 1,depth : Int = -1):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())

    //build kernel
    val kernel : Mat = this.generateKernel(Array(Array(factor,factor,factor),Array(factor,factor,factor),Array(factor,factor,factor)))
    divide(kernel,9.0)

    //apply kernel
    filter2D(srcMat,outMat, depth, kernel)

    new Image(new IplImage(outMat),image.name,image.itype)
  }

Unsharp Masking

Once a blurred Mat is achieved, it is possible to perform an unsharp mask. The unsharp mask brings out certain features by subtracting the blurred image from the original while taking into account an additional factor.

def unsharpMask(image : Image, kernelMN : Int = 3, sigma : Double = 60,alpha : Double = 1.5, beta : Double= -0.5,gamma : Double = 2.0,brightenFactor : Int = 0):Image={
    val srcMat : Mat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())
    val retMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())

    //using these methods allows the matrix kernel size to grow
    GaussianBlur(srcMat,outMat,new Size(kernelMN,kernelMN),sigma)
    addWeighted(srcMat,alpha,outMat,beta,gamma,retMat)

    var outImage : Image = new Image(new IplImage(outMat),image.name,image.itype)

    if(brightenFactor > 0){
      outImage = this.brighten(outImage,brightenFactor)
    }

    outImage
  }

Conclusion

This article examined various image processing techniques.

JavaCV Basics: Splitting Objects

Here we put together functions from previous articles to describe a use case where objects are discovered in an image and rotated.

All code is available on GitHub under the GoatImage project.


Why Split Objects

At times, objects need to be tracked reliably, OCR needs to be broken down to more manageable tasks, or there is another task requiring splitting and rotation. Particularly, recognition and other forms of statistical computing benefit from such standardization.

Splitting allows object by object recognition which may or may not improve accuracy depending on the data used to train an algorithm and even the type of algorithm used. Bayesian based networks, including RNNs, benefit from this task significantly.

Splitting and Rotating

The following function in GoatImage performs contouring to find objects, creates minimum area rectangles, and finally rotates objects based on their skew angle.

/**
    * Split an image using an existing contouring function. Take each ROI, rotate, and return new Images along with the original.
    *
    * @param image              The image to split objects from
    * @param contourType        The contour type to use defaulting to CV_RETR_LIST
    * @param minBoxArea         Minimum box area to accept (-1 means everything and is default)
    * @param maxBoxArea         Maximum box area to accept (-1 means everything and is default)
    * @param show               Whether or not to show the image. Default is false.
    * @param xPosSort           Whether or not to sort the objects by their x position. Default is true. This is faster than a full sort
    * @return                   A tuple with the original Image and a List of split out Image objects named by the original_itemNumber
    */
  def splitObjects(image : Image, contourType : Int=  CV_RETR_LIST,minBoxArea : Int = -1, maxBoxArea : Int = -1, show : Boolean= false,xPosSort : Boolean = true):(Image,List[(Image,BoundingBox)])={
    //contour
    val imTup : (Image, List[BoundingBox]) = this.contour(image,contourType)

    var imObjs : List[(Image,BoundingBox)] = List[(Image,BoundingBox)]()

    var boxes : List[BoundingBox] = imTup._2

    //ensure that the boxes are sorted by x position
    if(xPosSort){
      boxes = boxes.sortBy(_.x1)
    }

    if(minBoxArea > 0){
        boxes = boxes.filter({x => (x.width * x.height) > minBoxArea})
    }

    if(maxBoxArea > 0){
      boxes = boxes.filter({x => (x.width * x.height) < maxBoxArea})
    }

    //get and rotate objects
    var idx : Int = 0
    for(box <-  boxes){
      println(box)
      val im = this.rotateImage(box.image,box.skewAngle)
      if(show){
        im.showImage(s"My Box ${idx}")
      }
      im.setName(im.getName().replaceAll("\\..*","")+s"_${idx}."+im.itype.toString.toLowerCase())
      imObjs = imObjs :+ (im,box)
      idx += 1
    }

    (image,imObjs)
  }

Contours are filtered after sorting if desired. For each box, rotation is performed and the resulting image returned as a new Image.
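A hypothetical usage sketch (goat and pageImage are placeholders for a GoatImage instance and a loaded Image) that keeps only boxes larger than 500 square pixels and prints the resulting pieces:

val (original, pieces) = goat.splitObjects(pageImage, minBoxArea = 500, xPosSort = true)

pieces.foreach { case (piece, box) =>
     println(s"${piece.getName()} from box at x=${box.x1} (${box.width} x ${box.height})")
}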

Conclusion

Here the splitObjects function of GoatImage is reviewed, revealing how the library and OpenCV split and rotate objects as part of standardization for object recognition and OCR.

JavaCV Basics: Cropping

The ROI code is broken on the JavaCV example site. Here we will look at cropping an image by defining a region of interest. The remaining JavaCV example code should work.

All code is available on GitHub under the GoatImage project.


Defining an ROI

Setting a Region of Interest (ROI) requires the cvSetImageROI function, which takes an IplImage and a CvRect representing the region of interest.

cvSetImageROI(image, rect)

Putting it all Together By Cropping

Cropping takes our ROI and generates a new image fairly directly.

/**
    * Crop an existing image.
    *
    * @param image      The image to crop
    * @param x          The starting x coordinate
    * @param y          The starting y coordinate
    * @param width      The width
    * @param height     The height
    * @return           A new Image
    */
  def crop(image : Image, x : Int, y : Int, width : Int, height : Int): Image={
    val rect = new CvRect(x,y,width,height)
    val uImage : IplImage = image.image.clone()
    cvSetImageROI(uImage, rect)
    //cvGetSize returns the ROI dimensions once the ROI is set, and cvCopy copies only that region
    val cropped : IplImage = cvCreateImage(cvGetSize(uImage),image.image.depth(),image.image.nChannels())
    cvCopy(uImage, cropped)
    new Image(cropped,image.name,image.itype)
  }

Conclusion

Simple cropping was introduced to rectify an issue with the ROI example from JavaCV.

JavaCV Basics: Rotating

Rotating an image is a common task. This article reviews how to rotate a matrix using JavaCV.

These tutorials utilize GoatImage. The Image object used in the code examples comes from this library.


Rotation Matrix

The rotation matrix  is used to map from one pixel position to another. The matrix, shown below, uses trigonometric functions.

R(θ) =
| cos θ   −sin θ |
| sin θ    cos θ |

Rotation is a linear transformation. A linear transformation uses a function to map from one matrix to another. In image processing, the matrix kernel is used to perform this mapping.

Rotation in JavaCV 3

Rotation in JavaCV 3 utilizes a generated rotation matrix and the warp affine function. The function getRotationMatrix2D generates a 2x3 affine transformation matrix from a center Point2f, an angle in degrees, and a scale.
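Per the OpenCV documentation, with scale s, angle θ, and center (c_x, c_y), the matrix produced is:

M = \begin{bmatrix} \alpha & \beta & (1-\alpha)\,c_x - \beta\,c_y \\ -\beta & \alpha & \beta\,c_x + (1-\alpha)\,c_y \end{bmatrix}, \qquad \alpha = s\cos\theta, \quad \beta = s\sin\theta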

/**
    * Rotate an image by a specified angle using an affine transformation.
    *
    * @param image      The image to rotate
    * @param angle      The angle to rotate by
    * @return           A rotated Image
    */
  def rotateImage(image : Image,angle : Double):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.cols(),srcMat.rows(),srcMat.`type`())
    val cosv = Math.cos(angle)
    val sinv = Math.sin(angle)
    val width = image.image.width
    val height = image.image.height
    val cx = width/2
    val cy = height/2

    //(image.image.width*cosv + image.image.height*sinv, image.image.width*sinv + image.image.height*cosv);
    val rotMat : Mat = getRotationMatrix2D(new Point2f(cx.toInt,cy.toInt),angle,1)

    warpAffine(srcMat,outMat,rotMat,srcMat.size())
    new Image(new IplImage(outMat),image.name,image.itype)
  }

The Angle

The angle reported by OpenCV, and thus JavaCV, is centered at -45 degrees due to the use of vertical contouring. If the reported angle is less than -45 degrees, adding 90 degrees corrects this offset.

val angle = if(minAreaRect.angle < -45.0) minAreaRect.angle + 90 else minAreaRect.angle

Conclusion

In this tutorial we reviewed the function in GoatImage for rotating images using OpenCV. The functions getRotationMatrix2D and warpAffine were introduced. Basic kernel processing was introduced as well.