JavaCV Basics: Basic Image Processing

Here, we analyze some of the basic image processing tools in OpenCV and their use in GoatImage.

All code is available on GitHub under GoatImage. To fully understand this article, read the related articles and look at this code.

Select functions are exemplified here. In GoatImage, JavaDocs can be generated further explaining the functions. The functions explained are:

  • Sharpen
  • Contrast
  • Blur

Dilate, rotate, erode, min thresholding, and max thresholding are left to the code. Thresholding in OpenCV is described in depth with graphs and charts via the documentation.

Related Articles:

Basic Processing in Computer Vision

Basic processing is the key to successful recognition. Training sets come in a specific form. Pre-processing is usually required to ensure the accuracy and quality of a program. JavaCV and OpenCV are fast enough to work in a variety of circumstances to improve algorithmic performance at a much lower speed reduction cost. Each transform applied to an image takes time and memory but will pay off handsomely if done correctly.

Kernel Processing

Most of these functions are linear transformations. A linear transformation uses a function to map one matrix to another (Ax = b). In image processing, the matrix kernel is used to do this. Basically a weighted matrix can be used to map a certain point or pixel value.

For an overview of image processing kernels, see wikipedia.

Kernels may be generated in JavaCV.

    * Create a kernel from a double array (write large kernels more understandably)
    * @param kernelArray      The double array of doubles with the kernel values as signed ints
    * @return                 The kernel mat
  def generateKernel(kernelArray: Array[Array[Int]]):Mat={
    val m = if(kernelArray != null) kernelArray.length else 0
    if(m == 0 ){
      throw new IllegalStateException("Your Kernel Array Must be Initialized with values")

    if(kernelArray(0).length != m){
      throw new IllegalStateException("Your Kernel Array Must be Square and not sparse.")

    val kernel = new Mat(m,m,CV_32F,new Scalar(0))
    val ki = kernel.createIndexer().asInstanceOf[FloatIndexer]

    for(i <- 0 until m){
      for(j <- 0 until m){

More reliably, there is a function for generating a Gaussian Kernel.

    * Generate the square gaussian kernel. I think the pattern is a(0,0)=1 a(1,0) = n a(2,0) = n+2i with rows as a(2,1) = a(2,0) * n and adding two to the middle then subtracting.
    * However, there were only two examples on the page I found so do not use that without verification.
    * @param kernelMN    The m and n for our kernel matrix
    * @param sigma       The sigma to multiply by (kernel standard deviation)
    * @return            The resulting kernel matrix
  def generateGaussianKernel(kernelMN : Int, sigma : Double):Mat={

Sharpen with A Cutom Kernel

Applying a kernel in OpenCV can be done with the filter2D method.


Here a sharpening kernel using the function above is applied.

    * Sharpen an image with a standard sharpening kernel.
    * @param image    The image to sharpen
    * @return         A new and sharper image
  def sharpen(image : Image):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())

    val karr : Array[Array[Int]] = Array[Array[Int]](Array(0,-1,0),Array(-1,5,-1),Array(0,-1,0))
    val kernel : Mat = this.generateKernel(karr)
    new Image(new IplImage(outMat),,image.itype)


Contrast kicks up the color intensity in images by equation, equalization, or based on neighboring pixels.

One form of Contrast applies a direct function to an image:

    * Use an equation applied to the pixels to increase contrast. It appears that
    * the majority of the effect occurs from converting back and forth with a very
    * minor impact for the values. However, the impact is softer than with equalizing
    * histograms. Try sharpen as well. The kernel kicks up contrast around edges.
    * (maxIntensity/phi)*(x/(maxIntensity/theta))**0.5
    * @param image                The image to use
    * @param maxIntensity         The maximum intensity (numerator)
    * @param phi                  Phi value to use
    * @param theta                Theta value to use
    * @return
  def contrast(image : Image, maxIntensity : Double, phi : Double = 0.5, theta : Double = 0.5):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())

    val usrcMat = new Mat()
    val dest = new Mat(srcMat.rows(),srcMat.cols(),usrcMat.`type`())

    multiply(dest,(maxIntensity / phi))
    val fm = 1 / Math.pow(maxIntensity / theta,0.5)
    multiply(dest, fm)

    new Image(new IplImage(outMat),,image.itype)

Here the image is manipulated using matrix equations to form a new image where pixel intensities are improved for clarity.

Another form of contrast equalizes the image histogram:

* A form of contrast based around equalizing image histograms.
* @param image The image to equalize
* @return A new Image
def equalizeHistogram(image : Image):Image={
val srcMat = new Mat(image.image)
val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())
new Image(new IplImage(outMat),,image.itype)

The JavaCV method equalizeHist is used here.


Blurring uses averaging to dull images.

Gaussian blurring uses a Gaussian derived kernel to blur. This kernel uses an averaging function as opposed to equal weighting of neighboring pixels.

    * Perform a Gaussian blur. The larger the kernel the more blurred the image will be.
    * @param image              The image to use
    * @param degree             Strength of the blur
    * @param kernelMN           The kernel height and width should match (for instance 5x5)
    * @param sigma              The sigma to use in generating the matrix
    * @param depth              The depth to use
    * @param brightenFactor     A factor to brighten the result by with  0){
      outImage = this.brighten(outImage,brightenFactor)

A box blur uses a straight kernel to blur, often weighting pixels equally.

    * Perform a box blur and return a new Image. Increasing the factor has a significant impact.
    * This algorithm tends to be overly powerful. It wiped the lines out of my test image.
    * @param image   The Image object
    * @param depth   The depth to use with -1 as default corresponding to image.depth
    * @return        A new Image
  def boxBlur(image : Image,factor: Int = 1,depth : Int = -1):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())

    //build kernel
    val kernel : Mat = this.generateKernel(Array(Array(factor,factor,factor),Array(factor,factor,factor),Array(factor,factor,factor)))

    //apply kernel
    filter2D(srcMat,outMat, depth, kernel)

    new Image(new IplImage(outMat),,image.itype)

Unsharp Masking

Once a blurred Mat is achieved, it is possible to perform an unsharp mask. The unsharp mask brings out certain features by subtracting the blurred image from the original while taking into account an aditional factor.

def unsharpMask(image : Image, kernelMN : Int = 3, sigma : Double = 60,alpha : Double = 1.5, beta : Double= -0.5,gamma : Double = 2.0,brightenFactor : Int = 0):Image={
    val srcMat : Mat = new Mat(image.image)
    val outMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())
    val retMat = new Mat(srcMat.rows(),srcMat.cols(),srcMat.`type`())

    //using htese methods allows the matrix kernel size to grow
    GaussianBlur(srcMat,outMat,new Size(kernelMN,kernelMN),sigma)

    var outImage : Image = new Image(new IplImage(outMat),,image.itype)

    if(brightenFactor > 0){
      outImage = this.brighten(outImage,brightenFactor)



This article examined various image processing techniques.

JavaCV Basics: Splitting Objects

Here we put together functions from previous articles to describe a use case where objects are discovered in an image and rotated.

All code is available on GitHub under the GoatImage project.

Related Articles:

Why Split Objects

At times, objects need to be tracked reliably, OCR needs to be broken down to more manageable tasks, or there is another task requiring splitting and rotation. Particularly, recognition and other forms of statistical computing benefit from such standardization.

Splitting allows object by object recognition which may or may not improve accuracy depending on the data used to train an algorithm and even the type of algorithm used. Bayesian based networks, including RNNs, benefit from this task significantly.

Splitting and Rotating

The following function in GoatImage performs contouring to find objects, creates minimum area rect, and finally rotates objects based on their skew angle.

    * Split an image using an existing contouring function. Take each RIO, rotate, and return new Images with the original,
    * @param image              The image to split objects from
    * @param contourType        The contour type to use defaulting to CV_RETR_EXTERNAL
    * @param minBoxArea         Minumum box area to accept (-1 means everything and is default)
    * @param maxBoxArea         Maximum box area to accept (-1 means everything and is default)
    * @param show               Whether or not to show the image. Default is false.
    * @param xPosSort           Whether or not to sort the objects by their x position. Default is true. This is faster than a full sort
    * @return                   A tuple with the original Image and a List of split out Image objects named by the original_itemNumber
  def splitObjects(image : Image, contourType : Int=  CV_RETR_LIST,minBoxArea : Int = -1, maxBoxArea : Int = -1, show : Boolean= false,xPosSort : Boolean = true):(Image,List[(Image,BoundingBox)])={
    val imTup : (Image, List[BoundingBox]) = this.contour(image,contourType)

    var imObjs : List[(Image,BoundingBox)] = List[(Image,BoundingBox)]()

    var boxes : List[BoundingBox] = imTup._2

    //ensure that the boxes are sorted by x position
      boxes = boxes.sortBy(_.x1)

    if(minBoxArea > 0){
        boxes = boxes.filter({x => (x.width * x.height) > minBoxArea})

    if(maxBoxArea > 0){
      boxes = boxes.filter({x => (x.width * x.height) < maxBoxArea})

    //get and rotate objects
    var idx : Int = 0
    for(box <-  boxes){
      val im = this.rotateImage(box.image,box.skewAngle)
        im.showImage(s"My Box ${idx}")
      imObjs = imObjs :+ (im,box)
      idx += 1


Contours are filtered after sorting if desired. For each box, rotation is performed and the resulting image returned as a new Image.


Here the splitObjects function of GoatImage is reviewed, revealing how the library and OpenCV splits and rotates objects as part of standardization for object recognition and OCR.

JavaCV Basics: Cropping

The ROI code is  broken on the JavaCV example site. Here we will look at cropping an image by defining a region of interest. The remaining JavaCV example code should work.

All code is available on GitHub under the GoatImage project.

Related Articles:

Defining an ROI

Setting a Region of Interest (ROI) requires using the cvSetImageROI function which takes an IplImages and a Rect representing the region of interest.

cvSetImageROI(image, rect)

Putting it all Together By Cropping

Cropping takes our ROI and generates a new image fairly directly.

    * Crop an existing image.
    * @param image      The image to crop
    * @param x          The starting x coordinate
    * @param y          The starting y coordinate
    * @param width      The width
    * @param height     The height
    * @return           A new Image
  def crop(image : Image, x : Int, y : Int, width : Int, height : Int): Image={
    val rect = new CvRect(x,y,width,height)
    val uImage : IplImage = image.image.clone()
    cvSetImageROI(uImage, rect)
    new Image(cvCreateImage(cvGetSize(uImage),image.image.depth(),image.image.nChannels()),,image.itype)


Simple cropping was introduced to rectify an issue with the ROI example from JavaCV.

JavaCV Basics: Rotating

Rotating an image is a common task. This article reviews how to rotate a matrix using JavaCV.

These tutorials utilize GoatImage. The Image object used in the code examples comes from this library.

Related Articles:

Rotation Matrix

The rotation matrix  is used to map from one pixel position to another. The matrix, shown below, uses trigonometric functions.


Rotation is a linear transformation. A linear transformation uses a function to map from one matrix to another. In image processing, the matrix kernel is used to perform this mapping.

Rotation in JavaCV 3

Rotation in JavaCV 3 utilizes a generated rotation matrix and the warp affine function. The function getRotationMatrix2D generates a two dimensional matrix using a center Point2f, angle, and a scale.

    * Rotate an image by a specified angle using an affine transformation.
    * @param image      The image to rotate
    * @param angle      The angle to rotate by
    * @return           A rotated Image
  def rotateImage(image : Image,angle : Double):Image={
    val srcMat = new Mat(image.image)
    val outMat = new Mat(srcMat.cols(),srcMat.rows(),srcMat.`type`())
    val cosv = Math.cos(angle)
    val sinv = Math.sin(angle)
    val width = image.image.width
    val height = image.image.height
    val cx = width/2
    val cy = height/2

    //(image.image.width*cosv + image.image.height*sinv, image.image.width*sinv + image.image.height*cosv);
    val rotMat : Mat = getRotationMatrix2D(new Point2f(cx.toInt,cy.toInt),angle,1)

    new Image(new IplImage(outMat),,image.itype)

The Angle

The angle in OpenCV and thus JavaCV is centered at -45 degrees due to the use of vertical contouring. If an image is less than -45 degrees, adding 90 degrees will correct this offset.

val angle = if(minAreaRect.angle < -45.0) minAreaRect.angle + 90 else minAreaRect.angle


In this tutorial we reviewed the function in GoatImage for rotating images using OpenCv. The functions getRotationMatrix2D and warpAffine were introduced. Basic kernel processing was introduced as well.

JavaCV Basics: Contouring

Here, we look at contouring in JavaCV 3. Contouring discovers the boundaries in an image that stand out from the background. It is a key part of object tracking, rotation, and many other tasks. JavaCV and OpenCV allow for the creation of bounding boxes around objects discovered through contouring.

The tutorials utilizes GoatImage.  The Image object is from this library.

All code is available on GitHub under GoatImage.

Related Articles:


Remember contouring in Calc3. The same principal is used in image processing. Contour lines have a constant value. In imaging the value can be of a certain intensity or color. This differs from contouring a shape which uses values such as those obtained from the derivative of an equation.



JavaCV includes functions for finding contours. The MatVector is used to store the discovered contour lines. MatVector in JavaCV is used in place of the CvSeq from OpenCV.

 val chainType : Int = CHAIN_APPROX_SIMPLE //chaining explained below
 val contourType : Int = CV_RETR_LIST //return values described below 
 val contours = new MatVector()
 val hierarchy = new Mat()
 val contImage : IplImage = image.image.clone()
 findContours(new Mat(image.image.clone()), contours,contourType, chainType, new Point(0, 0))

Contour lines can be stored in the MatVector in several forms. These options are:

  • CV_RETR_EXTERNAL : Returns only external contours
  • CV_RETR_LIST : Returns a list of every contour including nested contours
  • CV_RETR_CCOMP : Returns contours organized by inner and outer contours
  • CV_RETR_TREE : Returns a hierarchical ordering of contours nested by tree level

Similarly, several forms of chaining are available with varying effects. Chaining defines the level of approximation used in estimating the points forming the contour line. Approximation types include:

  • CHAIN_APPROX_NONE : Store every point
  • CHAIN_APPROX_SIMPLE : Encode values by storing only the endpoints of an interval and coefficients for the line through the endpoints
  • CHAIN_APPROX_TC89_L1 : Use a variant of the Ten Chin algorithm
  • CHAIN_APPROX_TC89_KCOS : Another variant of Ten Chin

A paper is available describing the Ten Chin algorithm.

Draw Image Contours

JavaCV contains a method for drawing your contours. Specifically, drawContours may be used from opencv_imgproc.

    * A test function that draws the contours found in the Image.
    * @param image
    * @param contourType
    * @return
  def drawImageContours(image : Image, contourType : Int = CV_RETR_LIST):(Image,Image,Long)={
    val dstMat : Mat = new Mat(image.image.clone())
    val srcMat = new Mat(image.image)
    val storage : CvMemStorage =cvCreateMemStorage(0)
    val contours = new MatVector()
    val hierarchy = new Mat()
    findContours(new Mat(image.image.clone()), contours,contourType, CHAIN_APPROX_SIMPLE, new Point(0, 0))
    val colorDst = new Mat(srcMat.size(), CV_8UC3, new Scalar(0))
    drawContours(colorDst, contours, -1  , new Scalar(255,0,0,0))
    (image,new Image(new IplImage(colorDst),,image.itype),contours.size())

Bounding Boxes and Min Area Rectangles

In this tutorial, bounding boxes and minimum area rectangles are obtained through JavaCV specific functions.

The bounding box is an upright Rect. Rect is a JavaCV variant of the cvRect. The bottom left coordinate, width, height, and position are stored in the box. The center is calculable as x + width/2, y + height /2.

val idx : Int  = 0

The minimum bounding rectangle is a Rotated Rect which stores points relative to the actual bounding rectangle. This object has angle and point attributes in addition to relative x and y values and width and height.

val idx : Int = 0

To get a bounding circle, use the circle function ported from OpenCV.

Putting it all Together

The following code obtains contours and a list of bounding boxes.

    * Contour and return the bounding boxes and the slope of the found box. Boxes that overlap
    * will be combined to make a much bigger box.
    * @param image            The image to use in detection.
    * @param contourType      The type of contour to use by default grab only the external RETR_CCOMP returns and organizes inner and outer contours & RETR_LIST gives everything
    * @return           A tuple of the image and a list of tuples with bounding boxes and their bottom line slopes
  def contour(image : Image, contourType : Int =  CV_RETR_LIST):(Image,List[BoundingBox])={
    var minRects : List[BoundingBox] = List[BoundingBox]()
    val total : Int = 0
    val storage : CvMemStorage =cvCreateMemStorage(0)
    val contours = new MatVector()
    val hierarchy = new Mat()
    val contImage : IplImage = image.image.clone()
    findContours(new Mat(image.image.clone()), contours,contourType, CHAIN_APPROX_SIMPLE, new Point(0, 0))

    for(idx <- 0 until contours.size().toInt) {
      val clMat = new Mat()
      new Mat(image.image).copyTo(clMat)
      val clImage : IplImage = new IplImage(clMat)
      val boundRect = minAreaRect(contours.get(idx))
      val bbx = boundRect.boundingRect()
      val cropped : IplImage = cvCreateImage(cvGetSize(clImage),clImage.depth(),clImage.nChannels())
      val angle : Double = if(boundRect.angle() < -45.0) boundRect.angle + 90 else boundRect.angle()
      minRects = minRects :+ new BoundingBox(boundRect.boundingRect().x,boundRect.boundingRect().y,boundRect.boundingRect().width,boundRect.boundingRect().height,angle,new Image(cropped,s"${}_${idx}",image.itype))

    //setup bounding boxes


Here, we looked at contouring in JavaCV. JavaCV functions were used to find contours, draw them, and obtain bounding boxes.

Headless Testing and Scraping with Java FX

There is a lot of JavaScript in the world today and there is a need to get things moving quickly. Whether testing multiple websites or acquiring data for ETL and/or analysis, a tool needs to exist that does not leak memory as much as Selenium. Until recently, Selenium was really the only option for webkit, JCEF and writing native bindings for Chromium have been options for a while. Java 7 and Java 8 have stepped into the void with the JavaFX tools. These tools can be used to automate scraping and testing where network calls for HTML, Json, CSVs, pdfs, or what not are more tedious and difficult.

The FX Package

FX is much better than the television channel with some exceptions. Java created a sleeker version of Chromium based on webkit. While webkit suffers from some serious setbacks, Java FX also incorporates nearly any part of the framework. Setting SSL Handlers, proxies, and the like works the same as with Therefore, FX can be used to intercept traffic (e.g. directly stream images that are incoming to a file named by URL without making more network calls), present a nifty front end controlled by JavaScript and querying for components,


Ui4j is as equally nifty as the FX package. While FX is not capable of going headless without a lot of work, Ui4j takes the work out of such a project using Monocle or Xvfb. Unfortunately, there are some issues getting Monocle to run by setting -Dui4j.headless=true on command line or using system properties after jdk1.8.0_20. Oracle removed Monocle from the jdk after this release and forced the programs using the server to OpenMonocle. However, xvfb-run -a works equally well. The -a option automatically chooses a server number. The github site does claim compatibility with Monocle though.

On top of headless mode, the authors have made working with FX simple. Run JavaScript as needed, incorporate interceptors with ease, run javascript, and avoid nasty waitFor calls and Selanese (this is an entire language within your existing language).


There is an alternative to Ui4j in TestFX. It is geared towards testing. Rather than using an Assert after calling or with ((String) page.executeScript(“document.documentElement.innerHTML”)), methods such as verifyThat exist. Combine with Scala and have a wonderfully compact day. The authors have also managed to get a workaround for the Monocle problem.

Multiple Proxies

The only negative side effect of FX is that multiple instances must be run to use multiple proxies. Java and Scala for that matter set one proxy per JVM. Luckily, both Java and Scala have subprocess modules. The lovely data friendly language that is Scala makes this task as simple as Process(“java -jar myjar.jar -p my:proxy”).!. Simply run the command which returns the exit status and blocks until complete (see Futures to make this a better version of non-blocking) and use tools like Scopt to get the proxy and set it in a new Browser session. Better yet, take a look at my Scala macros article for some tips on loading code from a file (please don’t pass it as command line). RMI would probably be a bit better for large code but it may be possible to better secure a file than compiled code using checksums.


Throw out Selenium, get rid of the extra Selanese parsing and get Ui4J or TestFX for webkit testing. Sadly, it does not work with Gecko so Chromium is needed to replace these tests and obtain such terrific options as –ignore-certificate-errors. There are cases where fonts in the SSL will wreak havoc before you can even handle the incoming text no matter how low level you write your connections. For simple page pulls, stick to Apache HTTP Components which contains a fairly fast, somewhat mid-tier RAM usage asynchronous thread pool useable in Java or Scala. Sorry for the brevity folks but I tried to answer a question or two that was not in tutorials or documentation. Busy!

Mixing Java XML and Scala XML for Maximum Speed

Lets face it. ETL and data tasks are expected to be finished quickly. However, the amount of crud pushed through XML can be overwhelming. XML manipulation itself in java can take a while despite being relatively easy due to the need to loop through every attribute that one wished to create or read from. On the other hand, Scala, despite being a terrifically complete language based on the JVM suffers from some of the issues of a functional language. Only Scales XML offers a way to actually perform in place changes on XML. It is not, however, necessary to download an entirely new dependency on top of the existing Java and Scala libraries already available when using Scala.

Scala XML

There are definitive advantages to Scala XML. Imported under scala.xml._ and an other relevant subpackages, creation of XML is much faster and easier than with Java. All nodes and attributes are created as strings, making picking up Scala XML a breeze.

   val txtVal="myVal"
   val otherTxtVal="Hello World"
   val node = <tag attr={txtVal} attr2="other"> {otherTxtVal} </tag>

In order to generate child nodes, simply create an iterable with the nodes.

   val attrs:List[String] = List("1","2","3") //because we rarely know what the actual values are, these will be mapped
   val node = <property name="list"><list>{<value> li </value>)}</list></property> // a Spring list

Another extremely decent addition that brings speed to Scala XML parsing is the Pretty Printer under scala.xml.PrettyPrinter(width:int,spacer:int)

     val printer = new PrettyPrinter(250,5) //will create lines of width 250 and a tab space of 5 chars at each depth level
     val output = printer.format(xml)

There is one serious issue with Scala XML. Scala XML uses immutable lists. This means that nodes are difficult to append to. Rewrite
Rules and transformations
are available and particularly useful for filtering and deleting
but not necessarily good at appending.

   class Filter extends RewriteRule{
         override def transform(n:Node):Seq[Node] = n match{
            case (n \\ "bean[@id='myId']).length > 0 => NodeSeq.Empty //please bear with me as I am still learning Scala
            case n.Elem(prefix, label, attribs, scope, child @ _*) => NodeSeq.Empty
            case e:Elem => n
            case _ =>{
               throw new RuntimeExceptione

   val newXML = new RuleTransformer(new Filter()).transform(xml) //returns filtered xml

Such code can become cumbersome when writing a ton of appendage rules each with a special case. Looking at the example from the github page makes this abundantly clear. Another drawback is the speed impact a large number of cases can generate since XML is re-written with every iteration. >a href=””>Scales XML overcomes this burden but has a much less Scala-esque API and requires an additional download. Memory will be impacted as well as copies of the original information are stored.

Java Document

Unlike Scala XML, Java already has a built in in-place XML editor available after the release of Java 4. Importing this library is not outside the standard uses of Scala. Perhaps the Scala developers intended for their library to mainly be used in parsing. Hence, there is a mkString within the native library but a developer or programmer must download sbt with sbt.IO for a scala File System API that itself uses Java’s File class. There is also the lack of networking tools leading to third party tools and a Json library not quite excelling at re-writing Json in older Scala versions (non-existant in 2.11.0+). While it would be nice to see this change, there is no guarantee that it will. Instead, java fills in the gaps.

Here the javax.xml libraries come particularly in handy. The parsers and transformers can read and generate strings, making interoperability a breeze. Xpath expressions may not be as simple as in Scala where xml // “xpath” or xml / “xpath” will return the nodes but it is painless.

   import javax.xml.parsers._ //quick and dirty
   import org.w4c.dom._
   import javax.xml._    
   val xmlstr = <tag>My Text Value</tag>
   val doc:Document = DocumentBuilderFactory.newInstance.newDocumentBuilder.parse(new ByteArrayInputStream(xmlstr.getBytes))
   val nodes:NodeSet = xpath.compile("//tag").evaluate(doc,NodeConstants.NodeSet)

With a nodelist, one can getAttributes(), setAttribute(key,value), appendChild(node), insertBefore(nodea,nodeb) and call other useful methods, such as getNextSibling() or getPreviousSibling(), from the API. Just be careful to use the document element when making calls on the nodes where necessary.


Inter-operability between scala and java is quite simple. Use a builder with an input stream obtained from the string to read an XML document to Java and a transformer to read that XML back to something Scala enjoys.

//an example getting a string with Pretty Printer from an existing document
   val xmlstr = <tag>My Text Value</tag>
   val doc: Document = DocumentBuilderFactory.newInstance.newDocumentBuilder.parse(new ByteArrayInputStream(xmlstr.getBytes))
   val printer = new scala.xml.PrettyPrinter(250,5)
   val sw = new
   val xmlString = javax.xml.transform.TransformerFactory.newInstance().newTransformer().transform(new javax.xml.transform.dom.DOMSource(doc),new
   val xml = printer.format()

Mixing Scala and Java in Java Spring

Scala downright rocks at certain things. Java is great in others. How do you mix Scala into an existing Java Spring Maven program without causing trouble? With a few simple tricks that really don’t require much.

Scala uses the JVM and has a matching java collections compatibility. Using these it is possible to update nearly any Java Spring-maven program with Scala classes.

Dependencies for Maven

There is one simple dependency for this to work, the Scala Programming Language. Add the language to your POM as follows.


This will cause the scala language to be included with Maven at runtime.

Wiring your Class with XML

In discovering this, I am using XML due to the need to easily manipulate a class many times. Spring is basically acting as GUI, enforcer, and code reducer.

Class variables are considered as part of the constructor and must include the scala.beans dependency. All variables need the @BeanProperty to work. Furthermore, importing scala.collection.JavaConversions._ will give java collections scala like properties such as the ability to use .foreach().

    import scala.beans._
    import scala.collection.JavaConversions._
    class MyClass{
       var source:String = null
       var fromImports:String = null

This @BeanProperty annotation will create a getter and setter for Spring when the program builds. It is required even when using a constructor argument.

The following xml can be used to wire the bean.

      <property name="source" value="mySource"/>

If the variables are placed in a constructor, wire as such.

  <bean id="Notify" class="com.hygenics.parser.MyClass" parent="imports">
 	<constructor-arg name="source" value="mysource"/>

Morning Joe: Categorizing Text, Which Algorithms are Best in What Situations

There is a ton of buzz around data mining and, as always, many new names being injected into the same topics showing a lack of study. While the buzz exists, knowing when to deploy an algorithm can be tricky. Based on a deep dive into the subject recently and a dire need to program these algorithms in Java, I present a brief overview of tools with benchmarks and examples to hopefully follow later. Basic concepts are presented first followed by some algorithms that I have not really honed yet (benchmarks are not feasible for now but will be incredibly soon).

To read this thoroughly, I highly recommend following the links. This is a starting point and is also a summary of what I have found to now. Basically, the goal is to help avoid the hours it takes to find information on each and do some stumbling by reading this document.

A word of caution, effective matrix based solutions use a large amount of data. Other fuzzy algorithms exist for discovering relate-ability between small sets of items. For strings, there is distance matching such as Jaro-Winkler or Levenshtein with rule based comparisons and lookup tables to minimize error (say between Adama and Obama). Statistics can enhance this process if there is a need to take the best rating. Train a model for discovering whether the hypothesis that two entities distances makes them the same as opposed to the null hypothesis that it does not after filtering out some common issues.

The Matrix

Linear algebra is a critical foundation of text mining. Different elements are thought of as equations When we have different documents or images, each document or image is often considered to form an equation. This equation can then be presented in a matrix, a simple and easy way to get rid of thinking in really complex terms.

You may have heard of the vaulted differential equation. If not, some reading is in order from a site I used in college when there was not time for the book. The basis of a large portion of differential equations can be written in a matrix. This is important due to the eigen vector and eigen value. To be sure these concepts are crucial for solving matrices to find equations that explain a set of models. Drexel Universities eigenfaces tutorial provides a fairly solid understanding of the way a matrix is used in most text mining. However, similarity ratings are used to compare documents rather than a co-variance matrix for most tasks.

The end result of studying these methods is the ability to look under the hood at today’s hottest text mining technologies.

The Vector Space and TF-IDF

Understanding vectors and vector operations is another crucial step to understanding the mining process. A basic vector is a set of points representing a position in these planes . Vectors can be added, subtracted,multiplied, and, most importantly, stuffed in a matrix where their points can be used with basic linear algebra to find relationships.

Vectors have magnitude and distance. The angles and distances can be compared. Note that, while data loss may be present in finding the right magnitude and distance, the units used should be the same (it would be a terrible idea to think in terms of say millisecond-meter-documents-ice cream cones), it provides a sane basis for considering data. It is up to the miner to choose the most representative points for use in the vector.

In text mining, term frequency-inverse document frequency rating is used in many commercial algorithms including search engines. If the name is not enough, it is basically a frequency ratio based on the ratio of individual document . It works best on more than one document and an offset of 0.5 for term frequency helps offset the effect of large documents slightly by bumping up the rating. Inverse document frequency utilizes a logarithm function to ensure that the rating remains between 0 and 1.

Multiply the following equations together to find the result as described by Wikepedia:

Similarity Ratings

No matter what you do, similarity ratings are the key to making the process work. There are a several that can be used. If the data can be represented fairly well, co-variance is an option . However, text data is not that well suited to using co-variance. This is due to varying styles that represent the same object and, most importantly, issues with quantization. Natural language is naturally fuzzy. Therefore, cosines usually offers a much better solution.

This cosines equation takes the product of two vectors or the sum of two vectors and divides by the product of there normalized vectors or the sum of their normalized vectors. It follows from vector algebra. The result is an angle representing the ‘degree’ of similarity. This can be used for comparison.

Word Net, Disambiguation, and Stemming

The process of disambiguation and stemming are crucial to text mining. There are many sentence processing methods as NLTK shows. At their core is WordNet and other dictionaries. WordNet is a freely available graph of an english dictionary. Most tools work with WordNet for finding root words, disambiguation, and cleaning.

Part of Speech or POS tagging is involved in both disambiguation or stemming. Maximum entropy models are used to discover a part of speech based on common usage.

Disambiguation attempts to resolve words with multiple meanings to their most probable meaning. The worst algorithm is original Lesk but involves only the use of WordNet. Accuracy hovers around 50 percent. Simplified Lesk achieves better results. Lesk finds overlapping words and frequencies to determine the best synonym to replace an ambiguous word. The better algorithms try to use clustering bayes for word sense discovery. Cosines may be used to improve Lesk as well.

Stemming reduces words to their roots. Most WordNet tools use existing classifications with POS tagging to achieve this result.

A Note on Regression Models

Lets be certain, prediction is not well suited to categorization. Changes in word choice across a large number of documents and decisions on importance do not always mean the same thing. Therefore, regression models tend to work poorly. The data is not likely continuous as well. Think of writing like a magnetic field with eddy currents. Predicting the effect due to an encounter with these currents is really, really difficult. Basically, run into an eddy current and you are going to have a really, really bad day. That is not to say that an equation can be created that fits most of the data with respect to location of a point, basically a differential equation. It will likely not be generic and be incredibly difficult to find.

Regression works well on continuous and more natural events.

Classification Tree and a Random Forest

Another often poor performer in categorization of text data is the classification tree. They are as good as the number of rules you are willing to create. However, they may be combined with multinomial Bayes for writing that is uniform and professional (say a legal document) to achieve some success. They are particularly useful after filtering data using LSA/HDP or Multinomial Bayes with decisions that work like a bayesian model when thinking about the bigger picture.

Basically, a classification tree uses probabilities within groupings to ascertain an outcome and moving down to the appropriate left or right child node based on a yes or no response to the question do you belong?

This process works well with defined data when there is a good degree of knowledge about a subject (say gene mapping) but text mining often uses fuzzy data with multiple possible meanings and disambiguation is not entirely accurate, Lesks original algorithm only acheived 50 percent accuracy and an LSA model hovers between 80-90%. Improving the quality can be done with multiple trees or possibly by training off of an extremely large set using cosines instead of raw frequency.

There are multiple methods for tree creation, two are random forests and bagging. Bagging takes multiple trees and averages the probabilities for decisions, using this average for their respective nodes. Random forests find random subsets of features, find probabilities based on them, and select the stronger predictor for a node. The latter approach is best with a much larger set of known features. The number of forests is usually the square root of the total number of features.

Again, the features must be known and fairly homogeneous. Text data is often not.

Multinomial Bayesian Classifier

Multinomial Bayesian Classification is a method that classifies data based on the frequency of words in different categories and their probabilities of occurrence. It is fairly straightforward, find the frequencies or train a set of frequencies on a word by word or gram by gram (a gram being an n-pairing and thus n-gram of words), find probabilities by sentence, take the best one.

MNB works well when writing differs starkly, say with subject matters that differ greatly. It is good for tasks such as separating spam from policies and code in html data when large amounts of training data are present.

Clustering with LSA or HDP

Clustering works well when something is known about the data but categorization is not well done manually. Most algorithms avoid affinity propagation which usually uses the square root of total inputs as the number of clusters any way. Matrices are used heavily here as eigen values and eigen vectors derive an equation that can be used to find relate-ability between documents.

LSA uses raw frequencies or more effectively cosines in the same manner as eigen faces to compare vectors. The end result, however, is an equation representing a category. By matrix inversion and multiplication, all elements are compared. In this case each ij entry in the matrix is a cosine or frequency of a word in a document. HDP (hierarchical direchlet process) is similar but attempts to learn more about the results and improve on the process. It takes much longer than LSA and is experiemental.

If trying to discover new information about text or trying to find the best fit of a number of categories, these methods are useful.

Maximum Entropy Models

Maximum entropy models work well on heterogenous data in a manner similar to Bayes. Gensims sentence tagger classifies sentences from non-sentences in this way. The models find entropy using the maxent principle which uses frequencies and likelihood of occurrence to find outcomes. It works quite well with the correct training sets.

If conditional independence is not assumable and nothing is known about a set, this model is useful. Categories should be known beforehand.




Common Resources

Secure Your Data Series: Protecting from SQL Injection

I have been collecting data and building databases for a while. As I do so, I have come across a wide variety of mistakes in thought and code varying from the notion that a captcha works to a failure to adequately protect from SQL injection. Therefore, I am going to publish fixes to what I find here with examples utilizing Groovy or Spring and Java.

First up, the SQL injection attack. Attempts to stop these attacks range from  using JavaScript to setting variables. I recently came across a site that made an attempt to stop these attacks by limiting wildcards and querying with a variable. However, changing the header by resetting the variable actually gave me access to everything on the site in one fell swoop.


A few suggestions follow:

  • Do not check anything with JavaScript, this is too easy to get around by simply rewriting the page.
  • Write rules in the back end to protect from unauthorized access using SQL. Check variable combinations.
  • Eliminate any suspicious non-allowable wildcards and look for key terms like LIKE, ILIKE,%, and = that are common for WHERE clauses. Select statements in SQL/POSTGRESQL follow a form similar to SELECT –data list– FROM –table– WHERE –variable— ILIKE ‘%answer%’
  • Track IP Addresses and web pages using a symbol table (HashMap and the like) to eradicate any attempts to plainly just post to a server when it is not an API. This should be done in addition to any viewstate tracking and I strongly encourage the use of event validation.
  • Use event validation if available or come up with your own.


Most requests occur through a POST request and most of the requests are handled in Java using spring or a format such as aspx with Groovy Grails starting to become popular.

import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Controller;
import org.springframework.web.context.request.RequestAttributes;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;

public class PostController{
      //RequestMethod.GET also exists
      public ResponseEntity newAnswer(@RequestParam(value="stringq", required=true) String stringq,@RequestParam(value="",required=true) boolean wildcard){
           //prep response
           HttPHeaders headers=new HttpHeaders();
           headers.set("Content-Type","text/html; charset=utf-8");
           //get IP
           String ip=requestAttributes.getRequest().getRemoteAddr();

           String data=null;           

           //check for appropriate response page, assumes post-like nature with single url
           //makes a call to a HashMap entity setup at the beginning of the program
           // and accessible throughout the servers existance
               //create variable data here with response JSON or string
               //create innappropriate use message
           return new ResponseEntity(data, headers, HttpStatus.CREATED);

The hashmap in this case is a static hash map. However, for security (even if using package level access modifiers) a file may be the solution. This will add time to the response though.