Mixing Java XML and Scala XML for Maximum Speed

Lets face it. ETL and data tasks are expected to be finished quickly. However, the amount of crud pushed through XML can be overwhelming. XML manipulation itself in java can take a while despite being relatively easy due to the need to loop through every attribute that one wished to create or read from. On the other hand, Scala, despite being a terrifically complete language based on the JVM suffers from some of the issues of a functional language. Only Scales XML offers a way to actually perform in place changes on XML. It is not, however, necessary to download an entirely new dependency on top of the existing Java and Scala libraries already available when using Scala.

Scala XML

There are definitive advantages to Scala XML. Imported under scala.xml._ and an other relevant subpackages, creation of XML is much faster and easier than with Java. All nodes and attributes are created as strings, making picking up Scala XML a breeze.

   
   val txtVal="myVal"
   val otherTxtVal="Hello World"
   val node = <tag attr={txtVal} attr2="other"> {otherTxtVal} </tag>

In order to generate child nodes, simply create an iterable with the nodes.

   val attrs:List[String] = List("1","2","3") //because we rarely know what the actual values are, these will be mapped
   val node = <property name="list"><list>{attrs.map(<value> li </value>)}</list></property> // a Spring list

Another extremely decent addition that brings speed to Scala XML parsing is the Pretty Printer under scala.xml.PrettyPrinter(width:int,spacer:int)

     val printer = new PrettyPrinter(250,5) //will create lines of width 250 and a tab space of 5 chars at each depth level
     val output = printer.format(xml)

There is one serious issue with Scala XML. Scala XML uses immutable lists. This means that nodes are difficult to append to. Rewrite
Rules and transformations
are available and particularly useful for filtering and deleting
but not necessarily good at appending.

   class Filter extends RewriteRule{
         override def transform(n:Node):Seq[Node] = n match{
            case (n \\ "bean[@id='myId']).length > 0 => NodeSeq.Empty //please bear with me as I am still learning Scala
            case n.Elem(prefix, label, attribs, scope, child @ _*) => NodeSeq.Empty
            case e:Elem => n
            case _ =>{
               println("Unrecognized")
               throw new RuntimeExceptione
            }
         }
   }

   val newXML = new RuleTransformer(new Filter()).transform(xml) //returns filtered xml

Such code can become cumbersome when writing a ton of appendage rules each with a special case. Looking at the example from the github page makes this abundantly clear. Another drawback is the speed impact a large number of cases can generate since XML is re-written with every iteration. >a href=”https://github.com/chris-twiner/scalesXml”>Scales XML overcomes this burden but has a much less Scala-esque API and requires an additional download. Memory will be impacted as well as copies of the original information are stored.

Java Document

Unlike Scala XML, Java already has a built in in-place XML editor available after the release of Java 4. Importing this library is not outside the standard uses of Scala. Perhaps the Scala developers intended for their library to mainly be used in parsing. Hence, there is a mkString within the native library but a developer or programmer must download sbt with sbt.IO for a scala File System API that itself uses Java’s File class. There is also the lack of networking tools leading to third party tools and a Json library not quite excelling at re-writing Json in older Scala versions (non-existant in 2.11.0+). While it would be nice to see this change, there is no guarantee that it will. Instead, java fills in the gaps.

Here the javax.xml libraries come particularly in handy. The parsers and transformers can read and generate strings, making interoperability a breeze. Xpath expressions may not be as simple as in Scala where xml // “xpath” or xml / “xpath” will return the nodes but it is painless.

   import javax.xml.parsers._ //quick and dirty
   import org.w4c.dom._
   import javax.xml._    
   val xmlstr = <tag>My Text Value</tag>
   val doc:Document = DocumentBuilderFactory.newInstance.newDocumentBuilder.parse(new ByteArrayInputStream(xmlstr.getBytes))
   val nodes:NodeSet = xpath.compile("//tag").evaluate(doc,NodeConstants.NodeSet)

With a nodelist, one can getAttributes(), setAttribute(key,value), appendChild(node), insertBefore(nodea,nodeb) and call other useful methods, such as getNextSibling() or getPreviousSibling(), from the API. Just be careful to use the document element when making calls on the nodes where necessary.

Inter-operability

Inter-operability between scala and java is quite simple. Use a builder with an input stream obtained from the string to read an XML document to Java and a transformer to read that XML back to something Scala enjoys.

//an example getting a string with Pretty Printer from an existing document
   val xmlstr = <tag>My Text Value</tag>
   val doc: Document = DocumentBuilderFactory.newInstance.newDocumentBuilder.parse(new ByteArrayInputStream(xmlstr.getBytes))
   val printer = new scala.xml.PrettyPrinter(250,5)
   val sw = new java.io.StringWriter()
   val xmlString = javax.xml.transform.TransformerFactory.newInstance().newTransformer().transform(new javax.xml.transform.dom.DOMSource(doc),new javax.xml.transform.stream.StreamResult(sw))
   val xml = printer.format()
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s