Avoiding Duplication Issues in SBT

It goes without saying that any series on sbt and sbt assembly needs to also have a small section on avoiding the dreaded deduplication issue.

This article reviews how to specify merge strategies in sbt-assembly, as described on the sbt-assembly GitHub page, and examines PathList in more depth.

Merge Strategies

When building a fat JAR in sbt assembly, it is common to run into the following error:

[error] (*:assembly) deduplicate: different file contents found in the following:

This error precedes a list of the files with duplication issues.

The build.sbt file offers a way to avoid this error via the merge strategy. Using the error output, it is possible to choose an appropriate strategy to deal with duplication issues in assembly:

assemblyMergeStrategy in assembly := {
  // exact names only match entries at the root of the JAR
  case "Logger.scala" => MergeStrategy.first
  case "levels.scala" => MergeStrategy.first
  case "Tidier.scala" => MergeStrategy.first
  case "logback.xml" => MergeStrategy.first
  case "LogFilter.class" => MergeStrategy.first
  // PathList patterns see the full path; ps.last is the file name itself
  case PathList(ps @ _*) if ps.last startsWith "LogFilter" => MergeStrategy.first
  case PathList(ps @ _*) if ps.last startsWith "Logger" => MergeStrategy.first
  case PathList(ps @ _*) if ps.last startsWith "Tidier" => MergeStrategy.first
  case PathList(ps @ _*) if ps.last startsWith "FastDate" => MergeStrategy.first
  // fall back to the default strategy for everything else
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

In this instance, the first file discovered (the first listed in the sbt error log) is chosen. PathList captures the entire path, with last selecting the final segment of that path.

A file name may be matched directly.

PathList

sbt-assembly's merge strategies make use of PathList. The full object is quite small:

object PathList {
  private val sysFileSep = System.getProperty("file.separator")
  def unapplySeq(path: String): Option[Seq[String]] = {
    val split = path.split(if (sysFileSep.equals( """\""")) """\\""" else sysFileSep)
    if (split.size == 0) None
    else Some(split.toList)
  }
}

This code splits a path on the system file separator ("/" on Unix-like systems, "\" on Windows). The segments are returned as a list of strings (wrapped in an Option, as unapplySeq requires).

List has some useful pattern-matching properties in Scala. For instance, it is possible to match anything under javax.servlet.* using:

PathList("javax", "servlet", xs @ _*) 

xs @ _* matches any remaining path segments after javax/servlet.
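
For example, the same pattern can drive a merge strategy directly. The following sketch, adapted from the sbt-assembly documentation, keeps the first copy of anything under javax/servlet and of any HTML file, and defers everything else to the default strategy:

assemblyMergeStrategy in assembly := {
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first          // anything under javax/servlet
  case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first  // any HTML file, wherever it sits
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}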

Conclusion

This article reviewed the basics of merge strategies in sbt-assembly, with a further explanation of PathList.

An Introduction to Using Spring With Scala: A Positive View with Tips

Many ask why mix Spring and Scala. Why not?

Scala is a resilient language, and the move to the Scala foundation has only made the language and its interaction with Java stronger. Scala reduces many lines of Java clutter to far fewer, significantly more readable lines of elegant code. It is faster than Python, is acquiring comparable serialization capability, and already has a much stronger functional side.

Spring is the powerful go-to tool for any Java programmer, abstracting everything from web app security and email to wiring a backend. Together, Scala 2.12+ and Spring make a potent duo.

This article examines a few key traits for those using Scala 2.12+ with Spring 3+.

Some of the Benefits of Mixing

Don’t reinvent the wheel in a language that can already use your favorite Java libraries.

Scala mixed with Spring:

  • Eliminates lines of Java code
  • Offers more functional power than Javaslang
  • Makes use of most, if not all, of Spring’s functionality from Scala
  • Places powerful, streamlined threading capability in the hands of the programmer
  • Provides much broader serialization capability than Java

This is a non-exhaustive list of benefits.

When Dependency Injection Matters

Dependency injection is useful in many situations. At my workplace, I have developed tools that reduce thousands of lines of code to a few configuration scripts using Scala and Spring.

Dependency injection is useful when writing large amounts of code hinders productivity. It may be less useful when speed is the primary concern.

Annotation Configs

Every annotation in Spring works with Scala. @Service, @Controller and the remaining stereotypes, @Autowired, and all of the other popular annotations are usable.

Using them in Scala is the same as using them in Java.

@Service
class QAController{
   ....
}

scala.beans

Unfortunately, Scala does not generate JavaBean-style getters and setters on its own. It is therefore necessary to use the specialized @BeanProperty annotation from scala.beans. This annotation cannot be attached to a private variable.

@BeanProperty
var conf: String = null  // a var, so that both a getter and a setter are generated

To generate a boolean getter and setter, @BooleanBeanProperty should be used instead.

@BooleanBeanProperty
var isWorking: Boolean = false

Scala’s beans package contains other useful annotations that give some control over how accessors are generated.
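
As a rough sketch of what the annotation does, @BeanProperty on a var makes the compiler emit JavaBean-style accessors alongside Scala’s usual ones, which is what Spring’s property handling expects (the AppConf class here is purely illustrative):

import scala.beans.BeanProperty

class AppConf {
  @BeanProperty
  var conf: String = null
  // the compiler generates, roughly:
  //   def getConf(): String = conf
  //   def setConf(value: String): Unit = { conf = value }
}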

Autowiring

Autowiring does not require jumping through hoops. The underlying principle is the same as when using Java. It is only necessary to combine @BeanProperty with @Autowired.

@BeanProperty
@Autowired(required = false)
var tableConf: TableConfigurator = null  // a var, so Spring can inject into the field

Here, the autowired tableConf property is not required.

Configuration Classes

The XML context in Spring is increasingly discouraged in favor of annotation-based configuration. To write code that will last, it is better to use a configuration class. Scala works seamlessly with the @Configuration component.

@Configuration
class AppConfig {
  @Bean
  def getAnomalyConfigurator(): AnomalyConfigurator = {
    new AnomalyConfigurator {
      override val maxConcurrent: Int = 3
      override val errorAcceptanceCriteria: Double = 5.0
      override val columLevelStatsTable: String = "test"
      override val maxHardDifference: Int = 100
      override val schemaLevelStatsTable: String = "test"
      override val runId: Int = 1
    }
  }

  @Bean
  def getQAController(): QAController = {
    new QAController
  }
}

As with Java, the configuration generates beans. There is no difference in usage between the languages.

Create Classes without Defining Them

One of the more interesting features of Java is the ability to define an anonymous class as needed. Scala can do this as well.

new AnomalyConfigurator {
  ... // properties to use
}

This feature is useful when creating configuration classes whose traits take advantage of Spring.

Creating a Context

Creating contexts is the same in Scala as in Java. The difference is that classOf[] must be used in place of the .class property to obtain class names.

val context: AnnotationConfigApplicationContext = new AnnotationConfigApplicationContext()
context.register(classOf[AppConfig])
context.refresh()
val controller: QAController = context.getBean(classOf[QAController])

Conclusion

Scala 2.12+ works seamlessly with Spring. The requisite annotations and tools are all available in the language, which compiles to comparable Java byte code.

Code for this article is available on GitHub.

Connecting Postgresql to an External Drive in Ubuntu

It finally happened. In looking for a new way to store and move data between computers and organizations securely and at low cost, I needed an external drive. That data is a ton of business information stored in jsonb format in PostgreSQL. As it turns out, mounting an NTFS drive in Linux with the proper read/write permissions is not a simple task. Creating an external tablespace, however, is.

Mounting a Drive

Mounting a drive with read and write permissions takes a bit of work. First, format the drive as NTFS if it is not already using this type of file system; it is probably the most cross-platform of the file system standards. The drive should mount at this point. If it does not, you may have another issue. Use the blkid command to get the drive’s UUID and then edit the fstab file.

sudo gedit /etc/fstab

Inside the file, add a line mounting the drive at a new folder under /media/[user]/[new_drive_name].

#my new drive with Read and Write Permissions
UUID=#######FROMABOVE##     /media/[user]/[new_drive_name] ntfs auto,users,permissions 0 0

After this, create the mount-point folder at that path. Then unplug and replug your external drive; it should now appear under [new_drive_name] from above. Finally, grant the correct permissions (755 recommended). Change to the /media/[user] directory first if you are not already there.

sudo chmod -R 755 [new_drive_name]
sudo chown -R postgres [new_drive_name]

That should do it for setting up the drive. It looks easy now, but when you don’t know what to do, it is the daunting part.

Create a Tablespace and Database

Finally, create the tablespace and database, and you should be able to log in to the database with the appropriate user. Done!

CREATE TABLESPACE [tablespace_name] OWNER [user] LOCATION '/media/[user]/[new_drive_name]';
CREATE DATABASE [name] OWNER [user] TABLESPACE [tablespace_name];

Don’t forget to set default_tablespace = [tablespace_name] (or use ALTER DATABASE [name] SET default_tablespace = [tablespace_name]) so that new tables use the tablespace on the drive.

You should now be able to dump to the database, add and delete tables, and more. It is probably not a good idea to use anything other than an HDD, as I am not sure of the read/write volume flash can handle. Also, dumps are better than direct writes for speed, but you can now tote your data around air-gapped, encrypted, and in jsonb/json/hstore format using what is a top-tier database. PostgreSQL’s NoSQL is better than MongoDB, according to EnterpriseDB.

Caveat

There may be a side effect where the drive needs to be attached for Xenial to boot. I am not sure yet, but attaching it solved my startup issue. Use this with caution.

Scala’s Hidden Benefits

Scala should not be a fad, but it should not be the only language in your toolkit either. Java is fast(-ish), Python is dynamic, and Scala is to Java what Python is to C: simply easier and faster to write. Companies are moving away from Scala, but imagine writing in Groovy some of the code that is written in Scala, and a nightmare occurs. Yes, GoLang exists now, but GoLang cannot work directly with Java in the same project and is somewhat limited and at times slower than Java and Scala.

Scala maintains several fairly direct benefits. It works well as an API layer over code such as Java, it is terrific at condensing data manipulation, and it has much better syntax for concurrency, allowing far more straightforward behaviour than anything written in Java. One only needs to look at Scala Swing to understand the immense productivity benefits of Scala on these levels.

My Use Case

My use case is seemingly simple but actually quite complex. Scala’s niche is data. It is perfect for ETL and for using Java networking tools to pull in data. However, creating a large amount of Scala code is a bad idea; every line of Scala expands into several lines of Java. Therefore, in creating an acquisition and ETL system for the company Hygenics Data LLC, the decision was made to write concurrent code in Scala, an API in Scala, ETL in Scala, and most everything else in Java.

This left the majority of the code for the Goat ETL toolset to be written in Java. Goats eat everything, and so does the system, so the name is perfect. That includes interaction with Pentaho, streaming, basic classes for an API manager, a basic browser written with Apache HttpComponents and Rhino, and much more. However, concurrent tasks and their managers were written in Scala. The result was an incredibly fluent, fast system that runs extremely lightly and significantly improves speed over Python. The boost was not insignificant, and the extra management that can plague Java concurrency and generate multitudes of bugs and bottlenecks disappeared. Java’s CountDownLatch and other recent improvements are terrific but just don’t carry the simplicity and manageability of Scala.

ETL code was written in Scala and achieved an extreme degree of flexibility with minimal code and improved error handling. Where Java only recently incorporated something akin to a Try or an Option, Scala was practically built on these ideas. Validators, mapping, filtering, reduction, and other tasks are much simpler in Scala.

The API was written in Scala entirely. The result was a set of configurable structures that were easy to write and reduced code by as much as fifty percent or more.

Significantly, Scala makes enhancing existing classes simple via the enhancement pattern and offers many benefits in generics, implicits, and other features that Java simply does not have. Ask a Java-only programmer what invariance is.

Regarding the advantages over Python:

  • The Scala/Java program ran much faster
  • The Scala/Java program could make use of threads without GIL interference
  • The Scala/Java program has tools such as Spark, Akka, and Pentaho at its disposal
  • The Apache networking tools allowed a much lower level of interaction
  • Scala’s concurrency tools are much better developed than Python’s

The result is visible on my GitHub page.

Mixing Code

The mix is actually intuitive and simple in both IntelliJ and Eclipse, where projects generate both sets of source folders (see the layout sketch below). While many may try to avoid mixing code, anyone who programs Java should be able to pick up the Scala code fairly easily. With a quick search and a bit of time, proficiency in both languages is feasible. It has always been my opinion that anyone should be able to extrapolate between languages and tools, or even from languages to tools (e.g. distribution in Carte). The concepts that build a tool or language are not infinite.
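
For reference, a mixed project simply keeps the two languages in sibling source roots (the standard sbt and Maven layout), so nothing special is needed to compile them together:

src/
  main/
    java/    <- Java sources (the bulk of the Goat ETL code)
    scala/   <- Scala sources (concurrency, ETL, and API layers)
  test/
    java/
    scala/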

Interestingly, it is quite easy to mix Java Spring into Scala as my ProxyManager shows. Since Scala compiles to Java byte code, it was possible to use Spring from Scala, reducing reliance on the clunky and relatively low quality framework that is Play.

Spark is Just About Dead

At least in its current form, Spark is built on an outdated architecture. That is not to say it is dead, since a stated goal is better iterative single-node processing. In 2008, when I first heard about distributed systems like Hadoop from a relative working at IBM, machines were pitiful. 64 GB was a decent hard disk size, let alone a pipe dream for RAM. RAID was a few years from becoming a common term. Storage was a big part of the tech research game. Now, a common desktop box can have as much as 128 GB of RAM and a RAID drive. What does that have to do with Spark? Spark relies on reshuffling data with a Hadoop backbone built for those puny devices of the 2000s.

Network Latency

The major issue here is network latency. A few years ago, expanding calculations to large data sets meant accepting network latency as an unavoidable problem. We would like the rate of convergence to dictate speed, but with bus speeds, processor speeds, network speeds, and RAM size being major considerations in practice, that ideal is never the case. Network latency with Spark’s shuffling system is horrific, and it is an issue that tools like TensorFlow seek to address. A small test in a previous article, running calculations on a 10,000 x 10,000 matrix with NumPy, BIDMach, Spark, and some other systems, revealed just how bad this latency becomes. NumPy handily beat the distributed system, which needed to reshuffle data, pull data in from other sources, perform the calculations, and push the results back to the master.

Thanks to Moore’s law and companies such as Intel and Nvidia, whole data sets that had to be distributed not long ago now fit easily in memory. Terabyte-size hard drives are available to the commoner and, again, 128 GB of RAM with a RAID 5 is no longer a capability reserved for top-of-the-line motherboards.

A Future in Better-Grouped Sets of Tensors

These large boxes are not the death knell for distributed systems in general. However, distributed systems need to make better use of the hardware stack in today’s world.

Tensors are like vectors, but more generalized: think of a multi-dimensional, NumPy-like array that grows and shrinks. A system with the following features, including the use of tensors, would greatly improve modern distributed systems beyond what hardware such as GPUs alone can achieve.

  • Grouping of tensors based on how commonly they are used together (by pid, task id, another factor, or a predictive construct)
  • User-specified data structures for marking initial groups of calculations that belong together
  • Distribution of commonly grouped tensors so that latency between machines holding data in a group is reduced, or so that the tensors reside on a single system
  • Support for memory systems like a SAN or RAID
  • Better use of the resources of large boxes

Think of tomorrow’s system as a distribution of these tensors, somewhat like TensorFlow but sitting between TensorFlow and Spark, capable of splitting equations and performing operations on extremely large sets while making the most of today’s beefy hardware.

Remember, technology grows. Today’s hot system is tomorrow’s Commodore.

Demystifying Sparse Matrix Generation for NLP

Hashing: it’s simple, it’s powerful, and when it is reasonably unique it can shave time off generating sparse matrices from text. There are multiple ways to build a sparse matrix, but do you really know how scipy arrives at its fast solution, or how the big companies develop matrices for NLP? The simple hashing trick has proven immensely powerful and quite fast at aiding sparse matrix generation. Check out my GitHub for a sample, where I also have C99 and TextTiling tokenizer algorithms written.

The Sparse Matrix

Sparse matrices are great tools for data with large numbers of dimensions but many, many gaps. When an m x n matrix would become quite large (millions of data points or more) and yet coverage is much smaller (say 50 percent, even), a sparse matrix can fit everything into memory and allow testing of new or improved algorithms without blowing memory or needing to deploy to a distributed framework. This is also useful when dealing with results from frameworks like Spark or Hadoop.

Sparse matrices are composed of index and data pairings. Two common forms are Compressed Sparse Row (CSR) and Compressed Sparse Column (CSC) matrices, which store an index array pointing to where vectors (rows or columns) end and the data array that they point to.
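
As a small illustrative sketch, independent of any particular library, here is a 3 x 4 matrix and its CSR arrays in Scala:

// dense matrix:        CSR form:
//   0 0 5 0            data holds the non-zero values, row by row
//   2 0 0 0            indices holds the column of each value
//   0 3 0 4            indptr marks where each row ends in data
val data    = Array(5.0, 2.0, 3.0, 4.0)
val indices = Array(2, 0, 1, 3)
val indptr  = Array(0, 1, 2, 4)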

Which Hash for NLP

Text hashes such as Horner’s method have been around for some time. For a matrix, we want to limit collisions. That is no small feat, since any hash into a fixed number of buckets is guaranteed to produce collisions eventually. Keeping the number of attributes large is helpful but not a complete solution. It is also important to use a hash that spreads values fairly uniformly.
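
As a quick sketch, a Horner-style hash simply folds each character into a running polynomial; the multiplier 31 below is the customary illustrative constant:

// h_i = h_(i-1) * prime + char_i, evaluated left to right over the string
def hornerHash(s: String, prime: Int = 31): Int =
  s.foldLeft(0)((h, c) => h * prime + c)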

In my TextPreprocessor software on GitHub, I used MurmurHash. The current MurmurHash3 algorithm has 32-bit and 128-bit variants, including a 64-bit-optimized one, and a 32-bit implementation ships with Scala under scala.util.hashing. MurmurHash, with its large output and its ability to affect the low end of the digit range, combined with filtering, helps generate a fairly unique hash. Scikit-learn uses a similar variant.
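
For instance, the 32-bit string hash in the standard library can map a token straight to a column index; the feature count below is an arbitrary illustrative choice:

import scala.util.hashing.MurmurHash3

val numFeatures = 1 << 18                  // number of columns in the sparse matrix
def featureIndex(token: String): Int =
  math.abs(MurmurHash3.stringHash(token) % numFeatures)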

Even MurmurHash may not always be enough. A single-bit signed hashing function has proven effective at mitigating the effect of collisions: if the bit is 1, add 1; if not, subtract 1. Using a modulus or another function may prove useful, but testing is needed. In any case, the expected contribution of colliding terms is now 0 for each column, since there is a 50 percent chance of adding and a 50 percent chance of subtracting.
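
A minimal sketch of that signed, single-bit trick, assuming a second, differently seeded hash picks the sign, might look like this:

import scala.util.hashing.MurmurHash3

def signedIncrement(counts: Array[Double], token: String, numFeatures: Int): Unit = {
  val idx  = math.abs(MurmurHash3.stringHash(token) % numFeatures)
  val sign = if ((MurmurHash3.stringHash(token, 0x5f3759df) & 1) == 1) 1.0 else -1.0 // arbitrary seed
  counts(idx) += sign  // colliding tokens now cancel out in expectation
}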

Building with the Hashing Trick

Building the matrix is fairly straightforward. First, generate an index array and a row (or column) array.

Then, get the feature counts. Take in text, split it into sentences and then each sentence into words, and for each line hash the words and count their occurrences. Each sentence is a vector. It is highly advisable to remove the most frequent words, by which I mean those with an occurrence 1.5 or more standard deviations beyond the typical occurrence, and to remove common stop words like “a”, “the”, most conjunctions, and some others. Stemming is another powerful tool, which removes endings like -ing. It is possible to use the power of multiple cores by returning a map of hash-to-count key-value pairs. From here, simply iterate line by line, append an entry to the index array equal to array[n - 1] + count if n > 0 (or array[0] = count if n = 0), and append the features to the data array. If the maximum number of features per line is known, the lines can be added in a synchronized method. Again, this is all on my GitHub.

val future = getFeatures(getSentences(doc).map({x => x.splitWords(x)})).flatMap({x => addToMatrix(x)})

The reason I mention splitting the task is that it can be faster, especially when using a tool like Akka. It would also be advisable to assign a number to each document and insert each row based on that number, so that it is possible to track which English-language sentence it belongs to.
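
Putting the pieces together, a rough sketch of the row-building step (not the exact TextPreprocessor code) might look like the following; stop-word removal, stemming, and frequency filtering are assumed to have already happened:

import scala.collection.mutable.ArrayBuffer
import scala.util.hashing.MurmurHash3

val numFeatures = 1 << 18
val indptr  = ArrayBuffer(0)            // where each sentence (row) ends in the data array
val indices = ArrayBuffer.empty[Int]    // hashed column index of each stored count
val data    = ArrayBuffer.empty[Double] // the counts themselves

def addSentence(words: Seq[String]): Unit = {
  val counts = words
    .map(w => math.abs(MurmurHash3.stringHash(w) % numFeatures)) // hash each word to a column
    .groupBy(identity)
    .map { case (col, hits) => col -> hits.size.toDouble }       // occurrences per column
  counts.toSeq.sortBy(_._1).foreach { case (col, count) =>
    indices += col
    data    += count
  }
  indptr += indices.size // the new index entry is the previous entry plus this row's feature count
}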

Thanks for sticking with me as I explore a simple yet potent concept.

Sports, Running, and Big Data

After aggravating an old knee injury, I have started to get into running. It really isn’t the kind of injury that prevents running, like a torn ACL, but merely a loss of the connecting tendons in my left knee. Realizing, like everyone before me, that sports are a perfect place for data science, a thought dawned on me. Can I make myself faster using data? Is there an optimal set of parameters, even if I need to use dimensionality reduction techniques on too many factors, that can tell me the ideal BMI, the best training regimen, and more? I have nowhere near enough data for this thought experiment, and I may never, but at least we can try. Even lacking data, I will see if I can at least get a regression analysis (the worst kind, where life and genetic factors play such a role) in just a year or so.

The Data Points

Sports are physical, and so are the data points, to state the really obvious. Running is terrain based, another obvious point. However, what points do I need to capture? This is the first question in the great science of data.

I need to understand course types, elevation gains and losses with average grade, percent trail, percent path, percent road, average altitude, peak altitude, minimum altitude, and mileage for terrain values. This list may grow or shrink, but LSA and other dimensionality reduction techniques exist.

I also need to understand the physical side: Body Mass Index (BMI), average caloric intake, and supplement use, possibly down to chemical composition. Lifting regimens for calves, hamstrings, quads, and core are also important.

Weather is a factor in any sort of outdoor activity. Average temperature for runs during a period and temperatures at different points along a course matter as well.

Finally, training is important. The same terrain values should be accounted for. Also, the average long run at different time periods before a course, the average short runs for those time periods, and the average run length for those periods need to be taken. Average pace times and mileage are also important and should further be taken at snapshots in time.

Data that are difficult to obtain but may be useful include medical history, past experience with the sport (in this case running), and periods of inactivity.

Technically, anything that could have an effect on the outcome should be recorded and placed into vectors for further analysis.

The Algorithms

Data science algorithms are many. Where there are hidden patterns and the goal is prediction, neural nets are a strong predictive technique.

Data science algorithms such as neural nets are dimensional: they rely on vectors. Each dimension (field) is placed in the appropriate column of the vector. Neural nets analyze these vectors for patterns, using multiple layers to account for interactions. Nodes in each layer are connected to nodes in the previous and next layers, and activation equations determine which nodes fire into the next layer. The output is used for prediction. In practice, one or two hidden layers are often sufficient. A hidden layer stands between the input layer (vectors separated by categories) and the output layer.
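
As a tiny illustration of the idea, a single hidden layer boils down to a weighted sum pushed through an activation function at each node; the weights and inputs below are made-up placeholders, not trained or real values:

def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

// activations of one layer: for each node, a weighted sum of the inputs plus a bias
def layer(input: Array[Double], weights: Array[Array[Double]], bias: Array[Double]): Array[Double] =
  weights.zip(bias).map { case (w, b) =>
    sigmoid(w.zip(input).map { case (wi, xi) => wi * xi }.sum + b)
  }

// input vector: e.g. (BMI, weekly mileage, average pace)
val input  = Array(22.5, 40.0, 7.5)
val hidden = layer(input, Array(Array(0.1, -0.2, 0.3), Array(0.4, 0.0, -0.1)), Array(0.0, 0.1))
val output = layer(hidden, Array(Array(0.5, -0.5)), Array(0.0)) // a single predicted score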

Forming the input layer depends on the goal, whether it is mileage, time, or both. Clustering algorithms can be used to separate the vectors into common categories. Algorithms exist to determine the proper number of clusters.

The convolutional neural network takes ideas from imaging and utilizes a convolution kernel to further improve the network.

Unfortunately, neural nets take large amounts of data to train. To be accurate, the various hidden patterns and input categories need enough data to establish them. This could be five or more points per category, but the 32-data-point rule of thumb is likely better. An alternative is a form of multiple regression, most likely logistic regression or, less likely, exponential regression. Gains are likely to diminish as people become more fit and train harder.

Final Remarks

Stay tuned and I will try to find a working model.