Checking and Increasing the Connection Limit in PostgreSQL

It finally happened. I blew the max connection limit in my new  PostgreSQL install by starting too many grid nodes  with associated connection pools for a system I am writing.

The default limit of 100 connections is far too few, especially with two people doing database-intensive work. This article explains how to check the number of connections in use, both as a standard user and as an administrator, and how an administrator can raise the connection limit.

Default Connection Limit and Buffer Size

By default, PostgreSQL has a relatively low number of maximum allowed connections. The default limit is 100.

The limit is related to the size of the shared buffers: every connection consumes shared memory plus its own working memory. The default shared_buffers setting is also conservative, only 128MB in recent releases, although the configuration shown later in this article raises it to 8 gigabytes.

PostgreSQL is a versatile database; Amazon even built Redshift on top of it. However, the default limits are deliberately low to discourage developers and students from exhausting their system's resources.

Common Error When Exceeding the Connection Limit

With ever-increasing amounts of data and an ever-growing reliance on technology, the connection limit will be exceeded on many systems. PostgreSQL throws the following error when this occurs:

psql: FATAL: remaining connection slots are reserved for non-replication superuser connections

In my case, the connection pool failed to reserve the ten to twenty connections required per database-intensive job and instantly released whatever connections it did acquire back to the database. This left me with 96 connections in use and four that were unattainable. Some of the 96 connections were held by an ETL tool or pgAdmin.

Querying the Connection Limit

It is possible for non-administrators to check the number of connections in use with the following query:

SELECT sum(numbackends) FROM pg_stat_database

This query reads the per-database statistics view and sums the number of backends currently connected to each database.
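
If you prefer per-connection detail rather than a single total, an equivalent check is to count the rows in pg_stat_activity, which exposes one row per server process:

SELECT count(*) FROM pg_stat_activity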

Administrators, and in fact any user, can query the configured limit from the psql command line using:

SHOW max_connections

This query asks the database to return the value of the max_connections configuration variable described below.

Most configuration variables can be queried in psql. Some can even be set from the console at the session level. The max_connections variable cannot be changed at runtime because changing it would alter the server's memory usage.
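
As an illustration (the values below are purely examples), a session-level variable such as work_mem can be changed on the fly, while max_connections can only be persisted to the configuration, for instance with ALTER SYSTEM as a superuser on PostgreSQL 9.4 or later, and takes effect only after a restart:

-- applies immediately, for the current session only
SET work_mem = '64MB'

-- written to postgresql.auto.conf; still requires a server restart to take effect
ALTER SYSTEM SET max_connections = 200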

Setting Connection Related Variables

While the initial values are low by today's standards, they can be changed in your postgresql.conf file. The file is typically found at a location such as:

/var/lib/pgsql/X.X/data/postgresql.conf  or /usr/local/postgres/postgresql.conf

The default location is the data directory under your PostgreSQL installation.

Look for the following variables in the file:

shared_buffers = 8000MB
max_connections = 100

Choosing Size Limits

It is important to set limits that fit the size of the shared buffers and the available RAM. Dividing PostgreSQL's current memory usage by the number of connections in use (or by max_connections) gives a rough over-estimate of memory use per connection. Ensure that this estimate, multiplied by the number of connections you intend to allow, exceeds neither the shared buffer size nor the capacity of your RAM. The shared buffer size itself should generally stay well below the amount of RAM in the machine.

The inequalities to check are:

shared_buffers < (RAM - padding)
(current_ram_use / max_connections) * desired_connections < (RAM - padding)
(current_ram_use / max_connections) * desired_connections < shared_buffers

Basically, ensure that you have enough RAM and that the potential memory use never exceeds either the available RAM or the shared buffer size. The inequalities use the max_connections setting rather than the actual number of connections in use for some extra headroom.
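
For example, with entirely hypothetical numbers: on a machine with 16GB of RAM and 1GB of padding, a server using roughly 1GB across 100 allotted connections averages about 10MB per connection. Raising the limit to 300 connections would then account for roughly 3GB, which stays below both the 15GB of usable RAM and an 8000MB shared_buffers setting, so the increase passes all three checks.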

Reloading and Restarting

The configuration file can be reloaded into an active server without a restart, either from the SQL command line or through pgAdmin, with the query:

SELECT pg_reload_config()

This function requires superuser privileges by default and will not apply changes to the maximum number of allotted connections or to the buffer size. Any variable in the configuration file marked with the following comment requires restarting the database server:

# (change requires restart)

Both max_connections and shared_buffers change the server's memory usage, which cannot be reworked on the fly. From the Linux command line, restart the server using the distribution's init script, which on Ubuntu normally resides in /etc/init.d:

sudo /etc/init.d/postgresql restart

If PostgreSQL is set up as a service, use:

service postgresql restart
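
On newer distributions that use systemd, the equivalent command (assuming the service is named postgresql) is:

sudo systemctl restart postgresql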

Conclusion

In today’s world where data is bountiful, it is easy to exceed the connection limit in PostgreSQL. This article reviewed how to check and raise the connection limit, along with the considerations for sizing it against the shared buffers.

A Guide to Defining Business Objectives

It can be said that, in the world of the working developer, when the client or boss stops complaining and critiquing, the developer has been automated to the point of redundancy. Basically, if no one is complaining, the developer is out of work.

Everyone wants a flying car, even when the job involves ingestion. However, there is a fine line in a business between a business requirement and idealism or ignorance. This is especially true when reporting to a boss whose technical competence and knowledge are less than stellar. So what rules have I discovered that will keep you employed? How do you avoid going too far off track when creating SCRUM tasks? Read on.

This article is an opinion piece on defining objectives and, to some extent, building the infrastructure required for a project.

Defining Core Objectives

All requirements should be tightly scoped to the end client's actual needs. This means being social with the end user. The first thing to do is find the most pessimistic person in the office and the person who will be using your product most. They may be the same person. Strike up a nice conversation with them. Get to know them and then start asking questions about what they absolutely need. If multiple people in the office interact directly with your end product, work from the most pessimistic and necessary to the least, establishing an absolute core of necessity.

By default, your boss will usually be the most optimistic person in the office. They are often less knowledgeable about the technical requirements, and often about technology in general. Your goal is to temper their expectations to something reasonable. I have found this to be true of business partners as well. You should understand what they want, as they will provide a small set of absolute requirements, but also keep in mind that if they can squeeze gold from water, they will.

If you find yourself limiting your objectives, use memos and presentations to make sure that you are providing a solid line of reasoning for why something cannot be done. Learn the IEEE standards for documentation and get familiar with Microsoft Office or Libre Office. Always offer a solution and state what would be needed to actually accomplish an objective and why it may be infeasible. In doing so, you may find a compromise or a solution. Offer them as alternatives. Do not be overly technical with the less technical.

My line of work requires providing relatively sensitive information in bulk, at speed, with a fair degree of normalization and quality. Basically, I am developing and building a large distributed ingestion and ETL engine with unique requirements that do not fit existing tools. This process has been a wreck: I was new to development coming in, given a largely inappropriate set of technology, ignored, and asked to build a Netflix-style stack from the hulk of GeoCities.

Defining business requirements was the first task I struggled with. Competing and even conflicting interests came from everywhere. My boss wanted, and to a large degree still wants, an auto-scaling system with a great degree of statistical prediction on top of a massive number of sources, all from just one person. My co-workers, clients in the strictest sense, want normalization, de-duplication, anomaly detection, and a large number of enhancements on top of a high degree of speed. No one seemed to grasp the technical requirements, but everyone had an idea of what they needed.

In solving this mess, I found that my most valuable resource was the end user who was both the most pessimistic about what we could deliver and less technically skilled than hoped for. She is extremely bright and quite amazing, so bringing her up to speed was a simple task. However, she was very vague about what she wanted. In this case, I was able to discern requirements from my boss's optimism and a set of questions posed to her. As she also creates the tickets stemming from issues in the system, she indirectly defines our objectives as well.

Available Technology

The availability of technology will determine how much you can do. Your company will often try to provide less than the required amount of technology. Use your standards-based documentation, cost models, and business writing to jockey for more. If you are under-respected, find someone who has clout and push them to act on your behalf.

As a junior employee several years ago, I found myself needing to push for basic technologies. My boss wanted me to use Pentaho for highly concurrent yet state-based networking tasks on large documents ranging from HTML to PDF. In addition, he wanted automation and a tool that could scale easily. Pentaho was a poor choice. Worse, I had no server access. It took one year before I could start leaning on a more senior employee to lobby for more leniency, another year and a half before we had servers, and another year before I had appropriate access. If I were not developing a company, one that now has clients, I would have quit. The important takeaway: get to know your senior employees and use them on your behalf when you need to. Bribes can be paid in donuts where I work.

Promise Appropriately, Deliver With Quality

Some organizations require under-promising and over-delivering. These tend to be large organizations with performance review systems in desperate need of an overhaul. If you find yourself in such a situation, lobby to change the system. A solid set of reasoning goes a long way.

Most of us are in a position to promise an appropriate number of features with improvements over time. If you use SCRUM, this model fits perfectly. Devise your tasks around it. Know who is on your team and promise an appropriate unit of work. Sales targets are built around what you can deliver; they are kept through your quality and the ease of handling your product. Do not deliver too little or you will be fired, but do not promise so much that you raise expectations to an unsatisfiable level.

In my ingestion job, promises are made based on quality and quantity. I use the SCRUM model to refine our promises. Knowing my new co-worker's capacity (fairly dismal) and my own (swamped with creating the tool), I can temper our tasks to meet business goals. Over time, we are able to layer more business requirements on top of increasing the number of sources being output and improving existing tools.

Hire Talent

If you are in the position of hiring people to expand on what you can achieve, I do not recommend telling your boss that an entry-level position will suffice, as they will then find someone with no skill. Also, push to be in the loop on the hiring process. The right person can make or break a project. My current co-worker is stuck re-running old tasks because he had no knowledge of our required tools and concepts, despite my memo to my boss. Over time he will get better, but with little skill, that may take too long. Sometimes the most difficult higher-ups are those who are nice at heart but ignorant in practice.

Tickets

Your core requirements are not the ten commandments. You are not defining a society and universal morals but a more organic project. Requirements and objectives will change over time. The best thing you can do is establish a ticketing system; choose a solid one, as changing to a different tool later is difficult. Patterns from these tickets will create new tools, define more requirements, and help you better define your process.

In finding an appropriate system, ask:

  • Do I need an API that can interact with my or my clients' tools?
  • Do I need SCRUM and Kanban capabilities on top of ticketing?
  • How hard is it to communicate with the client or end user?

At my work, I implemented a manual SCRUM board for certain tasks, which had a positive impact on my overwhelmed co-worker, who found JIRA cumbersome and full of lag. It is. We use JIRA for bug reporting and its associated Kanban capabilities.

Cost

Cost is the lurking issue many will ignore. You need to document cost and use it to explain the feasibility of an objective. When possible, create statistical tools that you can use to predict the burden on profitability and justify decisions. Money is the most powerful reasoning tool you have.

Conclusion

This opinion piece reviewed my lessons for entry-level software developers looking to learn how to define business objectives. Overall, my advice is to:

  • Define core objectives starting with the most important and pessimistic users
  • Dive into your boss's core requirements and use their optimism to define the icing on the cake
  • Build on your objectives and requirements over time
  • Be involved in your boss's decisions
  • Define an appropriate number of objectives that allow you to deliver quality work (you will build on your past work over time)
  • Communicate and use an appropriate project management framework
  • Track costs and build statistical tools
  • Learn IEEE standards-based documentation such as Software Design Documents and Database Design Documents, and get familiar with business writing
  • Make sure you hire the right people

 

Akka: Resolving a Split Brain Without a Headache

Anyone who develops distributed systems knows that there are many issues to resolve before reaching stability. Akka does not entirely avoid these issues and, while many can be handled through configuration or a few lines of code, some problems require extra legwork. One major problem is the split brain. This article explains what a split brain is and examines a solution that does not involve paying Lightbend.

See also:

Akka: An Introduction

Split Brains

Split brains  are the cell division of the concurrent programming world. When different nodes on a cluster cannot reach one another, they must decide how to handle the nodes they cannot reach. Without proper configuration in Akka, the nodes merely assume the other nodes are down and remove or gate them.  The previously single cluster has divided into two separate clusters.

In the world of concurrent programming, the question is not whether a split brain will occur but when. Networks crash. Hardware fails, needs to be upgraded or updated, or requires replacement every so often.

Unfortunately, there is no free way to automatically handle the problem in Akka. Auto downing, the only freely available method for resolving the unreachable state,  is actually not a solution to the split brain problem and will result in the separation of nodes into different clusters.

The following graphic on cell division illustrates the split brain problem. Notice how the two cells are completely independent of each other and yet perform the same role.

[Figure: Major events in mitosis]

Strategies for Resolving a Split Brain

Lightbend, the company behind Akka, lays out several strategies for resolving a split brain. In a nutshell, they are:

  • Static quorum: down the unreachable side unless the reachable nodes fall below a fixed quorum size
  • Keep majority: keep the partition holding the majority of nodes and down the rest
  • Keep oldest: keep the partition containing the oldest member of the cluster
  • Down all: when in doubt, down every node and rely on the cluster being restarted

Unfortunately, Lightbend requires a paid subscription to access implementations of these strategies.

Custom Majority Split Brain Resolver

While the folks behind Akka do not provide free solutions to the split brain problem, they do provide the tools to implement one of the aforementioned strategies.

The following code utilizes the majority strategy:
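
A minimal sketch of such a resolver is shown below, assuming Akka's standard cluster membership APIs. The MajoritySplitBrainResolver name, the use of node addresses rather than actor references, and the ten second stabilization delay are illustrative choices, not anything prescribed by Akka:

import scala.concurrent.duration._
import akka.actor.{Actor, ActorLogging, Address, Cancellable}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

class MajoritySplitBrainResolver extends Actor with ActorLogging {

  val cluster = Cluster(context.system)
  var unreachable = Set.empty[Address]
  var downingTask: Option[Cancellable] = None

  import context.dispatcher

  // Ask the cluster to send reachability events to this actor.
  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[UnreachableMember], classOf[ReachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive: Receive = {
    case UnreachableMember(member) =>
      unreachable += member.address
      scheduleDowning()
    case ReachableMember(member) =>
      unreachable -= member.address
  }

  // After a stabilization delay, down the unreachable nodes, but only if this
  // side of the partition still holds the majority of the known members.
  private def scheduleDowning(): Unit = {
    downingTask.foreach(_.cancel())
    downingTask = Some(context.system.scheduler.scheduleOnce(10.seconds) {
      val total = cluster.state.members.size
      val reachableCount = total - unreachable.size
      if (reachableCount > total / 2) {
        log.warning(s"Downing ${unreachable.size} unreachable node(s)")
        unreachable.foreach(cluster.down)   // the minority side should down itself instead
      }
    })
  }
}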

The preStart method subscribes the actor to reachability events in the cluster. When an UnreachableMember event arrives, the resolver records the unreachable node and schedules the downing of all unreachable nodes after a stabilization period, provided the nodes it can still reach constitute a majority of the cluster.

Conclusion

A split brain is a serious problem. We reviewed strategies for resolving it and sketched a free solution based on the majority strategy.

 

PostgreSQL: A Faster Way to Check Column Existence

We often need to check whether a column exists before creating it, especially when inserting dynamically generated data into PostgreSQL. This brief article shows how to perform this check quickly and offers a PL/pgSQL solution for creating new columns in versions older than PostgreSQL 9.6.

Querying The Catalog

There are several ways to search the PostgreSQL catalog tables. The easiest for the programmer is to use the information_schema. The fastest is to build a custom query.

While it is enticing to query the information schema, it is not a fast operation: multiple nested loops are used to pull from the catalog tables even when looking for a single attribute.

The following query on the information_schema results in running a rather large set of operations nested within several loops:

EXPLAIN SELECT count(*) > 0 FROM information_schema.columns WHERE table_schema LIKE 'us_ut_sor' AND table_name LIKE 'dirtyrecords' AND column_name LIKE 'bullage'

When run, the output is as follows:

"Aggregate  (cost=3777.32..3777.33 rows=1 width=0)"
"  ->  Nested Loop Left Join  (cost=2.39..3777.32 rows=1 width=0)"
"        ->  Nested Loop  (cost=1.83..3775.73 rows=1 width=4)"
"              ->  Nested Loop Left Join  (cost=1.54..3775.42 rows=1 width=8)"
"                    Join Filter: (t.typtype = 'd'::"char")"
"                    ->  Nested Loop  (cost=0.84..3774.50 rows=1 width=13)"
"                          ->  Nested Loop  (cost=0.42..3770.19 rows=1 width=8)"
"                                ->  Nested Loop  (cost=0.00..3740.84 rows=1 width=8)"
"                                      Join Filter: (c.relnamespace = nc.oid)"
"                                      ->  Seq Scan on pg_namespace nc  (cost=0.00..332.06 rows=1 width=4)"
"                                            Filter: ((NOT pg_is_other_temp_schema(oid)) AND (((nspname)::information_schema.sql_identifier)::text ~~ 'us_ut_sor'::text))"
"                                      ->  Seq Scan on pg_class c  (cost=0.00..3407.26 rows=121 width=12)"
"                                            Filter: ((relkind = ANY ('{r,v,f}'::"char"[])) AND (((relname)::information_schema.sql_identifier)::text ~~ 'dirtyrecords'::text))"
"                                ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.42..29.35 rows=1 width=14)"
"                                      Index Cond: ((attrelid = c.oid) AND (attnum > 0))"
"                                      Filter: ((NOT attisdropped) AND (((attname)::information_schema.sql_identifier)::text ~~ 'bullage'::text) AND (pg_has_role(c.relowner, 'USAGE'::text) OR has_column_privilege(c.oid, attnum, 'SELECT, INSERT, UPDATE, REFE (...)"
"                          ->  Index Scan using pg_type_oid_index on pg_type t  (cost=0.42..4.30 rows=1 width=13)"
"                                Index Cond: (oid = a.atttypid)"
"                    ->  Nested Loop  (cost=0.70..0.90 rows=1 width=4)"
"                          ->  Index Scan using pg_type_oid_index on pg_type bt  (cost=0.42..0.58 rows=1 width=8)"
"                                Index Cond: (t.typbasetype = oid)"
"                          ->  Index Only Scan using pg_namespace_oid_index on pg_namespace nbt  (cost=0.29..0.31 rows=1 width=4)"
"                                Index Cond: (oid = bt.typnamespace)"
"              ->  Index Only Scan using pg_namespace_oid_index on pg_namespace nt  (cost=0.29..0.31 rows=1 width=4)"
"                    Index Cond: (oid = t.typnamespace)"
"        ->  Nested Loop  (cost=0.56..1.57 rows=1 width=4)"
"              ->  Index Scan using pg_collation_oid_index on pg_collation co  (cost=0.28..0.35 rows=1 width=72)"
"                    Index Cond: (a.attcollation = oid)"
"              ->  Index Scan using pg_namespace_oid_index on pg_namespace nco  (cost=0.29..1.21 rows=1 width=68)"
"                    Index Cond: (oid = co.collnamespace)"
"                    Filter: ((nspname  'pg_catalog'::name) OR (co.collname  'default'::name))"

This is truly nasty; any plan doing this much nested, looped work is less than ideal for a simple existence check.

The work can be reduced by querying the catalog tables directly. The previous query merely checks whether a column exists under a given table and schema. The following custom query performs the same check much faster:

EXPLAIN SELECT count(*) > 0
FROM (
    SELECT q1.oid, q1.relname, q1.relowner, q1.relnamespace, q2.nspname
    FROM (SELECT oid, relname, relowner, relnamespace FROM pg_class) AS q1
    INNER JOIN (SELECT oid, * FROM pg_catalog.pg_namespace) AS q2
        ON q1.relnamespace = q2.oid
    WHERE q1.relname LIKE 'dirtyrecords' AND q2.nspname LIKE 'us_ut_sor'
) AS oq1
INNER JOIN (SELECT attrelid, attname FROM pg_attribute) AS oq2
    ON oq1.oid = oq2.attrelid
WHERE oq2.attname LIKE 'bullage'

While the query text is longer, far fewer operations are performed and the estimated cost is comparatively much lower:

"Aggregate  (cost=292.44..292.45 rows=1 width=0)"
"  ->  Nested Loop  (cost=0.84..292.43 rows=1 width=0)"
"        ->  Nested Loop  (cost=0.42..289.64 rows=1 width=4)"
"              ->  Seq Scan on pg_namespace  (cost=0.00..281.19 rows=1 width=4)"
"                    Filter: (nspname ~~ 'us_ut_sor'::text)"
"              ->  Index Scan using pg_class_relname_nsp_index on pg_class  (cost=0.42..8.44 rows=1 width=8)"
"                    Index Cond: ((relname = 'dirtyrecords'::name) AND (relnamespace = pg_namespace.oid))"
"                    Filter: (relname ~~ 'dirtyrecords'::text)"
"        ->  Index Only Scan using pg_attribute_relid_attnam_index on pg_attribute  (cost=0.42..2.79 rows=1 width=4)"
"              Index Cond: ((attrelid = pg_class.oid) AND (attname = 'bullage'::name))"
"              Filter: (attname ~~ 'bullage'::text)"

 

Notice that the estimated cost of the first query was 3777.32 while the second was merely 292.44, better by more than an order of magnitude.

PL/pgSQL Function

For database versions prior to PostgreSQL 9.6, which introduced the syntax ALTER TABLE x ADD COLUMN IF NOT EXISTS y TYPE, the following PL/pgSQL function performs the desired table alteration:

 

CREATE OR REPLACE FUNCTION add_column_if_not_exists(schema_name varchar(63), table_name varchar(63), column_name varchar(63), column_type varchar(1024)) RETURNS void AS
$BODY$
    DECLARE
        column_exists BOOLEAN;
    BEGIN
        IF schema_name IS NOT NULL THEN
            -- Check the catalog for the column, restricted to the given schema
            SELECT count(*) > 0 INTO column_exists
            FROM (SELECT q1.oid, q1.relname, q1.relowner, q1.relnamespace, q2.nspname
                  FROM (SELECT oid, relname, relowner, relnamespace FROM pg_class) AS q1
                  INNER JOIN (SELECT oid, * FROM pg_catalog.pg_namespace) AS q2
                      ON q1.relnamespace = q2.oid
                  WHERE q1.relname LIKE table_name AND q2.nspname LIKE schema_name) AS oq1
            INNER JOIN (SELECT attrelid, attname FROM pg_attribute) AS oq2
                ON oq1.oid = oq2.attrelid
            WHERE oq2.attname LIKE column_name;
            IF column_exists IS FALSE THEN
                EXECUTE 'ALTER TABLE '||schema_name||'.'||table_name||' ADD COLUMN '||column_name||' '||column_type;
            END IF;
        ELSE
            -- No schema supplied: skip the pg_namespace join
            SELECT count(*) > 0 INTO column_exists
            FROM (SELECT oid, relname, relowner, relnamespace FROM pg_class WHERE relname LIKE table_name) AS oq1
            INNER JOIN (SELECT attrelid, attname FROM pg_attribute) AS oq2
                ON oq1.oid = oq2.attrelid
            WHERE oq2.attname LIKE column_name;
            IF column_exists IS FALSE THEN
                EXECUTE 'ALTER TABLE '||table_name||' ADD COLUMN '||column_name||' '||column_type;
            END IF;
        END IF;
    END;
$BODY$
LANGUAGE plpgsql;

We did not create a trigger that fires on every ALTER statement, to avoid adding cost when it is not desired. The provided function also avoids the costly pg_namespace join when no schema is supplied.
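
As a quick sanity check, the function can be invoked directly. The call below reuses the schema and table from the earlier examples; the new column name and type are placeholders:

SELECT add_column_if_not_exists('us_ut_sor', 'dirtyrecords', 'bullage_verified', 'boolean')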

Conclusion

In this article, we discovered that the information schema is not as ideal as it seems. Armed with this knowledge, we created a faster function that adds a column to a table only if it does not already exist.

Akka: An Introduction

Akka's documentation is immense. This series helps tackle its many components by providing a working example of the master-slave design pattern built with this powerful tool. The following article reviews the higher-level concepts behind Akka and its usage.

Links are provided to different parts of the Akka documentation throughout the article.

See also:

Akka: Resolving A Split Brain Without a Headache

Akka

Akka is a software toolkit used to build multi-threaded and distributed systems based on the actor model. It takes care of the lower-level plumbing by providing high-level APIs for node and actor creation.

Actors are the primitives behind Akka. They are useful for performing repeated tasks concurrently.  Actors run until terminated, receiving work through message passing.

[Figure: an actor system]

Resource Usage

Akka is extremely lightweight. The creators boast that the tool can handle thousands of actors on a single machine.

Message passing occurs through mailboxes. The maximum number of messages a mailbox holds is configurable, with a default of 1,000 messages, and individual messages must remain under one megabyte.

The Actor

The actor is the universal primitive used in Akka. Unlike a thread in a programming language, this primitive runs like a daemon server and, as such, should be shut down gracefully.

Actors are user-created:

import akka.actor.{Actor, ActorLogging, ActorRef, ActorSystem, Props}
import com.typesafe.config.ConfigFactory

class MyActor extends Actor with ActorLogging{

     override def preStart()= log.debug("Starting")
     override def postStop()= log.debug("Stopping")
     override def preRestart(reason: Throwable, message: Option[Any]) =
         log.error(s"Restarting because of ${reason.getMessage}. ${message}")
     override def postRestart(reason : Throwable) = log.debug("Restarted")

     // receive is a parameterless abstract member of Actor
     override def receive: Receive = {
         case _ => sender ! "Hello from Actor"
     }
}

object MyActor{
   def setupMyActor()={
        val conf = ConfigFactory.load()
        val system = ActorSystem("MySystem",conf)
        val actor : ActorRef = system.actorOf(Props[MyActor],name = "myactor")
   }
}

 

The example above defines an actor and a Scala companion object for instantiating it.

Actors must extend Actor. ActorLogging provides the log member. The optional preRestart and postRestart hooks run around restarts caused by exceptions, while the optional preStart and postStop methods handle setup and teardown tasks. The basic actor above incorporates logging and error handling.

An actor can:

  • Create and supervise other actors
  • Maintain a State
  • Control the flow of work in a system
  • Perform a unit of work on request or repeatably
  • Send and receive messages
  • Return the results of a computation

Akka’s serialization is extremely powerful. Anything available across a cluster or on the classpath that implements Serializable can be sent to and from an actor. Instances of classes are de-serialized without having the programmer recast them.
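
As a small illustration (the WorkItem type and the worker reference are hypothetical), a serializable case class can simply be sent to an ActorRef and arrives as a typed instance on the receiving side:

import akka.actor.ActorRef

// Hypothetical message type; case classes are Serializable by default.
case class WorkItem(id: Long, payload: String)

def submit(worker: ActorRef): Unit =
  worker ! WorkItem(1L, "hello")   // delivered and deserialized as a WorkItem on the other end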

When to Use Akka

Actor systems are not a universal solution. When not performing repeated tasks and not benefiting from high levels of concurrency, they are a hindrance.

State persistence also weighs heavily in the use of an actor system. Take the example of cookies in network requests. Maintaining different network sessions across different remote actors can be highly beneficial in ingestion.

Any task provided to an actor should contain very little new code and a limited number of configuration variables.

Questions that should be asked based on these concepts include:

  • Can I break up tasks into sufficiently large quantities of work to benefit from concurrency?
  • How often will  tasks be repeated in the system?
  • How minimal can I make the configuration for the actor if necessary?
  • Is there a need for state persistence?
  • Is there a significant need for concurrency or is it a nice thought?
  • Is there a resource constraint that distribution can solve or that will limit threading?

State is definitely a large reason to use Akka. This could take the form of variables an actor maintains or of state embodied in the actor itself.

In some distributed use cases involving enormous numbers of short-lived requests, the actor's own state and Akka's mailbox capabilities are what matter most. This is the reasoning behind tools built on Akka such as Spark.

As is always the case when deciding to create a new system, the following should be asked as well:

  • Can I use an existing tool such as Spark or TensorFlow?
  • Does the time required to build the system outweigh the overall benefit it will provide?

Clustering

Clustering is available in Akka, which provides high-level APIs for generating distributed clusters. Specified seed nodes serve as the entry points through which new nodes join the cluster.

Network design is left entirely to the developer. Since node generation, logging, fault tolerance, and basic communication are the only pieces of a distributed system Akka handles, any distribution model will work. Two common models are master-slave and graph-based topologies.

Akka clusters are resilient with well developed fault tolerance.

Configuration

Configuration is performed either through a file or in a class. Many components can be configured including logging levels, cluster components, metrics collection, and routers.
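
As a sketch of programmatic configuration, the snippet below layers a few common settings over the defaults loaded from application.conf; the keys are standard Akka settings, but the values and the seed node address are placeholders:

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object ConfiguredSystem {
  // In-code settings take precedence over the file-based configuration.
  val config = ConfigFactory.parseString(
    """
      |akka {
      |  loglevel = "DEBUG"
      |  actor.provider = "akka.cluster.ClusterActorRefProvider"
      |  remote.netty.tcp.port = 2551
      |  cluster.seed-nodes = ["akka.tcp://MySystem@127.0.0.1:2551"]
      |}
    """.stripMargin).withFallback(ConfigFactory.load())

  val system = ActorSystem("MySystem", config)
}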

Conclusion

This article is the entry point for the Akka series, providing the basic understanding needed as we begin to build a cluster.

Sbt Pack With Xerial

Adding jars to a classpath should not be a chore. Often, using retrieveManaged in an SBT build is not quite what we want: when dealing with more than a few dependencies, having each one placed in its own folder is problematic. This article discusses a solution to this issue using Xerial's sbt-pack plugin.

 

Xerial Pack

Xerial offers a plugin that packages all jars into a single folder and creates launch scripts for the configurable main class. This allows every jar to be placed on the classpath without listing every folder, and it produces a single directory holding all dependencies.

Simply place the following in project/plugins.sbt:

addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.8.2")  // for sbt-0.13.x or higher

Then specify the packaging options in build.sbt:

packAutoSettings

In this instance, the main class will be automatically found. More options are discussed at the Xerial Github page.
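
If you would rather name the launch script and main class explicitly instead of relying on auto-detection, sbt-pack also accepts an explicit mapping in build.sbt; the script and class names below are placeholders:

packSettings

packMain := Map("myprogram" -> "com.example.Main")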

Packaging the jars requires a single command:

sbt pack
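
After the command completes, the packaged distribution typically lands under target/pack, with the dependency jars collected in a lib directory and the generated launch scripts in bin.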

Conclusion

Using the classpath with many dependencies does not need to be a chore. Simply import the Xerial plugin and run sbt pack.

Enriching Scala With Implicits

Imagine you are an ETL developer using Spring Cloud Data Flow. Nothing is really available for distributed systems and streaming ETL that is as powerful as this tool. Alteryx and Pentaho are at least a year away from pushing out anything as capable. While Pentaho might work, there are just too many holes to fill.

However, you could do with a more compact language than Java when programming for Spring. A powerful solution is to combine the Spring ecosystem with Scala, using implicits to eliminate redundant code.

This article focuses on using the enrichment pattern in Scala code through the IterableLike library and the concept of the implicit.

 

[Image: a clock. Time is money.]

Implicits

Implicits allow values that are in scope to be supplied automatically by the compiler when arguments are not passed explicitly. Only one implicit of a given type should be visible in a scope:

implicit val myImplicitString : String = "hello there"

def printHello(implicit str : String)={
   println(str)//should print "hello there"
}

printHello // the compiler fills in myImplicitString

This snippet creates an implicit string and lets the compiler supply it to the method printHello. Defining two implicit strings in the same scope makes the implicit ambiguous and causes a compilation error.

Implicits also make the enrichment pattern, described below, possible: by attaching an implicit conversion to a library type, methods that do not exist on that type can be resolved through the conversion when they are called.

Remap An Object

Our enrichment example contains a method that removes an item from an IterableLike object only when a specific condition holds:

import scala.collection.IterableLike
import scala.collection.generic.CanBuildFrom

class IterableScalaFuncs[A,Repr](xs : IterableLike[A,Repr]){

    /**
      * Remove an object by a single property value.
      * @param f              The thunk to use
      * @param cbf            The CanBuildFrom to get a builder which should not be touched
      * @tparam That          The result type
      * @return That which should just be of type A
    */
    def removeObjectMatching[That](f : A => Boolean)(implicit cbf : CanBuildFrom[Repr,A,That]):That={
        val builder = cbf(xs.repr)
        val it = xs.iterator

        while(it.hasNext){
            val o = it.next()
            if(!f(o)){
                builder += o
            }
        }
        builder.result()
    }

}

IterableScalaFuncs contains a removeObjectMatching method that takes the result as a type parameter, the thunk to match against in the first parameter list, and implicitly receives the appropriate CanBuildFrom for the IterableLike in the next parameter list. It creates a builder for our collection type, populates it with the objects that do not match the thunk, and returns a new collection with the matching items removed.

Enrichment Pattern

The enrichment pattern in Scala extends existing libraries by attaching new methods to them, as opposed to creating a wrapper class that must be instantiated explicitly. It relies on implicitly converting existing types, and it allows the added methods to be used just as easily from the console as from classes.

The class in the previous section can be attached to all IterableLike collections implicitly:

/**
  * The enrichment for Iterables wrapping IterableLike with IterableScalaFuncs
  * @param xs       Our IterableLike object
  * @tparam A       The type of the Iterable
  * @tparam Repr    Traversable Repr
*/
implicit def enrichIterable[A, Repr](xs: IterableLike[A, Repr]) = new IterableScalaFuncs(xs)

The method enrichIterable attaches to the target collections and is used as follows:

import ScalaFuncImplicits._
val list : List[(Int,Int)] = List[(Int,Int)]((1,2),(2,3)).removeObjectMatching(_._1 == 1) //should produce List((2,3))

Conclusion

This article reviewed the power of Scala Implicits to reduce redundant code without accounting for every type. The enrichment pattern can be used to easily integrate methods with existing libraries.

Code is available at Github