Checking and Increasing the Connection Limit in PostgreSQL

It finally happened. I blew through the max connection limit in my new PostgreSQL install by starting too many grid nodes, each with its own connection pool, for a system I am writing.

The default limit of 100 connections is far too low, especially with two people doing database-intensive work. This article explains how to check the number of connections in use as a standard user or administrator, and how an administrator can raise the connection limit.

Default Connection Limit and Buffer Size

By default, PostgreSQL allows a relatively low number of concurrent connections: the limit is 100.

The limit goes hand in hand with the shared buffer size, since each connection consumes memory on top of what the shared buffers already claim. The default shared_buffers setting is similarly conservative (128MB on recent releases).

PostgreSQL is a versatile database; Amazon even built Redshift on top of it. Still, the default limits are intentionally low to discourage developers and students from overwhelming their machines' resources.

Common Error When Exceeding Size Limit

With ever-increasing amounts of data and ever more services built on top of the database, the connection limit will be exceeded on many systems. PostgreSQL throws the following error when this happens:

psql: FATAL: remaining connection slots are reserved for non-replication superuser connections

In my case, the connection pool failed to reserve the ten to twenty connections each database-intensive job required and instantly released whatever connections it did acquire back to the database. This left me with 96 connections in use and four that were unattainable. Some of the 96 were held by an ETL tool and pgAdmin.

Querying the Connection Limit

It is possible for non-administrators to check the number of connections in use with the following query:

SELECT sum(numbackends) FROM pg_stat_database

This query reads the pg_stat_database statistics view and totals the backends currently connected across all databases.
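
For a finer-grained picture, such as which databases and users hold the connections, the pg_stat_activity view can be grouped instead. This is only a sketch; non-superusers will see all sessions but only the details of their own:

SELECT datname, usename, count(*) AS connections
FROM pg_stat_activity
GROUP BY datname, usename
ORDER BY connections DESC;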

The configured limit itself can be queried from the psql command line using:

SHOW max_connections

This query asks the database to return the value of the max_connections configuration variable described below.

Most configuration variables can be queried in psql, and some can even be changed from the console with SET. The max_connections variable cannot be changed at runtime, because altering it changes the server's shared memory requirements.
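
The pg_settings view offers a convenient way to inspect several related variables at once; its context column shows whether a change needs only a reload or a full restart:

SELECT name, setting, unit, context
FROM pg_settings
WHERE name IN ('max_connections', 'shared_buffers', 'superuser_reserved_connections');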

Setting Connection Related Variables

While the initial settings are low by today's standards, they can be changed in your postgresql.conf file. The file is typically found at a location such as:

/var/lib/pgsql/X.X/data/postgresql.conf or /usr/local/postgres/postgresql.conf

The default location is the data directory under your PostgreSQL installation.

Look for the following variables in the file:

shared_buffers = 8000MB
max_connections = 100
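
On 9.4 and newer, the same settings can also be written to postgresql.auto.conf with ALTER SYSTEM rather than editing the file by hand. The values below are illustrative, not recommendations:

ALTER SYSTEM SET max_connections = 300;
ALTER SYSTEM SET shared_buffers = '8000MB';

Either way, the new values only take effect after a restart, as discussed below.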

Choosing Size Limits

It is important to set limits that respect the size of the shared buffers. Dividing current RAM usage by the number of connections in use (or by the configured maximum) gives a comfortable over-estimate of memory per connection. Make sure that this figure, multiplied by the desired number of connections, does not exceed the shared buffer size or the capacity of your RAM, and keep the shared buffer size itself below the amount of RAM in the machine.

The inequalities to check, all of which should hold, are:

shared_buffers < (RAM - padding)
(current_ram_use / max_connections) * desired_connections < (RAM - padding)
(current_ram_use / max_connections) * desired_connections < shared_buffers

Basically, ensure that you have enough RAM and that the projected memory use exceeds neither the available RAM nor the shared buffer size. The checks use max_connections rather than the actual number of connections in use for some extra assurance.
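
As a purely illustrative worked example (the numbers are invented, not recommendations): suppose a machine with 32 GB of RAM, 2 GB of padding reserved for the operating system, shared_buffers set to 8 GB, roughly 4 GB of current memory use, and max_connections = 100. Then:

shared_buffers:    8 GB < 30 GB (RAM - padding)
per connection:    4 GB / 100 connections = 40 MB
150 connections:   40 MB * 150 = 6 GB, below both 30 GB and 8 GB

so raising max_connections to 150 passes all three checks.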

Reloading and Restarting

The configuration file can be reloaded in an active session, without a restart, from the psql command line or through pgAdmin with the query:

SELECT pg_reload_conf()

Reloading this way requires superuser privileges (or an explicit GRANT on the function), and it will not apply changes to the maximum number of connections or the buffer size. Any variable in the configuration file marked with the following comment requires restarting the database server:

# (change requires restart)

Both max_connections and shared_buffers change the server's memory footprint, which cannot be reworked on the fly. From the Linux command line, restart the server with the command:

./postgresql restart

On Ubuntu, the restart script normally resides at /etc/init.d.

If PostgreSQL is set up as a service, use:

service postgresql restart

Conclusion

In today’s world where data is bountiful, it is easy to exceed the connection limit in PostgreSQL. This article reviewed how to check the number of connections in use, how to raise the limit, and the considerations around the associated shared buffer size.


PostgreSQL for Converting NoSQL to SQL

Needing a way to prep data for ETL, and recognizing that Drill may not be as comprehensive as what I can build with PL/pgSQL (Drill only goes one layer deep), it was time to find a way to move from dynamically created jsonb in PostgreSQL to PostgreSQL relational tables.

The solution was this little function. It can be combined with jsonb_array_elements and other jsonb functions to quickly build up routines that delve deeper than Drill does. Add the promised multi-master replication and steadily improving distribution and threading to EnterpriseDB's growing list of accomplishments with Postgres, and there is little reason to reach for Drill.

breakoutNoSQL(inputTable text, outputTable text, jsonColumn text, otherColumns text[], condition text, splitData boolean)

Only otherColumns and condition may be null.

Code

CREATE OR REPLACE FUNCTION breakoutNoSQL(inputTable text,outputTable text,jsonColumn text,otherColumns text[],condition text,splitData boolean) RETURNS text[] AS
$BODY$
DECLARE
    k text;
    keys text[];
    stmt text;
    insertKeys text;
BEGIN
    IF outputTable IS NULL THEN
        RAISE EXCEPTION 'OUTPUT TABLE CANNOT BE NULL';	
    END IF;

    if inputTable IS NULL THEN
        RAISE EXCEPTION 'INPUT TABLE CANNOT BE NULL';
    END IF;

    --get the initial keys
    if condition IS NOT NULL THEN
       IF splitData IS TRUE THEN
	  execute 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys(jsonb_array_elements('||jsonColumn||'::jsonb))) as key FROM '||inputTable||') as q1 WHERE '||condition into keys;
       ELSE
	execute 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys('||jsonColumn||'::jsonb)) as key FROM '||inputTable||') as q1 WHERE '||condition into keys;
       END IF;
    else
       IF splitData IS TRUE THEN
	execute 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys(jsonb_array_elements('||jsonColumn||'::jsonb))) as key FROM '||inputTable||') as q1' into keys;
       ELSE
	execute 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys('||jsonColumn||'::jsonb)) as key FROM '||inputTable||') as q1' into keys;
       END IF;
    end if;

    IF keys IS NULL OR array_length(keys,1) = 0 THEN
	RAISE EXCEPTION 'NUMBER OF DISCOVERED KEYS WAS 0';
    END IF;

    --build the statement
    stmt = 'CREATE TABLE '||outputTable||' AS SELECT ';

    --build the insert keys statement 
    insertKeys = NULL;
    FOREACH k IN ARRAY keys LOOP
      if insertKeys IS NULL THEN
         insertKeys = '';
      else
         insertKeys = insertKeys||',';
      end if;
      insertKeys = insertKeys||'btrim(cast('||'j'||jsonColumn||'::jsonb->'||''''||k||''''||'as text),''"'') as '||k;
    END LOOP;

    if otherColumns IS NOT NULL THEN
	FOREACH k IN ARRAY otherColumns LOOP
           if insertKeys IS NULL THEN
            insertKeys = '';
           else
             insertKeys = insertKeys||',';
           end if;  
           insertKeys = insertKeys||k;
       END LOOP;
     END IF;
     	
    --concat to make full statement
    stmt = stmt||' '||insertKeys||' FROM '||' (SELECT *,';
    IF splitData IS TRUE THEN
      stmt = stmt||'jsonb_array_elements('||jsonColumn||'::jsonb) as j'||jsonColumn||' FROM '||inputTable||') as q1';
    ELSE
      stmt = stmt||jsonColumn||' as j'||jsonColumn||' FROM '||inputTable||') as q1';
    END IF;

    --print and execute the statement
    RAISE NOTICE 'QUERY: %',stmt;
    execute stmt;
    
    --return the keys from json
    return keys;
END;
$BODY$
LANGUAGE plpgsql;
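
A hypothetical invocation follows; the staging table raw_docs, its jsonb column doc, and the carried-over id column are placeholders, and splitData is false because doc is assumed to hold a single object rather than an array:

SELECT breakoutNoSQL('raw_docs', 'docs_flat', 'doc', ARRAY['id'], NULL, false);

This creates docs_flat with one text column per discovered key plus id and returns the array of keys that were found.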

Tune Postgres for Faster Inserts

Insertions in PostgreSQL can take a while, especially for large amounts of data. One project of mine at work relies on large batch updates, while another issues single insert statements where those are preferable. There are ways to tune PostgreSQL to make sure you are getting the most out of your database.

These variables live in postgresql.conf.

  • Tune WAL (write-ahead logging): set wal_level to minimal and pick a sensible wal_writer_delay (a sketch of applying these settings follows below). When WAL writes are frequent and the content is large, this can greatly affect a server. In my case, 10,000-record inserts went from 30 minutes to 2 seconds after extending wal_writer_delay from effectively per-transaction (0 seconds) to 100 milliseconds; the stock default delay is 200 milliseconds.
  • Increase thread limits: higher thread limits help large transactions. A 10,000-record batch update in my case went from 15 minutes to 3 seconds when the thread limit went from 10,000 to 50,000 threads.

Be aware that other configuration settings will also affect your system, and that WAL is necessary for replication: WAL segments are replayed to recreate server state on a hot standby.
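
A minimal sketch of applying the WAL-related settings above, assuming PostgreSQL 9.4 or newer where ALTER SYSTEM is available; the values are illustrative only:

ALTER SYSTEM SET wal_level = 'minimal';        -- incompatible with replication; requires a restart
ALTER SYSTEM SET wal_writer_delay = '100ms';   -- can be applied with a configuration reload
SELECT pg_reload_conf();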

Lack of Nested and Implicit Data Support in Drill, PostgreSQL and Pentaho when Working with JSON Data

JSON is great. It can contain a variety of data types in an expected format, and it is becoming easier and easier to work with, which makes it a future workhorse for NoSQL-based ETL. However, not least because NoSQL ingestion needs to end up in relational tables that follow SQL standards, there is still one gap to work out: ingestion of JSON will not break out nested structures and requires direct knowledge of the data to complete the task.

This may seem petty, but when millions of records are being read, it clearly is not.

In Drill, this could potentially be overcome by creating a table for every single sub-map we wish to analyze, but CREATE TABLE from the tool itself will not work. Instead, it is necessary to limit use cases to the data we actually want to use.

In PostgreSQL, it is possible to trim and concatenate JSON data using a query whose individual results can then be saved. It is also possible to ‘pop’ keys that are unneeded. However, this approach requires many different tables, roughly one per somewhat-normalized form, and it also requires recombining the data afterwards.


SELECT row_to_json(r.*)
FROM (SELECT nonJson,
             ((((data->'level1')::jsonb - 'bkey1')::jsonb - 'bkey2')::jsonb - 'bkey3')::jsonb AS jdata
      FROM table
      WHERE data::jsonb ?| array['mustHaveKey']
        AND (data::jsonb ?| array['notHaveKey']) IS FALSE) r

Drill is still well ahead of Pentaho and PostgreSQL in terms of conversion, though. PostgreSQL can guess types but has no function that attempts to generate tables automatically, and Pentaho requires explicit conversion as well.

Of course, if one already knows every key that will be present, this is not a problem. That, however, means more maintenance, since it is then impossible to write programs that automatically handle changes to the data. Perhaps implicit conversion will happen soon, but any argument about data type issues should really look at the depth of the standards and conform to them instead of complaining.

Building Indexes for Full Text on a Database

Text is difficult to store in a database. Math is much easier, since the core separating variables are usually far more distinguishable (e.g. form a primary key on age, income, and zip code). Text has the problem of being fuzzy and often containing multiple topics. There is an approach based on clustering which could alleviate most, if not all, of this problem. Using topic modeling, representative documents (as would be used to choose the initial K clusters in the K-Means++ algorithm), and the notion behind a blocking key (think blocks of objects), it is possible to develop a valid index, and even a primary key, for storing text documents as a whole in a database.

Topic Modelling

Topic modelling encompasses quite a few algorithms, from Bayes classifiers to neural nets, and can even use LDA. Given a block of documents and a set number of topics, LDA can basically form a generic model using singular value decomposition and some basic magic. If the document set is somewhat better defined, speed is important due to re-indexing, and documents are plentiful, training a naive Bayes model might be preferable; NLTK has a decent trainer for Python programmers. The advantage of Bayes is not having to choose a single category: a probability is determined for each category based on the ‘features’ provided, with the feature frequencies determining the probabilities. Still, with more hard-line modelling it is possible to compute cosines against a group's most representative document, or use other techniques, to develop a category ranking.

Either way, having the best category choice as an attribute is likely a good solution.

Choosing Representative Documents

A drawback to choosing a category is that the user or program may want to glean information not about the general category but about specific pieces. For this, individual clusterings within the grouping are useful.

It is possible to use the idea behind the K-means++ algorithm to find representative texts in each category. Start by finding the most representative document, as this will also be needed later. Next, compute the cosine similarities and find the document farthest from this one; that is the next representative document. Then take the average cosines, find the document farthest from those already chosen, and continue the process up to an optimal number of documents, perhaps four or five. Vectorize these documents for cosine analysis and save them in an appropriate table, likely with a lookup table of category names and keys and that key used as a foreign key here. PostgreSQL can store this semi-structured data with hstore or, since 9.4, with jsonb.
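
A minimal sketch of that storage layer, with hypothetical table and column names, using jsonb for the document vectors:

CREATE TABLE category (
    category_id serial PRIMARY KEY,
    name        text UNIQUE NOT NULL
);

CREATE TABLE representative_doc (
    doc_id      serial PRIMARY KEY,
    category_id integer NOT NULL REFERENCES category (category_id),
    vector      jsonb NOT NULL  -- term-to-weight map used for cosine comparisons
);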

The representative documents should then be used to compute the cosine of every appropriately topic-clustered document, and all of that data should be stored for the blocking key.

Building a Blocking Key

Building the blocking key is a crucial step. It speeds up indexing and querying by limiting large document sets to a manageable number for later analysis. It also has the added advantage of accommodating more or less data easily (again, see PostgreSQL's NoSQL capabilities). The key can be built by mashing together all of the discovered data, formatting numbers to a fixed length, or by a more involved process. Another option, still treating the data as a string, is to use something akin to Horner's hashing mechanism to generate a hash key. Key generation should avoid topic overlap as much as possible; think of the new table more like a hash table. Since cosine similarity is a vector measure with magnitude and direction but does not provide a true distance here, using distance for the blocking key is more difficult. ASN.1 can encode multiple strings, but simply concatenating the strings may produce a better numeric solution.
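
A purely illustrative sketch of a Horner-style hash over the concatenated blocking fields; the function name and modulus are arbitrary choices rather than part of any standard:

CREATE OR REPLACE FUNCTION horner_hash(block_text text) RETURNS bigint AS
$BODY$
DECLARE
    h bigint := 0;
    i integer;
BEGIN
    -- Horner's rule: fold each character into the running hash, kept in a bounded range
    FOR i IN 1..length(block_text) LOOP
        h := (h * 31 + ascii(substr(block_text, i, 1))) % 1000000007;
    END LOOP;
    RETURN h;
END;
$BODY$
LANGUAGE plpgsql IMMUTABLE;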

Messaging, ETL, and an Akka Proposal

Data sources are becoming many, and NoSQL can help aggregate multiple sources into a more coherent whole. Akka, which can split work across multiple machines, serves as a perfect way of writing distributed systems. Combining it with messaging via queues or topics and the master-slave pattern could provide a significant boost to ETL. Using databases as messaging systems, it is easy to see how processes can be kick-started. My goal is to create a highly concurrent system that takes data from a scraper (from any source, as can be done with my Python crawl modules), writes the data to a NoSQL-style jsonb store in PostgreSQL, and notifies a set of parsers which then look at patterns in the data to determine how to ETL it. This is not really revolutionary, but it is a good test of concurrency and automation.

Results will be reported.

Collection with NoSQL and Storage with SQL

There are four well-known forms of NoSQL database: key-value, document, column-family, and graph. In the case of ETL, key-value is a good way to expand data without worrying about what, if anything, is present. However, even in denormalized form this is not the best storage solution for customer-facing work, so the data will be placed into a client-facing database built on relational PostgreSQL tables.
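
A minimal sketch of that split, with hypothetical table names: a permissive jsonb staging table on the collection side and a fixed relational table on the client-facing side:

-- collection side: accept whatever the scrapers produce
CREATE TABLE staging_docs (
    id        bigserial PRIMARY KEY,
    payload   jsonb NOT NULL,
    loaded_at timestamptz NOT NULL DEFAULT now()
);

-- client-facing side: a fixed relational shape produced by the ETL step
CREATE TABLE customer_records (
    record_id  bigserial PRIMARY KEY,
    name       text NOT NULL,
    email      text,
    created_at timestamptz NOT NULL DEFAULT now()
);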

Messaging and Building Patterns for Akka and Scala

With messaging and state machines, the actual uses for an actor do not need to be known ahead of runtime. During runtime, interactions or patterns force the actor to take on a different role, which can be accomplished with a simple case/match statement. From here, a message with the data to be transformed can be passed to an actor. This data, tagged with a row ID, can then be parsed after an actor reads a message from a queue. The queue specifies conditions such as which parser-combinator to use, and the actor completes its activity based on that. This is not incredibly different from the routing slip pattern, except that no re-routing occurs.

The data would be aggregated using the available row IDs in batches of a certain size. Perhaps batch iterators would best do the trick of determining the size of each batch to process.

Returning Data back to the original Actor

Returning the data requires messaging as well. The message returns to the initial actor, where it needs to be matched with the appropriate row.

Recap

To recap, the question is: can Akka perform more generic ETL than currently available open source tools?

To test this question I am developing Akka ETL. The tool will take in scraped data (from processes that can be managed with the same messaging technique but are not easily distributed due to statefulness and security). The design includes taking in completed sources from a database, acquiring the data, messaging an actor with the appropriate parsing information, receiving the transformed data back from those actors, and posting it to a relational database.

The real tests will be maintaining de-duplication, keeping data from different sources unmixed, and preserving unique identifiers.