Checking and Increasing the Connection Limit in PostgreSQL

It finally happened. I blew the max connection limit in my new PostgreSQL install by starting too many grid nodes with associated connection pools for a system I am writing.

The default limit of 100 connections is far too few, especially with two people doing database-intensive tasks. This article explains how to check the number of connections in use as a standard user or administrator, and how the administrator can change the connection limit.

Default Connection Limit and Buffer Size

By default, PostgreSQL has a relatively low number of maximum allowed connections. The default limit is 100.

The limit is related to the size of the shared buffers, since connections consume server memory alongside them. The stock shared_buffers default is modest (128MB in recent releases); the installation described here uses 8000MB, as shown in the configuration excerpt below.

PostgreSQL is a versatile database. Amazon built Redshift on the system. However, the default limits are low, and arguably should remain so to discourage developers and students from abusing their system's resources.

Common Error When Exceeding Size Limit

With the amount of available data ever increasing and the demands on technology still growing, connection limits will be exceeded on many systems. PostgreSQL throws the following error when this occurs:

psql: FATAL: remaining connection slots are reserved for non-replication superuser connections

In my case, the connection pool failed to reserve the ten to twenty connections required per database-intensive job and instantly released whatever connections it did acquire back to the database. This left me with 96 connections in use and four that were unattainable. Some of the 96 connections were held by an ETL tool or pgAdmin.

Querying the Connection Limit

It is possible for non-administrators to check the number of connections in use with the following query:

SELECT sum(numbackends) FROM pg_stat_database

This query reads the per-database statistics view and sums numbackends, the number of backends (connections) currently open against each database.
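If per-session detail is preferred, the pg_stat_activity view exposes one row per server process (newer releases may also include a few background workers in the total), so counting its rows gives a similar figure:

SELECT count(*) FROM pg_stat_activity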

Administrators, and in fact any user, can query the configured limit itself from the psql command line using:

SHOW max_connections

This command asks the database to return the value of the max_connections configuration variable described below.

Most configuration variables can be inspected from psql, and some can even be changed from the console with SET. The max_connections variable cannot be changed this way because raising or lowering it changes the server's memory usage; it must be set in the configuration file and picked up at restart.
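For example, from a psql prompt either value below can be inspected by any user, while attempting to SET max_connections in a session will simply be rejected:

SHOW max_connections
SHOW shared_buffers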

Setting Connection Related Variables

While the initial settings are low by today's standards, they can be changed in your postgresql.conf file. The file is typically found at a location such as:

/var/lib/pgsql/X.X/data/postgresql.conf or /usr/local/postgres/postgresql.conf

The default location is the data directory under your PostgreSQL installation.
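If you are unsure where the file lives on a particular install, the server itself will report the path from psql:

SHOW config_file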

Look for the following variables in the file:

shared_buffers = 8000MB
max_connections = 100

Choosing Size Limits

It is important to choose limits that are consistent with the size of the shared buffers. Dividing current memory use by the number of connections in use (or by the maximum number allotted) gives a rough over-estimate of memory per connection. Ensure that this figure, multiplied by the desired number of connections, does not exceed the shared buffer size or the capacity of your RAM. The shared buffer size itself should generally be less than the amount of RAM in the machine.

The inequality to check is:

shared_buffers < (RAM - padding)
  && (current_ram_use / max_connections) * desired_connections < (RAM - padding)
  && (current_ram_use / max_connections) * desired_connections < shared_buffers

Basically, ensure that you have enough RAM and that the potential memory use exceeds neither RAM nor the shared buffer size. The inequality uses the max_connections setting instead of the actual number of connections in use for some extra headroom.
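As a hypothetical walk-through (the numbers are assumptions, not measurements from the system above): with 16GB of RAM and 1GB of padding, a server currently using 2GB at max_connections = 100 suggests roughly 20MB per connection, so raising the limit to 300 connections implies about 6GB:

shared_buffers (8GB) < 16GB - 1GB
(2GB / 100) * 300 = 6GB < 16GB - 1GB
6GB < 8GB

All three checks pass, so the increase is at least plausible.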

Reloading and Restarting

The configuration file can be reloaded in an active session, without a restart, from the SQL console or through pgAdmin with the query:

SELECT pg_reload_conf()

By default this function is restricted to superusers. It also will not apply settings that require a full restart, such as the maximum number of allotted connections or the buffer size. Any variable in the configuration file marked with the following comment requires restarting the database server:

# (change requires restart)

Both max_connections and shared_buffers cause a change in memory usage which cannot be reworked on the fly. From the Linux command line, restart the server with the command:

./postgresql restart

On Ubuntu, the restart script normally resides at /etc/init.d.

If PostgreSQL is set up as a service, use:

service postgresql restart
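On newer Ubuntu releases that use systemd, the equivalent is typically:

sudo systemctl restart postgresql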

Conclusion

In today's world where data is bountiful, it is easy to exceed the connection limit in PostgreSQL. This article reviewed how to check and raise the connection limit, along with considerations for choosing the limit and the associated shared buffer size.


PostgreSQL Faster Way to Check Column Existence

We often need to check that a column exists before creating it, especially when dealing with dynamically generated data being inserted into PostgreSQL. This brief article shows how to perform the check quickly and offers a PL/pgSQL solution for adding new columns in databases older than PostgreSQL 9.6.

Querying The Catalog

There are several ways to search the PostgreSQL catalog tables. The easiest for the programmer is to use the information_schema. The fastest is to build a custom query.

While it is enticing to query the information schema, it is not a fast operation. Multiple nested loops are created to pull from the catalog tables even when looking for a single attribute.

The following query on the information_schema results in running a rather large set of operations nested within several loops:

EXPLAIN SELECT count(*) > 0 FROM information_schema.columns WHERE table_schema LIKE 'us_ut_sor' AND table_name LIKE 'dirtyrecords' AND column_name LIKE 'bullage'

When run, the output is as follows:

"Aggregate  (cost=3777.32..3777.33 rows=1 width=0)"
"  ->  Nested Loop Left Join  (cost=2.39..3777.32 rows=1 width=0)"
"        ->  Nested Loop  (cost=1.83..3775.73 rows=1 width=4)"
"              ->  Nested Loop Left Join  (cost=1.54..3775.42 rows=1 width=8)"
"                    Join Filter: (t.typtype = 'd'::"char")"
"                    ->  Nested Loop  (cost=0.84..3774.50 rows=1 width=13)"
"                          ->  Nested Loop  (cost=0.42..3770.19 rows=1 width=8)"
"                                ->  Nested Loop  (cost=0.00..3740.84 rows=1 width=8)"
"                                      Join Filter: (c.relnamespace = nc.oid)"
"                                      ->  Seq Scan on pg_namespace nc  (cost=0.00..332.06 rows=1 width=4)"
"                                            Filter: ((NOT pg_is_other_temp_schema(oid)) AND (((nspname)::information_schema.sql_identifier)::text ~~ 'us_ut_sor'::text))"
"                                      ->  Seq Scan on pg_class c  (cost=0.00..3407.26 rows=121 width=12)"
"                                            Filter: ((relkind = ANY ('{r,v,f}'::"char"[])) AND (((relname)::information_schema.sql_identifier)::text ~~ 'dirtyrecords'::text))"
"                                ->  Index Scan using pg_attribute_relid_attnum_index on pg_attribute a  (cost=0.42..29.35 rows=1 width=14)"
"                                      Index Cond: ((attrelid = c.oid) AND (attnum > 0))"
"                                      Filter: ((NOT attisdropped) AND (((attname)::information_schema.sql_identifier)::text ~~ 'bullage'::text) AND (pg_has_role(c.relowner, 'USAGE'::text) OR has_column_privilege(c.oid, attnum, 'SELECT, INSERT, UPDATE, REFE (...)"
"                          ->  Index Scan using pg_type_oid_index on pg_type t  (cost=0.42..4.30 rows=1 width=13)"
"                                Index Cond: (oid = a.atttypid)"
"                    ->  Nested Loop  (cost=0.70..0.90 rows=1 width=4)"
"                          ->  Index Scan using pg_type_oid_index on pg_type bt  (cost=0.42..0.58 rows=1 width=8)"
"                                Index Cond: (t.typbasetype = oid)"
"                          ->  Index Only Scan using pg_namespace_oid_index on pg_namespace nbt  (cost=0.29..0.31 rows=1 width=4)"
"                                Index Cond: (oid = bt.typnamespace)"
"              ->  Index Only Scan using pg_namespace_oid_index on pg_namespace nt  (cost=0.29..0.31 rows=1 width=4)"
"                    Index Cond: (oid = t.typnamespace)"
"        ->  Nested Loop  (cost=0.56..1.57 rows=1 width=4)"
"              ->  Index Scan using pg_collation_oid_index on pg_collation co  (cost=0.28..0.35 rows=1 width=72)"
"                    Index Cond: (a.attcollation = oid)"
"              ->  Index Scan using pg_namespace_oid_index on pg_namespace nco  (cost=0.29..1.21 rows=1 width=68)"
"                    Index Cond: (oid = co.collnamespace)"
"                    Filter: ((nspname  'pg_catalog'::name) OR (co.collname  'default'::name))"

This is truly nasty. In fact, any program running in O(n^2) or larger time will be less than ideal in this situation.

The work can be reduced by querying the catalog tables directly. The previous query was merely checking whether a column exists under a given table and schema. The following custom query performs the same check much faster:

EXPLAIN SELECT count(*) > 0 FROM (SELECT q1.oid,q1.relname,q1.relowner,q1.relnamespace,q2.nspname FROM (SELECT oid,relname,relowner,relnamespace FROM pg_class) as q1 INNER JOIN (SELECT oid, * FROM pg_catalog.pg_namespace) as q2 ON q1.relnamespace = q2.oid WHERE q1.relname LIKE 'dirtyrecords' AND q2.nspname LIKE 'us_ut_sor') as oq1 INNER JOIN (SELECT attrelid,attname FROM pg_attribute) as oq2 ON oq1.oid = oq2.attrelid WHERE oq2.attname LIKE 'bullage'

While the query text is longer, far fewer operations are performed and the estimated cost is much lower:

"Aggregate  (cost=292.44..292.45 rows=1 width=0)"
"  ->  Nested Loop  (cost=0.84..292.43 rows=1 width=0)"
"        ->  Nested Loop  (cost=0.42..289.64 rows=1 width=4)"
"              ->  Seq Scan on pg_namespace  (cost=0.00..281.19 rows=1 width=4)"
"                    Filter: (nspname ~~ 'us_ut_sor'::text)"
"              ->  Index Scan using pg_class_relname_nsp_index on pg_class  (cost=0.42..8.44 rows=1 width=8)"
"                    Index Cond: ((relname = 'dirtyrecords'::name) AND (relnamespace = pg_namespace.oid))"
"                    Filter: (relname ~~ 'dirtyrecords'::text)"
"        ->  Index Only Scan using pg_attribute_relid_attnam_index on pg_attribute  (cost=0.42..2.79 rows=1 width=4)"
"              Index Cond: ((attrelid = pg_class.oid) AND (attname = 'bullage'::name))"
"              Filter: (attname ~~ 'bullage'::text)"

 

Notice that the estimated cost of the first query was 3777.32 while the second was merely 292.44, better by more than an order of magnitude.

PL/pgSQL Function

For database versions prior to PostgreSQL 9.6, which introduced the syntax ALTER TABLE x ADD COLUMN IF NOT EXISTS y TYPE, the following PL/pgSQL function performs the desired table alteration:

 

CREATE OR REPLACE FUNCTION add_column_if_not_exists(schema_name varchar(63), table_name varchar(63), column_name varchar(63), column_type varchar(1024)) RETURNS void AS
$BODY$
DECLARE
    column_exists BOOLEAN;
BEGIN
    IF schema_name IS NOT NULL THEN
        -- check the catalog directly, matching schema, table, and column
        SELECT count(*) > 0 INTO column_exists FROM (SELECT q1.oid,q1.relname,q1.relowner,q1.relnamespace,q2.nspname FROM (SELECT oid,relname,relowner,relnamespace FROM pg_class) as q1 INNER JOIN (SELECT oid, * FROM pg_catalog.pg_namespace) as q2 ON q1.relnamespace = q2.oid WHERE q1.relname LIKE table_name AND q2.nspname LIKE schema_name) as oq1 INNER JOIN (SELECT attrelid,attname FROM pg_attribute) as oq2 ON oq1.oid = oq2.attrelid WHERE oq2.attname LIKE column_name;
        IF column_exists IS FALSE THEN
            EXECUTE 'ALTER TABLE '||schema_name||'.'||table_name||' ADD COLUMN '||column_name||' '||column_type;
        END IF;
    ELSE
        -- no schema supplied, so skip the join against pg_namespace
        SELECT count(*) > 0 INTO column_exists FROM (SELECT oid,relname,relowner,relnamespace FROM pg_class WHERE relname LIKE table_name) as oq1 INNER JOIN (SELECT attrelid,attname FROM pg_attribute) as oq2 ON oq1.oid = oq2.attrelid WHERE oq2.attname LIKE column_name;
        IF column_exists IS FALSE THEN
            EXECUTE 'ALTER TABLE '||table_name||' ADD COLUMN '||column_name||' '||column_type;
        END IF;
    END IF;
END;
$BODY$
LANGUAGE plpgsql;
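A sample invocation, using the hypothetical schema, table, and column names from the earlier examples:

SELECT add_column_if_not_exists('us_ut_sor', 'dirtyrecords', 'bullage', 'text')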

We did not create a trigger that fires on all ALTER statements, to avoid adding cost where it is not wanted. The provided function also avoids a costly join when no schema is supplied.
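For reference, on PostgreSQL 9.6 and later the built-in syntax mentioned above handles the simple case directly, again with the example names:

ALTER TABLE us_ut_sor.dirtyrecords ADD COLUMN IF NOT EXISTS bullage text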

Conclusion

In this article, we discovered that the information schema is not as ideal as it seems. Armed with this information, we created a better function to add columns to a table only if they do not exist.

PostgreSQL for Converting NoSQL to SQL

In need of a solution to prep data for ETL, and recognizing that Drill may not be as comprehensive as what I can build with PL/pgSQL since it only goes one layer deep, it was time to find a way to move from dynamically created jsonb in PostgreSQL to PostgreSQL relational tables.

The solution was this little function. It can be extended with jsonb_array_elements and other jsonb functions to easily and quickly build routines that delve deeper than Drill. Add upcoming master replication and steadily improving distribution and threading to EnterpriseDB's growing list of accomplishments with Postgres, and it is hard to see why one would use Drill at all.

breakoutNoSQL(inputTable text, outputTable text, jsonColumn text, otherColumns text[], condition text, splitData boolean)

Only otherColumns and condition may be null.

Code

CREATE OR REPLACE FUNCTION breakoutNoSQL(inputTable text,outputTable text,jsonColumn text,otherColumns text[],condition text,splitData boolean) RETURNS text[] AS
$BODY$
DECLARE
    k text;
    keys text[];
    stmt text;
    insertKeys text;
BEGIN
    IF outputTable IS NULL THEN
        RAISE EXCEPTION 'OUTPUT TABLE CANNOT BE NULL';	
    END IF;

    if inputTable IS NULL THEN
        RAISE EXCEPTION 'INPUT TABLE CANNOT BE NULL';
    END IF;

    --get the initial keys
    if condition IS NOT NULL THEN
       IF splitData IS TRUE THEN
	  execute 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys(jsonb_array_elements('||jsonColumn||'::jsonb))) as key FROM '||inputTable||') as q1 WHERE '||condition into keys;
       ELSE
	execute 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys('||jsonColumn||'::jsonb)) as key FROM '||inputTable||') as q1 WHERE '||condition into keys;
       END IF;
    else
       IF splitData IS TRUE THEN
	execute 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys(jsonb_array_elements('||jsonColumn||'::jsonb))) as key FROM '||inputTable||') as q1' into keys;
       ELSE
	execute 'SELECT array_agg(key) FROM (SELECT distinct(jsonb_object_keys('||jsonColumn||'::jsonb)) as key FROM '||inputTable||') as q1' into keys;
       END IF;
    end if;

    IF keys IS NULL OR array_length(keys,1) = 0 THEN
	RAISE EXCEPTION 'NUMBER OF DISCOVERED KEYS WAS 0';
    END IF;

    --build the statement
    stmt = 'CREATE TABLE '||outputTable||' AS SELECT ';

    --build the insert keys statement 
    insertKeys = NULL;
    FOREACH k IN ARRAY keys LOOP
      if insertKeys IS NULL THEN
         insertKeys = '';
      else
         insertKeys = insertKeys||',';
      end if;
      insertKeys = insertKeys||'btrim(cast('||'j'||jsonColumn||'::jsonb->'||''''||k||''''||' as text),''"'') as '||k;
    END LOOP;

    if otherColumns IS NOT NULL THEN
	FOREACH k IN ARRAY otherColumns LOOP
           if insertKeys IS NULL THEN
            insertKeys = '';
           else
             insertKeys = insertKeys||',';
           end if;  
           insertKeys = insertKeys||k;
       END LOOP;
     END IF;
     	
    --concat to make full statement
    stmt = stmt||' '||insertKeys||' FROM '||' (SELECT *,';
    IF splitData IS TRUE THEN
      stmt = stmt||'jsonb_array_elements('||jsonColumn||'::jsonb) as j'||jsonColumn||' FROM '||inputTable||') as q1';
    ELSE
      stmt = stmt||jsonColumn||' as j'||jsonColumn||' FROM '||inputTable||') as q1';
    END IF;

    --print and execute the generated statement
    RAISE NOTICE 'QUERY: %',stmt;
    execute stmt;
    
    --return the keys from json
    return keys;
END;
$BODY$
LANGUAGE plpgsql;
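A hypothetical invocation, assuming a staging table raw_docs with a jsonb column named doc, an id column to carry across, no filter condition, and no array splitting (all of these names are illustrative, and the json keys are assumed to be valid column identifiers):

SELECT breakoutNoSQL('raw_docs', 'docs_flat', 'doc', ARRAY['id'], NULL, false)

The call creates docs_flat with one column per discovered key plus id, and returns the array of keys it found.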

Tune Postgres for Faster Inserts

Insertions in PostgreSQL can take a while, especially for large amounts of data. One project of mine at work performs large batch updates, while another issues single insert statements where that pattern is preferable. There are ways to tune PostgreSQL to make sure that you are getting the most out of your database.

These variables live in postgresql.conf.

  • Tune WAL (write-ahead logging) by setting wal_level to minimal and choosing a sensible wal_writer_delay: when WAL flushes are frequent and the content is large, this can greatly affect a server. In my case, 10,000-record inserts went from 30 minutes to 2 seconds after extending wal_writer_delay from effectively per-transaction (0 seconds) to 100 milliseconds; the default delay is 200 milliseconds. A sample configuration sketch follows this list.
  • Increase thread limits: increased thread limits help large transactions. In my case, a 10,000-record batch update went from 15 minutes to 3 seconds when the thread limit went from 10,000 to 50,000 threads.
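A minimal postgresql.conf sketch for the WAL suggestions above; the values are illustrative assumptions rather than universal recommendations, and note that wal_level = minimal also requires max_wal_senders = 0 and rules out streaming replication:

wal_level = minimal             # less WAL detail; incompatible with replication
max_wal_senders = 0             # required when wal_level is minimal
wal_writer_delay = 100ms        # default is 200ms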

Be aware that other settings will also affect your system, and that WAL logging is necessary for replication, since WAL records are used to recreate server state on a standby in a failover environment.