Automating Django Database Re-Creation on PostgreSQL



The Django database migration system is a mess. This mess is made more difficult when using PostgreSQL.

Schemas go unrecognized without a workaround, indices can break when using that workaround, and processes are difficult to automate with the existing code base.

The authors of Django, or someone such as myself who occasionally finds the time to dip in and unclutter a mess or two, will eventually resolve this. However, the migration system today is a train wreck.

My current solution to migrating a Postgres based Django application is presented in this article.

Benefits of Getting Down and Dirty

The benefits of taking the time to master this process for your own projects are not limited to automating Django; they include easily managing migrations and even automating builds.

Imagine having a DDL that is defined in your application and changed with one line of code without ever touching your database.

Django is a powerful and concurrent web framework with an enormous amount of add-ons and features. Mastering this task can open your code base to amazing amounts of abstraction and potential.

Well Defined Problems

While Django is powerful, its integration with Postgres is horrible. A schema detection failure can cause a multitude of problems, as can issues related to using multiple databases. Combined, these problems sap hours of valuable time.

The most major issues are:

  • lack of migration support for multiple databases
  • lack of schema support (an issue recognized over 13 years ago)
  • some indices (PostGIS polygon ids here) break when using the schema workaround

Luckily, the solutions for these problems are simple and require hacking only a few lines of your configuration.

Solving the Multiple Database Issue

This issue is more annoying than it is a major problem. Simply obtain a list of your applications and, when migrating, use the database names as defined by your database router.
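As a sketch of the routing piece (the app labels and database aliases below are placeholders of my own, not from any particular project), a minimal Django-style database router might look like:

```python
# Minimal database router sketch. The app-to-database mapping is
# hypothetical; substitute your own app labels and database aliases.
APP_TO_DB = {
    "auth": "auth",
    "sessions": "auth",
    "simplredapp": "data",
}


class SimpleRouter:
    """Route reads, writes, and migrations to the mapped database."""

    def db_for_read(self, model, **hints):
        return APP_TO_DB.get(model._meta.app_label, "default")

    def db_for_write(self, model, **hints):
        return APP_TO_DB.get(model._meta.app_label, "default")

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Only allow an app to migrate on the database it is mapped to.
        return APP_TO_DB.get(app_label, "default") == db
```

The router is then listed in DATABASE_ROUTERS in settings, and the aliases it returns must match the keys of the DATABASES setting.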

If there are migrations you want to reset, delete the migration files themselves rather than relying on the migrate command, and make sure to drop the affected tables from your database without deleting any required schemas.

Using python[version] manage.py migrate [app] zero did not actually drop and recreate my tables for some reason.

For thoroughness, discover your applications using:

python[version] manage.py showmigrations

With your applications defined, make the migrations and run the second line of code in an appropriate order for each application:

python[version] manage.py makemigrations [app]
python[version] manage.py migrate [app] --database=[database]

As promised, the solution is simple. The --database switch must match a database name in your configuration; it must be present, as Django only recognizes your default database when it is not. Without it, Django can mark migrations as applied without actually performing them against the intended database.

Solving the Schema Problem without Breaking PostGIS Indices

Django offers terrific Postgres support, including support for PostGIS. However, Postgres schemas are not supported. Tickets were opened over 13 years ago but were merged and forgotten.

To ensure that schemas are read properly, set up a database router and add a database configuration utilizing the following template:

"default": {
        "ENGINE": "django.contrib.gis.db.backends.postgis",
        "NAME": "test",
        "USER": "test",
        "PASSWORD": "test",
        "HOST": "",
        "PORT": "5432",
        "OPTIONS": {
            "options": "-csearch_path=my_schema,public"

The configuration appends a set of options to your DSN, setting the search path to a schema plus a fallback schema used when the first is not present. Note the lack of spacing between -c and search_path.

Now that the database is configured and a database router is established, add a Meta class like the following to each model:

    class Meta:
        managed = True
        db_table = 'my_schema"."my_table'

Notice the awkward value for db_table. This is due to the way Django quotes table names: the embedded quotes cause the rendered SQL to read "my_schema"."my_table". It is possible to leave managed set to True, allowing Django to perform migrations, as long as the database is cleaned up a bit.

If any indices break on migration at this point, simply drop the schema from the table definition and use whichever schema the table ends up in. There is no apparent workaround for this.
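A tiny helper makes the quoting trick above less error-prone. This is a sketch of my own; the function name is not a Django API:

```python
def schema_table(schema: str, table: str) -> str:
    """Build the escaped db_table value needed for a schema-qualified
    table. Django wraps db_table in double quotes, so embedding '"."'
    makes the rendered SQL read "schema"."table"."""
    return f'{schema}"."{table}'


# Used inside a model's Meta class:
#     class Meta:
#         db_table = schema_table("my_schema", "my_table")
```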

Now Your Migration Can Be Run In a Script, Even in a CI Tool

After quite a bit of fiddling, I found that it is possible to script and thus automate database builds. This is incredibly useful for testing.

My script, written to recreate a test environment, included a few direct SQL statements as well as calls to manage.py:

echo "DROP SCHEMA IF EXISTS auth CASCADE" | python3 ./app_repo/simplred/ dbshell --settings=test_settings
echo "DROP SCHEMA IF EXISTS filestorage CASCADE" | python3 ./app_repo/simplred/ dbshell --settings=test_settings
echo "DROP SCHEMA IF EXISTS licensing CASCADE" | python3 ./app_repo/simplred/ dbshell --settings=test_settings
echo "DROP SCHEMA IF EXISTS simplred CASCADE" | python3 ./app_repo/simplred/ dbshell --settings=test_settings
echo "DROP SCHEMA IF EXISTS fileauth CASCADE" | python3 ./app_repo/simplred/ dbshell --settings=test_settings
echo "DROP SCHEMA IF EXISTS licenseauth CASCADE" | python3 ./app_repo/simplred/ dbshell --settings=test_settings
echo "DROP OWNED BY simplrdev" | python3 ./app_repo/simplred/ dbshell --settings=test_settings
echo "CREATE SCHEMA IF NOT EXISTS auth" | python3 ./app_repo/simplred/ dbshell --settings=test_settings
echo "CREATE SCHEMA IF NOT EXISTS filestorage" | python3 ./app_repo/simplred/ dbshell --settings=test_settings
echo "CREATE SCHEMA IF NOT EXISTS licensing" | python3 ./app_repo/simplred/ dbshell --settings=test_settings
echo "CREATE SCHEMA IF NOT EXISTS simplred" | python3 ./app_repo/simplred/ dbshell --settings=test_settings
echo "CREATE SCHEMA IF NOT EXISTS licenseauth" | python3 ./app_repo/simplred/ dbshell --settings=test_settings
echo "CREATE SCHEMA IF NOT EXISTS fileauth" | python3 ./app_repo/simplred/ dbshell --settings=test_settings
find ./app_repo -path "*/migrations/*.py" -not -name "" -delete
find ./app_repo -path "*/migrations/*.pyc"  -delete
python3 ./app_repo/simplred/ makemigrations --settings=test_settings
python3 ./app_repo/simplred/ migrate auth --settings=test_settings --database=auth
python3 ./app_repo/simplred/ migrate admin --settings=test_settings --database=auth
python3 ./app_repo/simplred/ migrate registration --settings=test_settings --database=auth
python3 ./app_repo/simplred/ migrate sessions --settings=test_settings --database=auth
python3 ./app_repo/simplred/ migrate oauth2_provider --settings=test_settings --database=auth
python3 ./app_repo/simplred/ migrate auth_middleware --settings=test_settings --database=auth
python3 ./app_repo/simplred/ migrate simplredapp --settings=test_settings --database=data
python3 ./app_repo/simplred/ loaddata --settings=test_settings --database=data fixture

Assuming the appropriate security measures are followed, this script works well in a Bamboo job.  My script drops and recreates any necessary database components as well as clears migrations and then creates and performs migrations. Remember, this script recreates a test environment.
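The drop-and-create boilerplate at the top of that script can also be generated from a schema list. A sketch (the schema names come from the script above; the helper itself is mine):

```python
# Schemas used in the test-environment script above.
SCHEMAS = ["auth", "filestorage", "licensing", "simplred", "fileauth", "licenseauth"]


def rebuild_statements(schemas, owner=None):
    """Generate the DROP/CREATE SQL used to reset a test database.

    Each returned statement can be piped into `manage.py dbshell`
    exactly as the shell script above does with echo.
    """
    stmts = [f"DROP SCHEMA IF EXISTS {s} CASCADE" for s in schemas]
    if owner:
        stmts.append(f"DROP OWNED BY {owner}")
    stmts += [f"CREATE SCHEMA IF NOT EXISTS {s}" for s in schemas]
    return stmts
```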

The running Django application, which is updated via a webhook, doesn’t actually break when I do this. I now have a completely automated test environment predicated on merely merging pull requests into my test branch and ignoring any migrations folders through .gitignore.


PostgreSQL is powerful, as is Django, but combining the two requires some finagling to achieve maximum efficiency. Here, we examined a few issues encountered when using Django and Postgres together and discussed how to solve them.

We discovered that it is possible to automate database migration without too much effort if you already know the hacks that make this process work.


Checking and Increasing the Connection Limit in PostgreSQL

It finally happened. I blew the max connection limit in my new PostgreSQL install by starting too many grid nodes with associated connection pools for a system I am writing.

The default limit of 100 connections is far too few, especially with two people doing database intensive tasks. This article explains how to check the number of connections used as a standard user and administrator and how the administrator can change the connection limit.

Default Connection Limit and Buffer Size

By default, PostgreSQL has a relatively low number of maximum allowed connections. The default limit is 100.

The limit is related to the size of the shared buffers: connections consume memory from the shared buffers, and the default shared_buffers setting is deliberately small (128 megabytes in recent releases).

PostgreSQL is a versatile database; even Amazon Redshift traces its lineage to Postgres. The default limits are low, however, and deliberately so, to discourage developers and students from exhausting their system's resources.

Common Error When Exceeding Size Limit

With an ever increasing amount of data available and still growing need for technology, connection limits will be exceeded on many systems. PostgreSQL throws the following error when this occurs:

psql: FATAL: remaining connection slots are reserved for non-replication superuser connections

In my case, the connection pool failed to reserve the ten to twenty connections required per highly database-intensive job and instantly released whatever connections it did acquire back to the database. This left me with 96 connections in use and four reserved slots that were unattainable; some of the 96 were held by an ETL tool and PG Admin.

Querying the Connection Limit

It is possible for non-administrators to check the number of connections in use with the following query:

SELECT sum(numbackends) FROM pg_stat_database;

This query reads the per-database statistics view and totals the backends (connections) currently attached to each database.

Administrators have the option of querying connections from the psql command line using:

SHOW max_connections

This query asks the database to return the value of the max_connections configuration variable described below.

Most configuration variables can be queried in psql. Some can even be set from the command line console. The max_connections variable cannot be set from the command line due to an associated increase or decrease in memory usage.

Setting Connection Related Variables

While the initial settings are low by today's standards, they can be changed in your postgresql.conf file. The file is typically found at a location such as:

/var/lib/pgsql/X.X/data/postgresql.conf  or /usr/local/postgres/postgresql.conf

The default location is the data directory under your PostgreSQL installation.

Look for the following variables in the file:

shared_buffers = 8000MB
max_connections = 100
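As a quick sketch, those two values can be pulled out of the file programmatically. This is a minimal parser of my own for illustration; it ignores quoting and include directives:

```python
def read_settings(conf_text, keys=("shared_buffers", "max_connections")):
    """Extract simple key = value settings from postgresql.conf text,
    skipping comments. Returns a dict of the requested keys."""
    found = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip trailing comments
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in keys:
            found[key] = value.strip()
    return found
```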

Choosing Size Limits

It is important to choose limits relative to the size of the shared buffers. Dividing current RAM usage by the number of connections in use (or by the maximum number allotted) gives a rough over-estimate of memory use per connection. Ensure that this per-connection estimate, multiplied by the desired number of connections, does not exceed the shared buffer size or the capacity of your RAM. The shared buffer size itself should generally be well below the total RAM in your machine.

The inequalities to check are:

shared_buffers < (RAM - padding)
(current_ram_use / max_connections) * desired_connections < (RAM - padding)
(current_ram_use / max_connections) * desired_connections < shared_buffers

Basically, ensure that you have enough RAM and that the potential memory use does not exceed RAM and the shared buffer size. This inequality uses the max_connections settings instead of the actual number of used connections for some extra assurance.
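As a sketch, the checks translate directly into code; all sizes share one unit (megabytes here), and the figures in the usage comment are made up:

```python
def connection_limits_ok(ram, padding, shared_buffers,
                         current_ram_use, max_connections,
                         desired_connections):
    """Check the three inequalities: shared buffers fit in RAM, and the
    projected per-connection memory use fits in both RAM and the
    shared buffers."""
    per_conn = current_ram_use / max_connections  # over-estimate per connection
    projected = per_conn * desired_connections
    return (shared_buffers < ram - padding
            and projected < ram - padding
            and projected < shared_buffers)
```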

Reloading and Restarting

The configuration file can be reloaded in an active session without a restart using the sql command line console or through PG Admin with the query:

SELECT pg_reload_config()

This function requires sufficient privileges (superuser by default) and will not take effect for the maximum number of allotted connections or the buffer size. Any variable in the configuration file accompanied by the following comment requires restarting the database server:

# (change requires restart)

Both max_connections and shared_buffers cause a change in memory usage which cannot be reworked on the fly. From the Linux command line, restart the server with the command:

/etc/init.d/postgresql restart

On Ubuntu, the restart script normally resides at /etc/init.d.

If PostgreSQL is setup as a service, use:

service postgresql restart


In today’s world where data is bountiful, it is easy to exceed the connection limit in PostgreSQL. This article reviewed how to reset the connection limit and considerations for resetting the limit and associated shared buffer size.

Tune Postgres for Faster Inserts

Insertions in PostgreSQL can take a while, especially for large amounts of data. One project of mine at work sees large batch updates while another issues single insert statements where those are preferable. There are ways to tune PostgreSQL to make sure you are getting the most out of your database.


These variables should be in postgresql.conf.

  • Tune write-ahead logging (WAL): set wal_level to minimal and ensure a decent wal_writer_delay. When WAL flushes are frequent and content is large, this can greatly affect servers. In my case, 10,000 record inserts went from 30 minutes to 2 seconds by extending wal_writer_delay from effectively per-transaction (0 seconds) to 100 milliseconds; the default delay is 200 milliseconds.
  • Increase thread limits: increased thread limits help large transactions. A 10,000 record batch update in my case went from 15 minutes to 3 seconds when the thread limit went from 10,000 to 50,000 threads.

Be aware that other configuration settings will also affect your system and that WAL logging is necessary for replication, since WAL logs are used to recreate state in a hot-swapping environment.
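A postgresql.conf fragment reflecting the settings above might look as follows; the values are examples rather than recommendations, and wal_level = minimal also rules out WAL archiving and streaming replication:

```
# postgresql.conf -- example insert-tuning fragment
wal_level = minimal          # least WAL detail; incompatible with replication
wal_writer_delay = 100ms     # flush WAL every 100 ms instead of the 200 ms default
```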

The Benefits of Combining Fork Join Pools with PostgreSQL

Warning: As I am incredibly busy at the moment, benchmarks are not necessarily provided. This is a performance review based on work experience and the accompanying documents for my superiors.

PostgreSQL is a terrific free tool. In my opinion, it can do almost everything an Oracle product can, and it has quick functions for performing critical tasks such as dumping an entire database to CSV from a programming language. However, insertion can be slow. I am constantly updating three or more databases with 15 or more attributes apiece, and an Oracle or Microsoft license would cost a significant share of my company's estimated revenue. While the PostgreSQL developers promise multi-threading in the future, Java, together with Apache DBCP, has already solved a significant portion of this task for us using fork join pools. The result can be an improvement of thousands of records.

Fork Join Pools and Why to use SE 8

Java SE 7 introduced the fork join pool to the java.util.concurrent package. Oracle recommends using it only for intensive tasks. The tool works by sharing tasks among threads: work-stealing. Java SE 8 improves on the algorithm and reduces the number of tasks that are dropped.

Setting Up the Pool In Spring 4

I reviewed several different connection pools before settling on the vastly improved, newest version of Apache DBCP. BoneCP offers better raw performance, but at the cost of critical flaws: chiefly, its current version fails to close connections properly, leading its developer to recommend reverting to version 0.8. The new Apache DBCP outperformed C3P0 in my benchmark tests. My comparison is provided below; it is based on connections both from a machine over the internet to a co-location facility and from a machine to a local machine.

Connection Pool | Pros     | Cons
Apache DBCP     | Reliable | Somewhat slower than BoneCP
BoneCP          | Fast     | Somewhat unreliable
C3P0            | Reliable | Slow; a worse option than DBCP

Setting up DBCP is incredibly simple and configurable. Spring accepts Java-based or XML configuration; for my purposes, XML was a decent option since my teammates use XML in their daily work, so the XML configuration is provided below. A decent way to set this up is to reference the data source bean in the declaration of the DAO template's data source.

<bean id="dataSource" destroy-method="close"
    <property name="driverClassName" value="${jdbcdriver}"/>
    <property name="url" value="${jdbcurl}"/>
    <property name="username" value="${dbcusername}"/>
    <property name="password" value="${dbcpass}"/>
    <property name="initialSize" value="3"/>
    <property name="validationQuery" value="Select 1" />

Other options are provided at the DBCP site. A validation query is provided here since I ran into an issue with the validation query in my own work.

A reference can be provided to the DAO template using:

     <bean id="jdbcTemplate" ref="dataSource"/>

Delivering the Code with a DAO Template: Avoiding Race Conditions

An important consideration is how multiple connections will be handled by the DAO template. Fortunately, Spring can run a method asynchronously, without declaring it synchronized, via the @Async annotation.

   @Async
   protected void postData(Map<String, String> data) {
         // hand the batch to the DAO template here
   }

The method above is now accessed asynchronously.

Spring accepts the highest class in the collection hierarchy that the more common collections implement or extend. Basically, due to inheritance, a HashMap "is-a" Map, but a Map cannot blindly be recast to HashMap. Call the getClass() method to see this in action: if getClass() returns java.util.HashMap, it is possible to recast the Map to HashMap to gain any benefits beyond the Map interface. Objects are passed by reference (by memory location, more specifically), so the underlying object retains its concrete type.

The Fork Join Pool

In this instance, the fork join pool should accept RecursiveAction tasks. These are not only useful for recursive work; thanks to work-stealing, they improve I/O-bound and other intensive tasks as well. Oddly, I ran into an issue where RecursiveTask instances returning values (my parsers) would not completely close, so I switched to using Callables instead. Fork join pools explicitly accept Runnable, Callable, and ForkJoinTask classes; Thread, Callable, Runnable, ForkJoinTask, and RecursiveAction objects are all acceptable.

Instantiation for this task requires:

ForkJoinPool fjp=new ForkJoinPool(Runtime.getRuntime().availableProcessors()*procnum);

The argument is the desired parallelism. A non-positive value throws an IllegalArgumentException.

Keep in mind that Amdahl's law applies: too many processes will slow the work down, as will inappropriate use.

To submit a process use:

fjp.execute(new SQLRecursiveAction(/**parameters**/));

The SQL class (in this case an inner class) would be:

private class SQLRecursiveAction extends RecursiveAction {
      private final String aString;

      public SQLRecursiveAction(String aString) {
            this.aString = aString;
      }

      @Override
      protected void compute() {
            /** call to DAO template **/
      }
}
The compute method is required because RecursiveAction declares it as a protected abstract method; every concrete subclass must implement it.

Once ready, do not shutdown the Fork Join Task. Instead, there are other ways to wait for completion that allow reuse.

int w = 0;
while (!fjp.isQuiescent() && fjp.getActiveThreadCount() > 0) {
      w++;
      // log("Waited for " + w + " cycles");
}

When complete, shutdown is necessary. This should be done when finished with the pool.

fjp.shutdown(); // orderly shutdown
// fjp.shutdownNow() for immediate shutdown


In the end, the fork join pool significantly improved the performance of my application, which can parse any type of document into attributes stored as JSON strings and posted to the database. The number of records I was able to parse and post increased dramatically, from several hundred per minute to over 10,000 records with the same number of attributes. It may also be wise to consider other improvements, such as Tokutek's fractal tree indexing for large, normalized databases.