PostgreSQL Automatic Failover

High availability for Postgres, based on Pacemaker and Corosync.

Frequently Asked Questions

Q: Why a new resource agent for PostgreSQL?

A: The resource-agents project already has a PostgreSQL agent. This agent supports both stateless and multi-state setups of your PostgreSQL cluster, which makes its code large and complex and, with 32 parameters, makes it confusing and complicated to set up.

On top of this, because PostgreSQL did not support a clean demote at that time, the RA tries hard to match PostgreSQL capabilities to Pacemaker requirements with complicated workarounds, which makes it hard to manage. A lot boils down to the lock file required to protect the cluster against corruption after a demote. Because of it, you must respect a strict stop/start order of the nodes and you cannot swap the Master role between nodes (let’s call that a “switchover”).

Moreover, the existing PostgreSQL agent takes control of the PostgreSQL configuration file during initial configuration and adjusts it to the situation.

Our main objective was to write a new, much simpler resource agent: code as simple as possible, minimal Pacemaker setup requirements, as close as possible to PostgreSQL’s current capabilities, non-intrusive, and as robust as possible. PAF supports starting and stopping any node in the cluster without blowing up everything else. It supports switchover without headaches.

Being PostgreSQL DBAs, we prefer to take care of PostgreSQL’s setup ourselves, knowing that the resource agent is not messing with the setup or internal mechanisms. This makes the Pacemaker setup much simpler and the PostgreSQL configuration much more adaptable to the cluster topology.

Q: Why Perl?

A: Let’s answer “why not bash?” first.

We started this project with bash. Bash can be a powerful language and it is possible to write clean code with it. It is the perfect language for sysadmins. But some limitations inherent to the language convinced us to switch to a more advanced one.

One of these limitations is the need to call su, sudo or runuser to run commands as a non-privileged user, e.g. to start the PostgreSQL daemon. It seems to us much more logical, clean, lightweight and safe to use setuid/setgid to drop privileges and execute the daemon. At the very least, it does not require messing with PAM or systemd in some environments.
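As an illustration of the setuid/setgid approach mentioned above, here is a minimal sketch written in Python. It is not PAF's code (PAF is written in Perl), and the function name drop_privileges is a hypothetical helper invented for this example.

```python
import os


def drop_privileges(uid, gid):
    """Permanently switch the current process to the given uid/gid.

    Hypothetical helper, for illustration only. Order matters: the group
    has to be changed first, while the process is still allowed to do so,
    and the user id is given up last.
    """
    if os.geteuid() == 0:
        # Only root may reset the supplementary groups; drop them so the
        # process does not keep root's group memberships.
        os.setgroups([gid])
    os.setgid(gid)
    os.setuid(uid)
    # Sanity check: refuse to continue with unexpected credentials.
    if os.getuid() != uid or os.getgid() != gid:
        raise PermissionError("failed to drop privileges")
```

A process started as root would typically look up the target account (for example with pwd.getpwnam("postgres"), the account name being an assumption here), fork, call drop_privileges in the child and then exec the daemon. No su, sudo or runuser is involved.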

Moreover, Perl is much cleaner to read and offers more control structures and capabilities than bash. For complex and “large” projects, we prefer to use something other than bash.

Why Perl? Because we know Perl. We could have used python, ruby, perl6, javascript, whatever-you-prefer, but we are just more comfortable with Perl. No other arguments.

Q: But Perl is heavier!

A: Just a bit more. Not that much. It’s not 10x the memory usage of bash.

Oh wait, by the way, did you notice that a lot of fencing agents are written in python or perl? Among the /usr/sbin/fence_* scripts on my system, I can find:

      1 #!/bin/bash
     15 #!/usr/bin/perl
     25 #!/usr/bin/python
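
A tally like the one above can be reproduced with a few lines of code. The following Python sketch is one possible way to do it; the function name count_shebangs is an assumption made for this example, and the counts will of course differ from one system to another.

```python
import glob
from collections import Counter


def count_shebangs(pattern):
    """Tally the '#!' first line of every file matching the glob pattern."""
    counts = Counter()
    for path in glob.glob(pattern):
        try:
            with open(path, "rb") as fh:
                first = fh.readline().decode("utf-8", "replace").strip()
        except OSError:
            continue  # directory or unreadable file: skip it
        if first.startswith("#!"):
            counts[first] += 1
    return counts


if __name__ == "__main__":
    # Print the tally, most frequent interpreter first.
    for shebang, n in count_shebangs("/usr/sbin/fence_*").most_common():
        print(f"{n:7d} {shebang}")
```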

Q: Why Pacemaker?

A: Pacemaker is the industry reference for high availability on Linux systems.

It is highly reliable, configurable, and supports many topologies, so it makes sense to benefit from its well-tested features instead of reinventing something that already works well.

Q: What versions?

A: PAF is designed to work with PostgreSQL 9.3 and higher.

As for Pacemaker, we tested various configurations; the stacks available on the following systems are confirmed to work with PAF:

Q: Will PAF protect me against data loss?

A: No, PAF will not do that.

The whole point of a resource agent is to automate the failover process, thus minimizing the RTO (Recovery Time Objective). By design, PAF does not interfere with your PostgreSQL configuration; it only has minimal requirements (in short, hot standby streaming replication has to be enabled). So when you configure PostgreSQL, you can choose between synchronous and asynchronous replication. If you choose asynchronous replication (which is the default), transactions are committed on the primary before they are applied on the standbys. In case of a failover, PAF promotes the most up-to-date standby, thus minimizing data loss, but some loss is still possible.
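
To make the notion of "most up-to-date standby" more concrete, here is a small Python sketch that compares write-ahead log positions (LSNs) in PostgreSQL's textual form, two hexadecimal numbers separated by a slash. It is not PAF's actual election logic; the function names, node names and LSN values are all made up for illustration.

```python
def parse_lsn(lsn):
    """Convert a textual LSN such as '16/B374D848' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def most_advanced(standbys):
    """Return the name of the standby with the highest LSN.

    LSNs are compared as numbers: comparing them as strings would wrongly
    rank '0/A000000' above '0/10000000'.
    """
    return max(standbys, key=lambda name: parse_lsn(standbys[name]))


def bytes_behind(primary_lsn, standby_lsn):
    """WAL bytes written on the primary but not yet on the standby."""
    return parse_lsn(primary_lsn) - parse_lsn(standby_lsn)


# Hypothetical state at the moment the primary is lost.
standbys = {"srv2": "0/3000148", "srv3": "0/3000060"}
best = most_advanced(standbys)
print(best, bytes_behind("0/3000200", standbys[best]))
```

In this made-up scenario srv2 is the best candidate, yet it is still 184 bytes of WAL behind the lost primary. With asynchronous replication, whatever was committed in that gap is gone after the promotion.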

Service high availability (what PAF provides) and data high availability (what PostgreSQL’s synchronous replication provides) are two different concepts, and are sometimes mutually exclusive.

Q: Pacemaker triggered a failover, and now my old primary cannot join the cluster anymore, what should I do?

A: You need to rebuild your old primary from the new primary instance first.

See the “Failover” section in the administration page.