Commit 244d47c

proplexjhunt authored and committed
Allow one-off PSQL errors during checks, check for split-brain (#26)
We found that the parameters around checking who was master were too strict: a single transient PSQL error (such as a connection reset) would promote the replica to master and leave the cluster in a dual master-master (split-brain) configuration. Three consecutive errors are now required before the replica becomes master in scenarios where the master is running but not accepting PSQL commands. We have also added a check for split-brain configurations, piggy-backed onto the existing status checks: if both nodes are master, both immediately shut down their postgres, haproxy, and monitor processes. This puts the VM into a failing state in BOSH, which should be very easy to spot for anyone with a monitoring solution (e.g. Prometheus). To recover from this failure mode, follow the step-by-step instructions in README.md (it's easy).
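For context, the master/replica check used by these scripts boils down to asking Postgres whether it is in recovery. A sketch of how an operator might verify a suspected split-brain by hand, assuming placeholder node addresses and the default port of 6432 from `bin/functions`:

```bash
# Run the same role query the monitor uses against both nodes.
# 10.0.0.10 and 10.0.0.11 are illustrative addresses, not values from this release.
for node in 10.0.0.10 10.0.0.11; do
  echo -n "$node: "
  psql -h "$node" -p 6432 postgres -t -c 'SELECT pg_is_in_recovery()'
done
# "f" = not in recovery (acting as master); "t" = in recovery (replica).
# Two "f" answers at the same time indicate a split-brain.
```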
1 parent efc89a7 commit 244d47c

File tree

6 files changed (+111, -7 lines)

README.md

Lines changed: 34 additions & 0 deletions

```diff
@@ -150,3 +150,37 @@ The following parameters affect high availability:
 
 - `vip.vip` - Which IP to use as a VIP that is traded between the
   two nodes.
+
+### HA Failure Modes
+
+Our HA solution is focused on preventing downtime in the face of
+upgrades or other single-node failures. As such, we do not attempt to
+solve scenarios where the two databases cannot communicate with one
+another (e.g. a network partition). In that case, it is possible that
+the replica believes the master to be down and promotes itself to
+master. The Postgres servers are then in a state of "split-brain" and
+requests to the DB will be split between the two nodes.
+
+To mitigate this, each node checks to see who is master. If both
+nodes are master (split-brain), both immediately shut down to prevent
+inconsistent data states. *This will result in downtime*, but we
+believe downtime is preferable to inconsistent database states.
+
+However, this mitigation is not a silver bullet; a prolonged network
+outage between the two nodes may prevent them from checking who is
+master, and they will continue to operate in split-brain fashion.
+We do not attempt to solve this.
+
+### Recovery From Failure Mode
+
+After the database has been validated and a node has been chosen to
+become master, SSH into that node via `bosh ssh postgres/#` and
+execute `/var/vcap/jobs/postgres/bin/recover` as root. This node
+will then become master.
+
+Once the script has executed successfully, SSH into the other node
+via `bosh ssh postgres/#` and execute
+`/var/vcap/jobs/postgres/bin/recover` as root. This node will then
+replicate from the new master.
+
+You will now have a nominal Postgres running.
```
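A sketch of that recovery flow from an operator's shell; the deployment name and instance indexes below are placeholders, and the `-d` flag assumes the v2 BOSH CLI:

```bash
# 1. On the node chosen to become master:
bosh -d postgres ssh postgres/0
sudo /var/vcap/jobs/postgres/bin/recover   # this node becomes master

# 2. After that succeeds, on the other node:
bosh -d postgres ssh postgres/1
sudo /var/vcap/jobs/postgres/bin/recover   # this node replicates from the new master
```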

ci/release_notes.md

Lines changed: 11 additions & 0 deletions

```diff
@@ -0,0 +1,11 @@
+## Improvements
+
+- Postgres deployed as HA will now shut down in cases of split-brain.
+- Added an errand to help recover Postgres VMs after a failure mode.
+
+If the two nodes notice that both are master (which can occur in
+certain cases; see README.md for more information), we have opted to
+have both VMs shut down their Postgres processes in the interest of
+data integrity. To assist in this process, we have added a script
+to start the processes again. Please see README.md for more
+information on this process.
```

jobs/postgres/spec

Lines changed: 2 additions & 0 deletions

```diff
@@ -8,6 +8,8 @@ templates:
   bin/healthy: bin/healthy
   bin/monitor: bin/monitor
 
+  bin/recover: bin/recover
+
   bin/functions: bin/functions
 
   data/properties.sh.erb: data/properties.sh
```
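With this mapping, BOSH renders the `bin/recover` template into the job's `bin` directory on the VM, which is the path the README's recovery instructions reference. A quick, hedged way to confirm on a deployed node:

```bash
ls -l /var/vcap/jobs/postgres/bin/recover   # rendered from the template added above
```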

jobs/postgres/templates/bin/functions

Lines changed: 24 additions & 6 deletions

```diff
@@ -10,9 +10,27 @@ is_master() {
   else
     opts="$opts -p <%= p('postgres.config')["port"] || 6432 %>"
   fi
-  tf=$(echo $(psql $opts postgres -t -c 'SELECT pg_is_in_recovery()' 2>/dev/null));
-  if [[ "$tf" == "f" ]]; then
-    return 0
-  fi
-  return 1
-}
+
+  # psql can experience transient issues (like connection reset)
+  # make is_master more resilient against these kinds of errors
+  error_count=0
+  while (( $error_count < 3 )); do
+    tf=$(echo $(psql $opts postgres -t -c 'SELECT pg_is_in_recovery()' 2>&1));
+
+    if [[ "$tf" == "f" ]]; then
+      return 0
+    elif [[ "$tf" == "t" || "$tf" =~ (could not connect to server)|(starting up) ]]; then
+      return 1
+    else
+      echo "[monitor] received unexpected response from postgres DB while checking master/replica status:"
+      echo "[monitor] $tf"
+      ((error_count++))
+      echo "[monitor] will attempt to check master/replica status again (check $error_count of 3)"
+      sleep 1
+      continue
+    fi
+  done
+  # we errored out 3 times, return that other node is not master
+  echo "[monitor] couldn't determine who was master or replica due to postgres errors. assuming i'm master."
+  return 1
+}
```

jobs/postgres/templates/bin/monitor

Lines changed: 11 additions & 1 deletion

```diff
@@ -53,7 +53,17 @@ case $1 in
     while true; do
       sleep 1
       if is_master; then
-        continue
+        if ! is_master $MASTER_IP <%= port %>; then
+          continue
+        else
+          echo "[monitor] split-brain detected. both nodes are master. shutting down postgres, haproxy, and monitor to prevent inconsistent data"
+          sleep 2 # done to ensure the other node notices split-brain as well.
+          /var/vcap/bosh/bin/monit stop postgres
+          /var/vcap/bosh/bin/monit stop haproxy
+          /var/vcap/bosh/bin/monit stop monitor
+          rm -f $RUN_DIR/monitor.pid
+          exit 0
+        fi
       fi
 
       # we are a replica, determine who we talk to
```
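Once this shutdown has fired, the stopped processes are visible through monit on the VM and as failing processes in BOSH, which is how the commit message expects operators to notice the event. A hedged sketch, assuming a deployment named `postgres`:

```bash
# On either VM (monit's path on BOSH stemcells, as used by the monitor above):
/var/vcap/bosh/bin/monit summary     # postgres, haproxy, and monitor show as stopped

# From the operator's workstation:
bosh -d postgres instances --ps      # the instances report failing processes
```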
jobs/postgres/templates/bin/recover

Lines changed: 29 additions & 0 deletions

```diff
@@ -0,0 +1,29 @@
+#!/bin/bash
+set -u # report the usage of uninitialized variables
+
+if [ "$EUID" != 0 ]; then
+  echo "Please run recovery as root"
+  exit 1
+fi
+
+running_processes=$(ps cax | grep -Po "(haproxy)|(postgres)|(monitor)")
+if [[ ! -z "$running_processes" ]]; then
+  echo "Services are currently running on this node that should have been stopped."
+  echo "Currently running services that should not be running:"
+  echo "$running_processes"
+  exit 1
+else
+  /var/vcap/bosh/bin/monit start monitor
+  /var/vcap/bosh/bin/monit start postgres
+  /var/vcap/bosh/bin/monit start haproxy
+  sleep 2
+  # verify that all three processes actually came up
+  if [[ -z $(ps cax | grep -Pzo "(?s)^(?=.*\bmonitor\b)(?=.*\bpostgres\b)(?=.*\bhaproxy\b).*$") ]]; then
+    echo "Failed to start processes."
+    /var/vcap/bosh/bin/monit status
+    exit 1
+  else
+    echo "All processes have been started."
+    exit 0
+  fi
+fi
```
