Commit 244d47c

proplexjhunt authored and committed
Allow one-off PSQL errors during checks, check for split-brain (#26)
We found that the parameters around checking who was master were too strict: a single transient PSQL error (such as a connection reset) would promote the replica to master and leave the cluster in a dual master-master (split-brain) configuration. Three consecutive errors are now required before the replica becomes master in scenarios where the master is running but not accepting PSQL commands. We have also added a check for split-brain configurations, piggy-backed onto the existing status checks: if both nodes are master, both immediately shut down their postgres, haproxy, and monitor processes. This puts the VM into a failing state in BOSH, which should be very easy to spot for anyone with a monitoring solution (e.g. Prometheus). To recover from this failure mode, follow the step-by-step instructions in README.md (it's easy).
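For context, the master/replica check used by these scripts boils down to asking Postgres whether it is in recovery. A sketch of how an operator might verify a suspected split-brain by hand, assuming placeholder node addresses and the default port of 6432 from `bin/functions`:

```bash
# Run the same role query the monitor uses against both nodes.
# 10.0.0.10 and 10.0.0.11 are illustrative addresses, not values from this release.
for node in 10.0.0.10 10.0.0.11; do
  echo -n "$node: "
  psql -h "$node" -p 6432 postgres -t -c 'SELECT pg_is_in_recovery()'
done
# "f" = not in recovery (acting as master); "t" = in recovery (replica).
# Two "f" answers at the same time indicate a split-brain.
```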
1 parent efc89a7 commit 244d47c

File tree

6 files changed (+111, -7 lines)

README.md

Lines changed: 34 additions & 0 deletions

```diff
@@ -150,3 +150,37 @@ The following parameters affect high availability:
 
 - `vip.vip` - Which IP to use as a VIP that is traded between the
   two nodes.
+
+### HA Failure Modes
+
+Our HA solution is focused on preventing downtime in the face of
+upgrades or other single-node failures. As such, we do not attempt to
+solve scenarios where the two databases cannot communicate with one
+another (e.g. a network partition). In that case, it is possible that
+the replica believes the master to be down and promotes itself to
+master. The Postgres servers are then in a state of "split-brain" and
+requests to the DB will be split between the two nodes.
+
+To mitigate this, each node checks to see who is master. If both
+nodes are master (split-brain), both immediately shut down to prevent
+inconsistent data states. *This will result in downtime*, but we
+believe downtime is preferable to inconsistent database states.
+
+However, this mitigation is not a silver bullet; a prolonged network
+outage between the two nodes may prevent them from checking who is
+master, and they will continue to operate in split-brain fashion.
+We do not attempt to solve this.
+
+### Recovery From Failure Mode
+
+After the database has been validated and a node has been chosen to
+become master, SSH into that node via `bosh ssh postgres/#` and
+execute `/var/vcap/jobs/postgres/bin/recover` as root. This node
+will then become master.
+
+Once the script has executed successfully, SSH into the other node
+via `bosh ssh postgres/#` and execute
+`/var/vcap/jobs/postgres/bin/recover` as root. This node will then
+replicate from the new master.
+
+You will now have a nominal Postgres running.
```
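A sketch of that recovery flow from an operator's shell; the deployment name and instance indexes below are placeholders, and the `-d` flag assumes the v2 BOSH CLI:

```bash
# 1. On the node chosen to become master:
bosh -d postgres ssh postgres/0
sudo /var/vcap/jobs/postgres/bin/recover   # this node becomes master

# 2. After that succeeds, on the other node:
bosh -d postgres ssh postgres/1
sudo /var/vcap/jobs/postgres/bin/recover   # this node replicates from the new master
```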

ci/release_notes.md

Lines changed: 11 additions & 0 deletions

```diff
@@ -0,0 +1,11 @@
+## Improvements
+
+- Postgres deployed as HA will now shut down in cases of split-brain.
+- Added an errand to help recover Postgres VMs after a failure mode.
+
+If the two nodes notice that both are master (which can occur in
+certain cases; see README.md for more information), we have opted to
+have both VMs shut down their Postgres processes in the interest of
+data integrity. To assist in this process, we have added a script
+to start the processes again. Please see README.md for more
+information on this process.
```

jobs/postgres/spec

Lines changed: 2 additions & 0 deletions

```diff
@@ -8,6 +8,8 @@ templates:
   bin/healthy: bin/healthy
   bin/monitor: bin/monitor
 
+  bin/recover: bin/recover
+
   bin/functions: bin/functions
 
   data/properties.sh.erb: data/properties.sh
```
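With this mapping, BOSH renders the `bin/recover` template into the job's `bin` directory on the VM, which is the path the README's recovery instructions reference. A quick, hedged way to confirm on a deployed node:

```bash
ls -l /var/vcap/jobs/postgres/bin/recover   # rendered from the template added above
```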

jobs/postgres/templates/bin/functions

Lines changed: 24 additions & 6 deletions

```diff
@@ -10,9 +10,27 @@ is_master() {
   else
     opts="$opts -p <%= p('postgres.config')["port"] || 6432 %>"
   fi
-  tf=$(echo $(psql $opts postgres -t -c 'SELECT pg_is_in_recovery()' 2>/dev/null));
-  if [[ "$tf" == "f" ]]; then
-    return 0
-  fi
-  return 1
-}
+
+  # psql can experience transient issues (like connection reset)
+  # make is_master more resilient against these kinds of errors
+  error_count=0
+  while (( $error_count < 3 )); do
+    tf=$(echo $(psql $opts postgres -t -c 'SELECT pg_is_in_recovery()' 2>&1));
+
+    if [[ "$tf" == "f" ]]; then
+      return 0
+    elif [[ "$tf" == "t" || "$tf" =~ (could not connect to server)|(starting up) ]]; then
+      return 1
+    else
+      echo "[monitor] received unexpected response from postgres DB while checking master/replica status:"
+      echo "[monitor] $tf"
+      ((error_count++))
+      echo "[monitor] will attempt to check master/replica status again (check $error_count of 3)"
+      sleep 1
+      continue
+    fi
+  done
+  # we errored out 3 times, return that other node is not master
+  echo "[monitor] couldn't determine who was master or replica due to postgres errors. assuming i'm master."
+  return 1
+}
```

jobs/postgres/templates/bin/monitor

Lines changed: 11 additions & 1 deletion

```diff
@@ -53,7 +53,17 @@ case $1 in
     while true; do
       sleep 1
       if is_master; then
-        continue
+        if ! is_master $MASTER_IP <%= port %>; then
+          continue
+        else
+          echo "[monitor] split-brain detected. both nodes are master. shutting down postgres, haproxy, and monitor to prevent inconsistent data"
+          sleep 2 # done to ensure the other node notices split-brain as well.
+          /var/vcap/bosh/bin/monit stop postgres
+          /var/vcap/bosh/bin/monit stop haproxy
+          /var/vcap/bosh/bin/monit stop monitor
+          rm -f $RUN_DIR/monitor.pid
+          exit 0
+        fi
       fi
 
       # we are a replica, determine who we talk to
```
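Once this shutdown has fired, the stopped processes are visible through monit on the VM and as failing processes in BOSH, which is how the commit message expects operators to notice the event. A hedged sketch, assuming a deployment named `postgres`:

```bash
# On either VM (monit's path on BOSH stemcells, as used by the monitor above):
/var/vcap/bosh/bin/monit summary     # postgres, haproxy, and monitor show as stopped

# From the operator's workstation:
bosh -d postgres instances --ps      # the instances report failing processes
```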
jobs/postgres/templates/bin/recover

Lines changed: 29 additions & 0 deletions

```diff
@@ -0,0 +1,29 @@
+#!/bin/bash
+set -u # report the usage of uninitialized variables
+
+if [ "$EUID" != 0 ]; then
+  echo "Please run recovery as root"
+  exit 1
+fi
+
+running_processes=$(ps cax | grep -Po "(haproxy)|(postgres)|(monitor)")
+if [[ ! -z "$running_processes" ]]; then
+  echo "Services are currently running on this node that should have been stopped."
+  echo "Currently running services that should not be running:"
+  echo "$running_processes"
+  exit 1
+else
+  /var/vcap/bosh/bin/monit start monitor
+  /var/vcap/bosh/bin/monit start postgres
+  /var/vcap/bosh/bin/monit start haproxy
+  sleep 2
+  # verify that all three processes actually came up
+  if [[ -z $(ps cax | grep -Pzo "(?s)^(?=.*\bmonitor\b)(?=.*\bpostgres\b)(?=.*\bhaproxy\b).*$") ]]; then
+    echo "Failed to start processes."
+    /var/vcap/bosh/bin/monit status
+    exit 1
+  else
+    echo "All processes have been started."
+    exit 0
+  fi
+fi
```
