Commit 52b3ec8

patjones authored and jhunt committed
HA Tuning (#30)
Previously, the number of times psql was allowed to fail was hard-coded at 3. We found a need to adjust this number, so it is now a parameter that defaults to 3.
1 parent efcf5df commit 52b3ec8

4 files changed: 21 additions and 6 deletions

README.md

Lines changed: 7 additions & 0 deletions
@@ -145,6 +145,13 @@ The following parameters affect high availability:
   the faster your cluster will failover, but the higher a risk
   of accidental failover and split-brain. Defaults to `5`.

+- `postgres.replication.psql_error_count` - How many failed PSQL
+  commands allowed before considering it a failure. The health
+  checks are PSQL commands executed every second. Poor network
+  conditions may result in a "Connection dropped" PSQL error.
+  The lower this value, the higher potential for accidental
+  failover and split-brain. Defaults to `3`.
+
 - `vip.readonly_port` - Which port to access the read-only node
   of the cluster. Defaults to `7542`.
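
To get a feel for how `psql_error_count` interacts with `connect_timeout`, here is a rough back-of-the-envelope sketch. It assumes that every failed check either fails instantly or blocks for the full connect timeout, that the connect timeout applies to these psql calls, and that each failure is followed by the 1-second sleep seen in the `is_master` loop. `estimate_failover_window` is a hypothetical helper for illustration and is not part of this release.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimates how long the monitor may keep retrying
# before it gives up on the other node.
# Prints "<best-case seconds> <worst-case seconds>".
estimate_failover_window() {
  local error_count=$1 connect_timeout=$2
  # Best case: every check fails immediately, so only the 1s sleeps add up.
  local best=$(( error_count * 1 ))
  # Worst case (assumed): every check blocks for the full connect timeout.
  local worst=$(( error_count * (connect_timeout + 1) ))
  echo "${best} ${worst}"
}

# With the defaults from this commit (psql_error_count=3, connect_timeout=5):
estimate_failover_window 3 5   # prints "3 18"
```

Raising `psql_error_count` widens this window, which trades slower failover for fewer accidental failovers on a flaky network.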

ci/release_notes.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# Improvements
+
+- Added `postgres.replication.psql_error_count` configuration value
+  which allows an operator to define how many failed health checks
+  occur before failover.

jobs/postgres/spec

Lines changed: 5 additions & 3 deletions
@@ -99,13 +99,15 @@ properties:
   postgres.replication.master:
     description: IP address of the preferred master node (should be the 0th postgres node's IP)
     default: ~
-
   postgres.replication.grace:
     description: Grace period (in seconds) to look for an existing PostgreSQL master node on boot.
     default: 15
   postgres.replication.connect_timeout:
-    description: How long (in seconds) to wait before timimg out a failover health check from one node to the other.
+    description: How long (in seconds) to wait before timing out a failover health check from one node to the other.
     default: 5
+  postgres.replication.psql_error_count:
+    description: How many failed attempts to check the other node's status before assuming failure.
+    default: 3

   postgres.users:
     description: "A list of {username: ..., password: ...} objects for defining PostgreSQL users. Setting the 'admin:' key on a user will make them a superuser."
@@ -122,4 +124,4 @@ properties:
         - porcupine
         - hedgehog
       extensions: # optional array of extensions to enable on this database
-        - citext
+        - citext
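
The spec gives `psql_error_count` a default but does not constrain its value. Because the retry loop in `is_master` runs `while (( $error_count < $error_tolerance ))`, a value of `0` would skip the health check entirely and fall straight through to the failure path. A guard along these lines could catch that early. This is only a sketch: `validate_error_tolerance` is a hypothetical function and is not part of this commit.

```shell
#!/usr/bin/env bash
# Hypothetical guard: accept only positive integers (1, 2, 3, ...).
validate_error_tolerance() {
  local value=$1
  if [[ "$value" =~ ^[1-9][0-9]*$ ]]; then
    return 0
  fi
  echo "psql_error_count must be a positive integer, got '${value}'" >&2
  return 1
}

# Example: fall back to the documented default of 3 when the value is unusable.
error_tolerance=0
validate_error_tolerance "$error_tolerance" 2>/dev/null || error_tolerance=3
echo "$error_tolerance"
```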

jobs/postgres/templates/bin/functions

Lines changed: 4 additions & 3 deletions
@@ -13,8 +13,9 @@ is_master() {

   # psql can experience transient issues (like connection reset)
   # make is_master more resilient against these kinds of errors
+  error_tolerance=<%= p('postgres.replication.psql_error_count') %>
   error_count=0
-  while (( $error_count < 3 )) ; do
+  while (( $error_count < $error_tolerance )) ; do
     tf=$(echo $(psql $opts postgres -t -c 'SELECT pg_is_in_recovery()' 2>&1));

     if [[ "$tf" == "f" ]]; then
@@ -25,12 +26,12 @@ is_master() {
       echo "[monitor] received unexpected response from postgres DB while checking master/replica status:"
       echo "[monitor] $tf"
       ((error_count++))
-      echo "[monitor] will attempt to check master/replica status again (check $error_count of 3)"
+      echo "[monitor] will attempt to check master/replica status again (check $error_count of $error_tolerance)"
       sleep 1
       continue
     fi
   done
-  # we errored out 3 times, return that other node is not master
+  # we errored out <%= p('postgres.replication.psql_error_count') %> times, return that other node is not master
   echo "[monitor] couldn't determine who was master or replica due to postgres errors. assuming i'm master."
   return 1
 }
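
The retry pattern in this hunk can be tried outside of BOSH with a small standalone sketch. It is not the release's `is_master` function: `check_with_tolerance`, the `RETRY_DELAY` variable, and the distinct return code `2` for "attempts exhausted" are all assumptions made for illustration, and the command under test is passed in as arguments so that a stub can stand in for `psql`.

```shell
#!/usr/bin/env bash
# check_with_tolerance TOLERANCE CMD [ARGS...]
# Runs CMD until it prints "f" or "t", or until TOLERANCE attempts have failed.
# Return codes (chosen for this sketch): 0 for "f", 1 for "t", 2 when exhausted.
check_with_tolerance() {
  local tolerance=$1; shift
  local retry_delay=${RETRY_DELAY:-1}   # hypothetical knob; the commit sleeps 1s
  local error_count=0 raw out
  while (( error_count < tolerance )); do
    raw=$("$@" 2>&1)
    out=${raw//[[:space:]]/}            # strip whitespace around psql -t output
    case "$out" in
      f) return 0 ;;                    # pg_is_in_recovery() = false
      t) return 1 ;;                    # pg_is_in_recovery() = true
    esac
    error_count=$(( error_count + 1 ))
    echo "[sketch] unexpected response: ${raw} (check ${error_count} of ${tolerance})" >&2
    sleep "$retry_delay"
  done
  return 2
}

# Usage with a stub in place of psql; RETRY_DELAY=0 keeps the demo fast.
RETRY_DELAY=0
check_with_tolerance 3 echo " f"                 && echo "other node answered f"
check_with_tolerance 2 echo "connection reset"   || echo "gave up with code $?"
```

Returning a separate code when attempts run out lets the caller tell "the node answered" apart from "the node could not be reached", which the single `return 1` path in the diff does not distinguish.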
