Increase monit timeout on Postgres job start

proplex · proplex · commit 636f93d0f4ad · 2018-08-20T14:38:31.000-04:00
We've discovered a bug in the following scenario:

HA Postgres is in use, one node is down, and the other node is
attempting to start.

The starting node takes about 30-40 seconds to bootstrap while it waits
for the other node to potentially come online. The default `monit start`
timeout is 30 seconds, so in some cases monit kills Postgres before it
has finished starting, and the start attempt then repeats in a loop.
This commit fixes that by extending the start timeout to 60 seconds for
both the postgres and monitor processes.
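
The timing argument above can be illustrated with a small simulation. This is
only a sketch and not part of the change: the function names are hypothetical,
and it assumes every start attempt takes the same bootstrap time and that the
supervisor kills any attempt that outlasts the timeout.

```python
from typing import Optional


def starts_within_timeout(bootstrap_seconds: float, timeout_seconds: float) -> bool:
    """Return True if a start attempt finishes before the timeout expires."""
    return bootstrap_seconds <= timeout_seconds


def attempts_until_started(
    bootstrap_seconds: float,
    timeout_seconds: float,
    max_attempts: int = 5,
) -> Optional[int]:
    """Return the attempt number that succeeds, or None if all are killed.

    Assumption: each attempt takes the same time, so a timeout shorter
    than the bootstrap time never recovers on its own.
    """
    for attempt in range(1, max_attempts + 1):
        if starts_within_timeout(bootstrap_seconds, timeout_seconds):
            return attempt
    return None


if __name__ == "__main__":
    # Bootstrap times taken from the 30-40 second range in the message.
    for timeout in (30, 60):
        for bootstrap in (30, 35, 40):
            result = attempts_until_started(bootstrap, timeout)
            outcome = "restart loop" if result is None else f"started on attempt {result}"
            print(f"timeout={timeout}s bootstrap={bootstrap}s: {outcome}")
```

Under these assumptions, any bootstrap time above 30 seconds never succeeds
with the old timeout, while the whole 30-40 second range fits within 60.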
diff --git a/ci/release_notes.md b/ci/release_notes.md
@@ -0,0 +1,5 @@
+# Improvements
+
+Increase `monit start` timeout of the Postgres job to 60 seconds (previously 30
+seconds). This fixes a bug where the Postgres job would be prematurely killed by
+monit during boot.
diff --git a/jobs/postgres/monit b/jobs/postgres/monit
@@ -1,13 +1,13 @@
 check process postgres
   with pidfile /var/vcap/sys/run/postgres/postgres.pid
-  start program "/var/vcap/jobs/postgres/bin/ctl start"
+  start program "/var/vcap/jobs/postgres/bin/ctl start" with timeout 60 seconds
   stop  program "/var/vcap/jobs/postgres/bin/ctl stop"
   group vcap
 
 <% if p('postgres.replication.enabled') %>
 check process monitor
   with pidfile /var/vcap/sys/run/postgres/monitor.pid
-  start program "/var/vcap/jobs/postgres/bin/monitor start"
+  start program "/var/vcap/jobs/postgres/bin/monitor start" with timeout 60 seconds
   stop  program "/var/vcap/jobs/postgres/bin/monitor stop"
   group vcap
 <% end %>