-
Notifications
You must be signed in to change notification settings - Fork 34
/
Copy path_mysql_bootstrapping.html.md.erb
548 lines (389 loc) · 20.9 KB
/
_mysql_bootstrapping.html.md.erb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
This topic describes how to bootstrap your MySQL cluster in the event of a
cluster failure.
You can bootstrap your cluster by using one of two methods:
- Run the bootstrap errand. See [Run the Bootstrap Errand](#assisted-bootstrap) below.
- Bootstrap manually. See [Bootstrap Manually](#manual-bootstrap) below.
## <a id="when-to-bootstrap"></a> When to Bootstrap
You must bootstrap a cluster that loses quorum. A cluster loses quorum when less than half of the
nodes can communicate with each other for longer than the configured grace period. If a cluster
does not lose quorum, individual unhealthy nodes automatically rejoin the cluster after resolving
the error, restarting the node, or restoring connectivity.
To check whether your cluster has lost quorum, look for the following symptoms:
- All nodes appear "Unhealthy" on the proxy dashboard, viewable at <%= vars.switchboard_url %>:
![3 out of 3 nodes are unhealthy.](quorum-lost.png)
- All responsive nodes report the value of `wsrep_cluster_status` as `non-Primary`:
<pre class="terminal">
mysql> SHOW STATUS LIKE 'wsrep_cluster_status';
+----------------------+-------------+
| Variable_name | Value |
+----------------------+-------------+
| wsrep_cluster_status | non-Primary |
+----------------------+-------------+
</pre>
<!-- Is it responsive or unresponsive? -->
- All unresponsive nodes respond with `ERROR 1047` when using most statement types
in the MySQL client:
<pre class="terminal">
mysql> select * from mysql.user;
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
</pre>
For more information about checking the state of your cluster, see
[Check Cluster State](https://docs.pivotal.io/p-mysql/1-10/troubleshooting.html#check-state).
## <a id="assisted-bootstrap"></a>Run the Bootstrap Errand
MySQL for PCF includes a
[BOSH errand](http://bosh.io/docs/jobs.html#jobs-vs-errands) that automates the manual
bootstrapping procedure in the [Bootstrap Manually](#manual-bootstrap) section below.
It finds the node with the highest transaction sequence number and asks it to
start up by itself (i.e. in bootstrap mode) and then asks the remaining nodes
to join the cluster.
In most cases, running the errand recovers your cluster.
But, certain scenarios require additional steps.
### <a id="determine-failure"></a> Determine Type of Cluster Failure
To determine which set of instructions to follow, do the following:
1. Run the following command.
```
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
```
Where:
- `YOUR-ENV` is the environment where you deployed the cluster.
- `YOUR-DEPLOYMENT` is the deployment cluster name.
For example:
<pre class="terminal">
$ bosh -e prod -d mysql instances
</pre>
1. Find and record the `Process State` for your MySQL instances.
<a id="vms"></a>In the following example output, the MySQL instances are in the `failing` process state.
<pre class="terminal">
Instance Process State AZ IPs
backup-restore/c635410e-917d-46aa-b054-86d222b6d1c0 running us-central1-b 10.0.4.47
bootstrap/a31af4ff-e1df-4ff1-a781-abc3c6320ed4 - us-central1-b -
broker-registrar/1a93e53d-af7c-4308-85d4-3b2b80d504e4 - us-central1-b 10.0.4.58
cf-mysql-broker/137d52b8-a1b0-41f3-847f-c44f51f87728 running us-central1-c 10.0.4.57
cf-mysql-broker/28b463b1-cc12-42bf-b34b-82ca7c417c41 running us-central1-b 10.0.4.56
deregister-and-purge-instances/4cb93432-4d90-4f1d-8152-d0c238fa5aab - us-central1-b -
monitoring/f7117dcb-1c22-495e-a99e-cf2add90dea9 running us-central1-b 10.0.4.48
mysql/220fe72a-9026-4e2e-9fe3-1f5c0b6bf09b failing us-central1-b 10.0.4.44
mysql/28a210ac-cb98-4ab4-9672-9f4c661c57b8 failing us-central1-f 10.0.4.46
mysql/c1639373-26a2-44ce-85db-c9fe5a42964b failing us-central1-c 10.0.4.45
proxy/87c5683d-12f5-426c-b925-62521529f64a running us-central1-b 10.0.4.60
proxy/b0115ccd-7973-42d3-b6de-edb5ae53c63e running us-central1-c 10.0.4.61
rejoin-unsafe/8ce9370a-e86b-4638-bf76-e103f858413f - us-central1-b -
smoke-tests/e026aaef-efd9-4644-8d14-0811cb1ba733 - us-central1-b 10.0.4.59
</pre>
1. Choose your scenario:
* If your MySQL instances are in the `failing` state, continue to
[Scenario 1](#cluster-disrupted).
* If your MySQL instances are in the ` - ` state, continue to
[Scenario 2](#vms-terminated).
### <a id="cluster-disrupted"></a>Scenario 1: VMs Running, Cluster Disrupted ##
In this scenario, the VMs are running, but the cluster has been disrupted.
To bootstrap in this scenario, follow these steps:
<!-- Ask a person about this. Is this step necessary? -->
<!-- 1. To select the correct deployment, run the following command: -->
<!--
```
bosh deployment PATH-TO-DEPLOYMENT-MANIFEST
```
-->
<!-- Where `PATH-TO-DEPLOYMENT-MANIFEST` is the path to your deployment -->
<!-- manifest. -->
1. To run the bootstrap errand, run the following command:
```
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT run-errand bootstrap
```
Where:
+ `YOUR-ENV` is the name of your environment.
+ `YOUR-DEPLOYMENT` is the name of your deployment.
<p class='note'><strong>Note:</strong> The errand runs for a long time, during which no output is returned.</p>
The command returns many lines of output, eventually followed by:
<pre class="terminal">
Bootstrap errand completed
[stderr]
+ echo 'Started bootstrap errand ...'
+ JOB\_DIR=/var/vcap/jobs/bootstrap
+ CONFIG\_PATH=/var/vcap/jobs/bootstrap/config/config.yml
+ /var/vcap/packages/bootstrap/bin/cf-mysql-bootstrap -configPath=/var/vcap/jobs/bootstrap/config/config.yml
+ echo 'Bootstrap errand completed'
+ exit 0
Errand 'bootstrap' completed successfully (exit code 0)
</pre>
1. If the errand fails, run the bootstrap errand command again after a
few minutes. The bootstrap errand might not work the first time.
1. If the errand fails after several tries, bootstrap your cluster manually.
See [Bootstrap Manually](#manual-bootstrap) below.
### <a id="vms-terminated"></a>Scenario 2: VMs Terminated or Lost ##
In severe circumstances, such as a power failure, it is possible to lose all your VMs.
You must recreate them before you can begin recovering the cluster.
When MySQL instances are in the ` - ` state, the VMs are lost. The procedures in
this scenario bring the instances from a ` - ` state to a `failing` state.
Then you run the bootstrap errand similar to [Scenario 1](#cluster-disrupted)
above and restore configuration.
To recover terminated or lost VMs, do the procedures in the sections below:
1. [Recreate the Missing VMs](#recreate): Bring MySQL instances from a ` - `
state to a `failing` state.
1. [Run the Bootstrap Errand](#run-the-errand): Since your instances are now
in the `failing` state, you continue similarly to
[Scenario 1](#cluster-disrupted) above.
1. [Restore the BOSH Configuration](#reset-deployment): Go back to unignoring all
instances and redeploy. This is a critical and mandatory step.
<p class="warning note"><strong>WARNING:</strong> If you do not set each of your ignored instances to <code>unignore</code>,
your instances are not updated in future deploys. You must perform the procedure in the final section of <em>Scenario 2</em>,
<a href="#reset-deployment">Restore the BOSH Configuration</a>.</p>
#### <a id="recreate"></a> Recreate the Missing VMs
The procedure in this section uses BOSH to recreate the
VMs, install software on them, and try to start the jobs.
The procedure below allows you to do the following:
- Redeploy your cluster while expecting the jobs to fail.
- Instruct BOSH to ignore the state of each instance in your cluster. This allows
BOSH to deploy the software to each instance even if the instance is failing.
To recreate your missing VMs, do the following:
<!-- node or bosh vm -->
1. To SSH into the BOSH VMs, follow the procedure in
[BOSH SSH](https://docs.pivotal.io/pivotalcf/2-3/customizing/trouble-advanced.html#bosh-ssh).
1. If BOSH resurrection is enabled, disable it by running the following command:
```
bosh -e YOUR-ENV update-resurrection off
```
Where `YOUR-ENV` is the name of your environment.
1. To download the current manifest, run the following command:
```
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT manifest > /tmp/manifest.yml
```
Where:
+ `YOUR-ENV` is the name of your environment.
+ `YOUR-DEPLOYMENT` is the name of your deployment.
1. To redeploy deployment, run the following command:
```
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT deploy /tmp/manifest.yml
```
Where:
+ `YOUR-ENV` is the name of your environment.
+ `YOUR-DEPLOYMENT` is the name of your deployment.
<p class="note"><strong>Note:</strong> Expect one of the MySQL VMs to fail. Deploying causes BOSH
to create new VMs and install the software. Forming a cluster is in a
subsequent step.</p>
1. To view the instance GUID of the VM that attempted to start,
run the following command:
```
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
```
Where:
+ `YOUR-ENV` is the name of your environment.
+ `YOUR-DEPLOYMENT` is the name of your deployment.
Record the instance GUID, which is the string after `mysql/` in your
BOSH instances output.
1. To instruct BOSH to ignore each MySQL VM, run the following command:
```
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT ignore mysql/INSTANCE-GUID
```
Where:
+ `YOUR-ENV` is the name of your environment.
+ `YOUR-DEPLOYMENT` is the name of your deployment.
+ `INSTANCE-GUID` is the GUID of your instance you recorded in the above step.
1. Repeat steps 4 through 6 until all instances have attempted to start.
1. If you disabled BOSH resurrection in step 2, to re-enable it,
run the following command:
```
bosh -e YOUR-ENV update-resurrection on
```
Where `YOUR-ENV` is the name of your environment.
1. To confirm that your MySQL instances have gone from the ` - ` state to
the `failing` state, run the following command
```
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
```
Where:
+ `YOUR-ENV` is the name of your environment.
+ `YOUR-DEPLOYMENT` is the name of your deployment.
#### <a id="run-the-errand"></a>Run the Bootstrap Errand
After you recreate the VMs, all instances now have a `failing` process state and have the MySQL code.
You must run the bootstrap errand to recover the cluster.
To bootstrap, do the following:
1. To run the bootstrap errand, run the following command:
```
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT run-errand bootstrap
```
Where:
+ `YOUR-ENV` is the name of your environment.
+ `YOUR-DEPLOYMENT` is the name of your deployment.
<p class='note'><strong>Note:</strong> The errand runs for a long time,
during which no output is returned.</p>
The command returns many lines of output,
eventually with the following successful output:
<pre class="terminal">
Bootstrap errand completed
[stderr]
echo 'Started bootstrap errand ...'
JOB\_DIR=/var/vcap/jobs/bootstrap
CONFIG\_PATH=/var/vcap/jobs/bootstrap/config/config.yml
/var/vcap/packages/bootstrap/bin/cf-mysql-bootstrap -configPath=/var/vcap/jobs/bootstrap/config/config.yml
echo 'Bootstrap errand completed'
exit 0
Errand 'bootstrap' completed successfully (exit code 0)
</pre>
1. If the errand fails, run the bootstrap errand command again after a few minutes.
The bootstrap errand might not work immediately.
1. See that the errand completes successfully in the shell output and continue to
[Restore the BOSH Configuration](#reset-deployment) below.
<p class="note"><strong>Note:</strong> After you complete the bootstrap
errand, you might still see instances in the <code>failing</code> state.
Continue to the next section anyway.</p>
#### <a id="reset-deployment"></a>Restore the BOSH Configuration
<p class="note warning"><strong>WARNING:</strong> If you do not set each of your ignored instances
to <code>unignore</code>, your instances are never updated in future deploys.</p>
To restore your BOSH configuration to its previous state, this procedure
unignores each instance that was previously ignored:
1. For each ignored instance, run the following command:
```
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT unignore mysql/INSTANCE-GUID
```
Where:
+ `YOUR-ENV` is the name of your environment.
+ `YOUR-DEPLOYMENT` is the name of your deployment.
+ `INSTANCE-GUID` is the GUID of your instance.
1. To redeploy your deployment, run the following command:
```
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT deploy
```
Where:
+ `YOUR-ENV` is the name of your environment.
+ `YOUR-DEPLOYMENT` is the name of your deployment.
1. To validate that all `mysql` instances are in a `running` state, run the
following command:
```
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT instances
```
Where:
+ `YOUR-ENV` is the name of your environment.
+ `YOUR-DEPLOYMENT` is the name of your deployment.
## <a id="manual-bootstrap"></a>Bootstrap Manually
If the bootstrap errand is not able to automatically recover the cluster,
you might need to do the steps manually.
Do the procedures in the sections below to manually bootstrap your cluster.
<p class="note warning"><strong>WARNING</strong>: The following procedures are
prone to user-error
and can result in lost data if followed incorrectly.
Follow the procedure in <a href="#assisted-bootstrap">Bootstrap with the BOSH
Errand</a> above first,
and only resort to the manual process if the errand fails to repair the
cluster.</p>
### <a id="shut-down"></a>Shut Down MySQL
Do the following for each node in the cluster:
1. To SSH into the node, follow the procedure in
[BOSH SSH](https://docs.pivotal.io/pivotalcf/2-3/customizing/trouble-advanced.html#bosh-ssh).
1. To shut down the `mysqld` process on the node, run the following command:
```
monit stop galera-init
```
Re-bootstrapping the cluster is not successful unless you shut down the
`mysqld` process on all nodes in the cluster.
### <a id="choose-node"></a> Choose Node to Bootstrap
<% if vars.product_short == "PAS" %>
To avoid losing data, you must bootstrap from a node in the cluster that has
the highest transaction sequence number (`seqno`).
For each node in the cluster, do the following to find its `seqno`:
1. SSH into the node, using the procedure in
[BOSH SSH](https://docs.pivotal.io/pivotalcf/2-3/customizing/trouble-advanced.html#bosh-ssh).
1. Run the following command to find `seqno` values in the node's Galera state file, `grastate.dat`:
```
cat /var/vcap/store/pxc-mysql/grastate.dat | grep 'seqno:'
```
1. Do one of the following, based on the last and highest `seqno` value in the `grep` output above:
- **If `seqno` is a positive integer**, then the node shut down gracefully.
Record this number for comparison with the latest `seqno` of other nodes in the cluster.
- **If `seqno` is `-1`**, then node crashed or was killed.
Proceed as follows to possibly recover the `seqno` from the database:
1. Run the following command to temporarily start the database and append the last sequence number to its error log:
```
/var/vcap/packages/pxc/bin/mysqld --defaults-file=/var/vcap/jobs/pxc-mysql/config/my.cnf --wsrep-recover
```
1. Scan the end of the error log `mysql.err.log` for the recovered `seqno`.
It appears after the group ID (a UUID) and a colon.
For example, it is `15026` in the following output:
<pre class="terminal">
$ grep "Recovered position" /var/vcap/sys/log/pxc-mysql/mysql.err.log | tail -1
150225 18:09:42 mysqld_safe WSREP: Recovered position e93955c7-b797-11e4-9faa-9a6f0b73eb46:15026
</pre>
1. If the node never connected to the cluster before crashing, it may never have been
assigned a group ID. In this case, there is nothing to recover.
Do not choose this node for bootstrapping unless all the other nodes also nodes crashed.
4. After you determine the `seqno` for all nodes in your cluster, identify the one with the highest `seqno`.
If multiple nodes share the same highest `seqno`, and it is not `-1`, you can bootstrap from any of them.
<% else %>
To choose the node to bootstrap, you must find the node with the highest
transaction `sequence_number`.
For each node in the cluster do the following:
1. To SSH into the node, follow the procedure in
[BOSH SSH](https://docs.pivotal.io/pivotalcf/2-3/customizing/trouble-advanced.html#bosh-ssh).
1. To view the sequence number for a node, run the following command:
```
/var/vcap/jobs/mysql/bin/get-sequence-number
```
When prompted confirm that you want to stop MySQL.
For example:
<pre class="terminal">
$ /var/vcap/jobs/mysql/bin/get-sequence-number
This script stops mysql. Are you sure? (y/n): y
{"sequence_number":421,"instance_id":"012abcde-f34g-567h-ijk8-9123l4567891"}
</pre>
1. Record the value of `sequence_number`.
After determining the `sequence_number` for all nodes in your cluster,
identify the node with the highest `sequence_number`.
If all nodes have the same `sequence_number`, you can choose any node as the
new bootstrap node.
<% end %>
### <a id="set-bootstrap-node"></a>Bootstrap the First Node ###
After determining the node with the highest `sequence_number`, do the following
to bootstrap the node:
<p class="note"><strong>Note:</strong> Only run these bootstrap commands on the
node with the highest <code>sequence_number</code>.
Otherwise the node with the highest <code>sequence_number</code> is unable
to join the new cluster unless its data is abandoned.
Its <code>mysqld</code> process exits with an error.
For more information about intentionally abandoning data,
see <a href="https://docs.pivotal.io/p-mysql/1-10/architecture.html">Architecture</a>.</p>
1. To update the state file and restart the
<code>mysqld</code> process on the new bootstrap node, run the following commands:
<pre><code>echo -n "NEEDS_BOOTSTRAP" > /var/vcap/store/pxc-mysql/state.txt
monit start galera-init</code></pre>
1. It can take up to ten minutes for <code>monit</code> to start the
<code>mysqld </code> process.
To check if the <code>mysqld</code> process has started successfully,
run the following command:
<pre><code>watch monit summary</code></pre>
### <a id="restart-nodes"></a>Restart Remaining Nodes ###
After the bootstrapped node is running, to restart the nodes do the following:
1. To start the <code>mysqld</code> process with <code>monit</code>,
run the following command:
<pre><code>monit start galera-init</code></pre>
If the node is prevented from starting by the
[Interruptor](https://docs.pivotal.io/p-mysql/1-10/interruptor.html),
do the manual procedure to force the node to rejoin the cluster,
documented in
[How to Manually Force a MySQL Node to Rejoin if a Node Cannot Rejoin the Cluster for P-MySQL 1.9 and 1.10](https://community.pivotal.io/s/article/How-to-Manually-Force-a-MySQL-Node-to-Rejoin-if-a-Node-Cannot-Rejoin-the-Cluster-for-P-MySQL-19-and-110) in the Pivotal Support knowledge base.
<p class="note warning"><strong>WARNING</strong>: Forcing a node to rejoin
the cluster is a destructive procedure.
Only do the procedure with the assistance of
<a href="https://support.pivotal.io">Pivotal Support</a>.</p>
1. If the `monit start` command fails, it might be because the node with
the highest `sequence_number` is `mysql/0`. <br><br>
In this case, do the following:
1. To make BOSH ignore updating `mysql/0`, run the following command:
```
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT ignore mysql/0
```
Where:
+ `YOUR-ENV` is the name of your environment.
+ `YOUR-DEPLOYMENT` is the name of your deployment.
1. Navigate to Ops Manager in a browser, log in, and click **Apply Changes**.
1. When the deploy finishes, run the following command from the Ops Manager VM:
```
bosh -e YOUR-ENV -d YOUR-DEPLOYMENT unignore mysql/0
```
Where:
+ `YOUR-ENV` is the name of your environment.
+ `YOUR-DEPLOYMENT` is the name of your deployment.
1. To verify that the new nodes have successfully joined the cluster,
SSH into the bootstrap node and run the following command to output the
total number of nodes in the cluster:
<pre><code>mysql> SHOW STATUS LIKE 'wsrep_cluster_size';</code></pre>