Commit 6370e2e

Merge pull request #725 from cybozu-go/sabakan-repair
Add sabakan-triggered automatic repair functionality
2 parents c78c22e + 177b5f2 commit 6370e2e

24 files changed: +736 -58 lines changed

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -5,6 +5,14 @@ This project employs a versioning scheme described in [RELEASE.md](RELEASE.md#ve
55

66
## [Unreleased]
77

8+
### Added
9+
10+
- Add sabakan-triggered automatic repair functionality in [#725](https://github.com/cybozu-go/cke/pull/725)
11+
12+
### Fixed
13+
14+
- Fix Sabakan integration so that unassigned query parameters are not sent in [#725](https://github.com/cybozu-go/cke/pull/725)
15+
816
## [1.28.0]
917

1018
### Changed

constraints.go

Lines changed: 2 additions & 0 deletions
@@ -8,6 +8,7 @@ type Constraints struct {
88
MinimumWorkers int `json:"minimum-workers"`
99
MaximumWorkers int `json:"maximum-workers"`
1010
RebootMaximumUnreachable int `json:"maximum-unreachable-nodes-for-reboot"`
11+
MaximumRepairs int `json:"maximum-repair-queue-entries"`
1112
}
1213

1314
// Check checks the cluster satisfies the constraints
@@ -41,5 +42,6 @@ func DefaultConstraints() *Constraints {
4142
MinimumWorkers: 1,
4243
MaximumWorkers: 0,
4344
RebootMaximumUnreachable: 0,
45+
MaximumRepairs: 0,
4446
}
4547
}

docs/ckecli.md

Lines changed: 28 additions & 2 deletions
@@ -67,6 +67,11 @@ $ ckecli [--config FILE] <subcommand> args...
6767
- [`ckecli sabakan get-template`](#ckecli-sabakan-get-template)
6868
- [`ckecli sabakan set-variables FILE`](#ckecli-sabakan-set-variables-file)
6969
- [`ckecli sabakan get-variables`](#ckecli-sabakan-get-variables)
70+
- [`ckecli auto-repair`](#ckecli-auto-repair)
71+
- [`ckecli auto-repair enable|disable`](#ckecli-auto-repair-enabledisable)
72+
- [`ckecli auto-repair is-enabled`](#ckecli-auto-repair-is-enabled)
73+
- [`ckecli auto-repair set-variables FILE`](#ckecli-auto-repair-set-variables-file)
74+
- [`ckecli auto-repair get-variables`](#ckecli-auto-repair-get-variables)
7075
- [`ckecli status`](#ckecli-status)
7176

7277
## `ckecli cluster`
@@ -91,6 +96,7 @@ Set a constraint on the cluster configuration.
9196
- `minimum-workers`
9297
- `maximum-workers`
9398
- `maximum-unreachable-nodes-for-reboot`
99+
- `maximum-repair-queue-entries`
94100

95101
### `ckecli constraints show`
96102

@@ -408,12 +414,32 @@ Get the cluster configuration template.
408414

409415
### `ckecli sabakan set-variables FILE`
410416

411-
Set the query variables to search machines in sabakan.
417+
Set the query variables to search available machines in sabakan.
412418
`FILE` should contain JSON as described in [sabakan integration](sabakan-integration.md#variables).
413419

414420
### `ckecli sabakan get-variables`
415421

416-
Get the query variables to search machines in sabakan.
422+
Get the query variables to search available machines in sabakan.
423+
424+
## `ckecli auto-repair`
425+
426+
### `ckecli auto-repair enable|disable`
427+
428+
Enable/Disable [sabakan-triggered automatic repair](sabakan-triggered-repair.md).
429+
430+
### `ckecli auto-repair is-enabled`
431+
432+
Show whether sabakan-triggered automatic repair is enabled or disabled.
433+
It displays `true` or `false`.
434+
435+
### `ckecli auto-repair set-variables FILE`
436+
437+
Set the query variables to search non-healthy machines in sabakan.
438+
`FILE` should contain JSON as described in [sabakan-triggered automatic repair](sabakan-triggered-repair.md#query).
439+
440+
### `ckecli auto-repair get-variables`
441+
442+
Get the query variables to search non-healthy machines in sabakan.
417443

418444
## `ckecli status`
419445

docs/constraints.md

Lines changed: 1 addition & 0 deletions
@@ -12,3 +12,4 @@ Cluster should satisfy these constraints.
1212
| `minimum-workers` | int | 1 | The minimum number of worker nodes |
1313
| `maximum-workers` | int | 0 | The maximum number of worker nodes. 0 means unlimited. |
1414
| `maximum-unreachable-nodes-for-reboot` | int | 0 | The maximum number of unreachable nodes allowed for operating reboot. |
15+
| `maximum-repair-queue-entries` | int | 0 | The maximum number of repair queue entries |

docs/sabakan-integration.md

Lines changed: 2 additions & 2 deletions
@@ -345,7 +345,7 @@ Following Machine fields are translated to Node annotations:
345345

346346

347347
[sabakan]: https://github.com/cybozu-go/sabakan
348-
[schema]: https://github.com/cybozu-go/sabakan/blob/master/gql/schema.graphql
348+
[schema]: https://github.com/cybozu-go/sabakan/blob/main/gql/graph/schema.graphqls
349349
[taint]: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
350-
[lifecycle]: https://github.com/cybozu-go/sabakan/blob/master/docs/lifecycle.md#transition-diagram
350+
[lifecycle]: https://github.com/cybozu-go/sabakan/blob/main/docs/lifecycle.md#transition-diagram
351351
[well-known taints]: https://kubernetes.io/docs/reference/labels-annotations-taints/

docs/sabakan-triggered-repair.md

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
1+
Automatic repair triggered by sabakan
2+
=====================================
3+
4+
[Sabakan][sabakan] is management software for server machines in a data center.
5+
It stores the status information of machines as well as their spec information.
6+
By referring to machines' status information in sabakan, CKE can initiate the repair of a non-healthy machine.
7+
8+
This functionality is similar to [sabakan integration](sabakan-integration.md).
9+
10+
How it works
11+
------------
12+
13+
CKE periodically queries sabakan to retrieve machines' status information in a data center.
14+
If CKE finds non-healthy machines, it creates [repair queue entries](repair.md) for those machines.
15+
16+
The fields of a repair queue entry are determined based on the [information of the non-healthy machine](https://github.com/cybozu-go/sabakan/blob/main/docs/machine.md).
17+
* `address`: `.spec.ipv4[0]`
18+
* `machine_type`: `.spec.bmc.type`
19+
* `operation`: `.status.state`
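
The field mapping above can be sketched in Go as follows. This is only an illustration: `Machine`, `RepairQueueEntry`, and `newRepairEntry` are hypothetical names introduced here, not CKE's actual types or functions, and the sketch copies `.status.state` verbatim because the text does not say whether CKE normalizes it.

```go
package main

import (
	"errors"
	"fmt"
)

// Machine is a hypothetical, trimmed-down view of a sabakan machine record.
// Only the fields used by the mapping are modeled.
type Machine struct {
	Spec struct {
		IPv4 []string
		BMC  struct{ Type string }
	}
	Status struct{ State string }
}

// RepairQueueEntry is a hypothetical stand-in holding the three fields
// named in the list above.
type RepairQueueEntry struct {
	Address     string
	MachineType string
	Operation   string
}

// newRepairEntry applies the mapping:
// address <- .spec.ipv4[0], machine_type <- .spec.bmc.type, operation <- .status.state.
func newRepairEntry(m Machine) (RepairQueueEntry, error) {
	if len(m.Spec.IPv4) == 0 {
		return RepairQueueEntry{}, errors.New("machine has no IPv4 address")
	}
	return RepairQueueEntry{
		Address:     m.Spec.IPv4[0],
		MachineType: m.Spec.BMC.Type,
		Operation:   m.Status.State,
	}, nil
}

func main() {
	var m Machine
	m.Spec.IPv4 = []string{"10.0.0.1"}
	m.Spec.BMC.Type = "example-bmc" // made-up value for illustration
	m.Status.State = "UNHEALTHY"

	entry, err := newRepairEntry(m)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", entry)
}
```

The guard on an empty `IPv4` slice is a defensive choice in this sketch, since indexing `[0]` on a machine without addresses would otherwise panic.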
20+
21+
Users can configure the query to choose non-healthy machines.
22+
The query is executed via sabakan's [GraphQL `searchMachines`](https://github.com/cybozu-go/sabakan/blob/master/docs/graphql.md) API.
23+
24+
Query
25+
-----
26+
27+
CKE uses the following GraphQL query to retrieve machine information from sabakan.
28+
29+
```graphql
30+
query ckeSearch($having: MachineParams, $notHaving: MachineParams) {
31+
searchMachines(having: $having, notHaving: $notHaving) {
32+
# snip
33+
}
34+
}
35+
```
36+
37+
The following values are used for `$having` and `$notHaving` variables by default.
38+
Users can change these values by [specifying a JSON object](ckecli.md#ckecli-auto-repair-set-variables-file).
39+
40+
```json
41+
{
42+
"having": {
43+
"states": ["UNHEALTHY", "UNREACHABLE"]
44+
},
45+
"notHaving": {
46+
"roles": ["boot"]
47+
}
48+
}
49+
```
50+
51+
The type of `$having` and `$notHaving` is `MachineParams`.
52+
Consult [GraphQL schema][schema] for the definition of `MachineParams`.
53+
54+
Enqueue limiters
55+
----------------
56+
57+
### Limiter for a single machine
58+
59+
In order not to repeat repair operations too quickly for a single unstable machine, CKE checks recent repair queue entries before enqueueing.
60+
If it finds a recent entry for the machine in question, no matter whether the entry has finished or not, it refrains from creating an additional entry.
61+
62+
CKE considers all persisting queue entries as "recent" for simplicity.
63+
A user should delete a finished repair queue entry for a machine once they consider the machine repaired.
64+
* If a repair queue entry has finished with success and a user considers the machine stable, they should delete the finished entry.
65+
* If a repair queue entry has finished with failure or a user considers the machine unstable, they should repair the machine manually. After the machine gets repaired, they should delete the finished entry.
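
The per-machine limiter described above might look like the following sketch. `RepairQueueEntry` and `hasRecentEntry` are hypothetical names for illustration, not CKE's implementation; the key point is that every entry still present in the queue counts as "recent", whether or not it has finished.

```go
package main

import "fmt"

// RepairQueueEntry is a hypothetical stand-in for a persisted repair queue entry.
type RepairQueueEntry struct {
	Address  string
	Finished bool // true once the repair has ended, with success or failure
}

// hasRecentEntry reports whether any persisting entry targets the given address.
// Finished is deliberately ignored: all persisting entries count as "recent".
func hasRecentEntry(entries []RepairQueueEntry, address string) bool {
	for _, e := range entries {
		if e.Address == address {
			return true
		}
	}
	return false
}

func main() {
	queue := []RepairQueueEntry{{Address: "10.0.0.1", Finished: true}}

	fmt.Println(hasRecentEntry(queue, "10.0.0.1")) // true: a finished entry still blocks enqueueing
	fmt.Println(hasRecentEntry(queue, "10.0.0.2")) // false: no entry exists for this machine
}
```

Deleting the finished entry for `10.0.0.1` from the queue is what makes that machine eligible for automatic repair again.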
66+
67+
### Limiter for a cluster
68+
69+
Sabakan may occasionally report false-positive non-healthy machines.
70+
If CKE trusts every failure report and initiates many repair operations, the Kubernetes cluster may become stuck or, worse, corrupted.
71+
72+
Even when the failure reports are correct, CKE should refrain from repairing too many machines at once.
73+
For example, the failure of many servers might be caused by a temporary power failure of a whole server rack.
74+
In that case, CKE should not mark the machines unrepairable as a result of pointless repair operations.
75+
Once the machines are marked unrepairable, sabakan will delete all data on those machines.
76+
77+
In order not to initiate too many repair operations, CKE checks the number of recent repair queue entries plus the number of new failure reports before enqueueing.
78+
If the total is excessive, no matter whether the entries have finished or not, it refrains from creating additional entries.
79+
80+
The maximum number of recent repair queue entries and new failure reports is [configurable](ckecli.md#ckecli-constraints-set-name-value) as a [constraint `maximum-repair-queue-entries`](constraints.md).
81+
82+
As stated above, CKE considers all persisting queue entries as "recent" for simplicity.
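
The cluster-wide check can be sketched as a simple comparison. The function name `withinClusterLimit` is hypothetical, and the exact comparison (`<=` against the configured maximum) is an assumption of this sketch; the text above only says that CKE refrains when the numbers are excessive.

```go
package main

import "fmt"

// withinClusterLimit reports whether new repair entries may be enqueued.
// persisting is the number of entries already in the queue (finished or not),
// newReports is the number of newly reported non-healthy machines, and
// maximum stands for the maximum-repair-queue-entries constraint.
// Assumption: enqueueing is allowed while the sum does not exceed maximum.
func withinClusterLimit(persisting, newReports, maximum int) bool {
	return persisting+newReports <= maximum
}

func main() {
	fmt.Println(withinClusterLimit(2, 1, 3)) // true: 2 + 1 does not exceed 3
	fmt.Println(withinClusterLimit(2, 2, 3)) // false: 2 + 2 exceeds 3
}
```

Under this assumption, the default value `0` would admit no new entries; whether CKE treats `0` specially is not stated here.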
83+
84+
85+
[sabakan]: https://github.com/cybozu-go/sabakan
86+
[schema]: https://github.com/cybozu-go/sabakan/blob/main/gql/graph/schema.graphqls

mtest/ckecli_test.go

Lines changed: 9 additions & 0 deletions
@@ -137,4 +137,13 @@ func testCKECLI() {
137137
ckecliSafe("sabakan", "enable")
138138
ckecliSafe("sabakan", "get-url")
139139
})
140+
141+
It("should invoke auto-repair subcommand successfully", func() {
142+
ckecliSafe("auto-repair", "is-enabled")
143+
ckecliSafe("auto-repair", "disable")
144+
ckecliSafe("auto-repair", "enable")
145+
f := remoteTempFile(`{"having":{"states":["UNHEALTHY","UNREACHABLE"]},"notHaving":{"roles":["boot"]}}`)
146+
ckecliSafe("auto-repair", "set-variables", f)
147+
ckecliSafe("auto-repair", "get-variables")
148+
})
140149
}

pkg/ckecli/cmd/auto_repair.go

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
package cmd
2+
3+
import (
4+
"github.com/spf13/cobra"
5+
)
6+
7+
// autoRepairCmd represents the auto-repair command
8+
var autoRepairCmd = &cobra.Command{
9+
Use: "auto-repair",
10+
Short: "auto-repair subcommand",
11+
Long: `auto-repair subcommand`,
12+
}
13+
14+
func init() {
15+
rootCmd.AddCommand(autoRepairCmd)
16+
}

pkg/ckecli/cmd/auto_repair_disable.go

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
1+
package cmd
2+
3+
import (
4+
"context"
5+
6+
"github.com/cybozu-go/well"
7+
"github.com/spf13/cobra"
8+
)
9+
10+
var autoRepairDisableCmd = &cobra.Command{
11+
Use: "disable",
12+
Short: "disable sabakan-triggered automatic repair",
13+
Long: `Disable sabakan-triggered automatic repair.`,
14+
15+
RunE: func(cmd *cobra.Command, args []string) error {
16+
well.Go(func(ctx context.Context) error {
17+
return storage.EnableAutoRepair(ctx, false)
18+
})
19+
well.Stop()
20+
return well.Wait()
21+
},
22+
}
23+
24+
func init() {
25+
autoRepairCmd.AddCommand(autoRepairDisableCmd)
26+
}

pkg/ckecli/cmd/auto_repair_enable.go

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
1+
package cmd
2+
3+
import (
4+
"context"
5+
6+
"github.com/cybozu-go/well"
7+
"github.com/spf13/cobra"
8+
)
9+
10+
var autoRepairEnableCmd = &cobra.Command{
11+
Use: "enable",
12+
Short: "enable sabakan-triggered automatic repair",
13+
Long: `Enable sabakan-triggered automatic repair.`,
14+
15+
RunE: func(cmd *cobra.Command, args []string) error {
16+
well.Go(func(ctx context.Context) error {
17+
return storage.EnableAutoRepair(ctx, true)
18+
})
19+
well.Stop()
20+
return well.Wait()
21+
},
22+
}
23+
24+
func init() {
25+
autoRepairCmd.AddCommand(autoRepairEnableCmd)
26+
}
