| 1 | +Automatic repair triggered by sabakan |
| 2 | +===================================== |
| 3 | + |
| 4 | +[Sabakan][sabakan] is management software for server machines in a data center. |
| 5 | +It stores the status information of machines as well as their spec information. |
| 6 | +By referring to machines' status information in sabakan, CKE can initiate the repair of a non-healthy machine. |
| 7 | + |
| 8 | +This functionality is similar to [sabakan integration](sabakan-integration.md). |
| 9 | + |
| 10 | +How it works |
| 11 | +------------ |
| 12 | + |
| 13 | +CKE periodically queries sabakan to retrieve machines' status information in a data center. |
| 14 | +If CKE finds non-healthy machines, it creates [repair queue entries](repair.md) for those machines. |
| 15 | + |
| 16 | +The fields of a repair queue entry are determined based on the [information of the non-healthy machine](https://github.com/cybozu-go/sabakan/blob/main/docs/machine.md). |
| 17 | +* `address`: `.spec.ipv4[0]` |
| 18 | +* `machine_type`: `.spec.bmc.type` |
| 19 | +* `operation`: `.status.state` |
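
The field mapping above can be sketched in Go as follows. This is a minimal illustration, not CKE's actual code: the `Machine` and `RepairEntry` types and the `newRepairEntry` helper are hypothetical names, the structs model only the fields listed above, and the sample values in `main` are placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// Machine is a hypothetical, trimmed-down view of a sabakan machine.
// Only the fields referenced by the mapping are modelled.
type Machine struct {
	Spec struct {
		IPv4 []string
		BMC  struct{ Type string }
	}
	Status struct{ State string }
}

// RepairEntry is a hypothetical stand-in for a repair queue entry.
type RepairEntry struct {
	Address     string // from .spec.ipv4[0]
	MachineType string // from .spec.bmc.type
	Operation   string // from .status.state
}

// newRepairEntry derives the entry fields from the machine information.
// It returns an error when the machine has no IPv4 address, because
// .spec.ipv4[0] would not exist in that case.
func newRepairEntry(m Machine) (RepairEntry, error) {
	if len(m.Spec.IPv4) == 0 {
		return RepairEntry{}, errors.New("machine has no IPv4 address")
	}
	return RepairEntry{
		Address:     m.Spec.IPv4[0],
		MachineType: m.Spec.BMC.Type,
		Operation:   m.Status.State,
	}, nil
}

func main() {
	var m Machine
	m.Spec.IPv4 = []string{"10.0.0.1"} // placeholder address
	m.Spec.BMC.Type = "example-bmc"    // placeholder BMC type
	m.Status.State = "unhealthy"       // placeholder state value
	e, err := newRepairEntry(m)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", e)
}
```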
| 20 | + |
| 21 | +Users can configure the query that selects non-healthy machines.
| 22 | +The query is executed via the sabakan [GraphQL `searchMachines`](https://github.com/cybozu-go/sabakan/blob/master/docs/graphql.md) API.
| 23 | + |
| 24 | +Query |
| 25 | +----- |
| 26 | + |
| 27 | +CKE uses the following GraphQL query to retrieve machine information from sabakan. |
| 28 | + |
| 29 | +```graphql
| 30 | +query ckeSearch($having: MachineParams, $notHaving: MachineParams) { |
| 31 | + searchMachines(having: $having, notHaving: $notHaving) { |
| 32 | + # snip |
| 33 | + } |
| 34 | +} |
| 35 | +``` |
| 36 | + |
| 37 | +The following values are used for `$having` and `$notHaving` variables by default. |
| 38 | +Users can change these values by [specifying a JSON object](ckecli.md#ckecli-auto-repair-set-variables-file). |
| 39 | + |
| 40 | +```json |
| 41 | +{ |
| 42 | + "having": { |
| 43 | + "states": ["UNHEALTHY", "UNREACHABLE"] |
| 44 | + }, |
| 45 | + "notHaving": { |
| 46 | + "roles": ["boot"] |
| 47 | + } |
| 48 | +} |
| 49 | +``` |
| 50 | + |
| 51 | +The type of `$having` and `$notHaving` is `MachineParams`. |
| 52 | +Consult [GraphQL schema][schema] for the definition of `MachineParams`. |
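
As a rough illustration of how such a variables object could be decoded and checked, here is a minimal Go sketch. The `MachineParams` struct below is a hypothetical subset that models only `states` and `roles`; the real type has more fields, so consult the schema before relying on it. `QueryVariables` and `parseVariables` are also names made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MachineParams models only the two fields used by the default values.
// The real GraphQL type defines additional fields.
type MachineParams struct {
	States []string `json:"states,omitempty"`
	Roles  []string `json:"roles,omitempty"`
}

// QueryVariables holds the $having and $notHaving variables.
type QueryVariables struct {
	Having    *MachineParams `json:"having,omitempty"`
	NotHaving *MachineParams `json:"notHaving,omitempty"`
}

// defaultVariablesJSON repeats the default values shown above.
const defaultVariablesJSON = `{
  "having": {"states": ["UNHEALTHY", "UNREACHABLE"]},
  "notHaving": {"roles": ["boot"]}
}`

// parseVariables decodes a JSON object into QueryVariables.
func parseVariables(data []byte) (QueryVariables, error) {
	var v QueryVariables
	if err := json.Unmarshal(data, &v); err != nil {
		return QueryVariables{}, fmt.Errorf("invalid query variables: %w", err)
	}
	return v, nil
}

func main() {
	v, err := parseVariables([]byte(defaultVariablesJSON))
	if err != nil {
		panic(err)
	}
	out, err := json.MarshalIndent(v, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```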
| 53 | + |
| 54 | +Enqueue limiters |
| 55 | +---------------- |
| 56 | + |
| 57 | +### Limiter for a single machine |
| 58 | + |
| 59 | +In order not to repeat repair operations too quickly for a single unstable machine, CKE checks recent repair queue entries before enqueueing. |
| 60 | +If it finds a recent entry for the machine in question, no matter whether the entry has finished or not, it refrains from creating an additional entry. |
| 61 | + |
| 62 | +For simplicity, CKE treats every persisting queue entry as "recent".
| 63 | +A user should delete a finished repair queue entry for a machine once they consider the machine repaired. |
| 64 | +* If a repair queue entry has finished with success and a user considers the machine stable, they should delete the finished entry. |
| 65 | +* If a repair queue entry has finished with failure or a user considers the machine unstable, they should repair the machine manually. After the machine gets repaired, they should delete the finished entry. |
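
The per-machine check described above amounts to a lookup by address. The following Go sketch shows the idea under the assumption that entries are matched by their `address` field; `QueueEntry` and `shouldEnqueue` are hypothetical names, not CKE's API.

```go
package main

import "fmt"

// QueueEntry is a hypothetical, minimal repair queue entry.
type QueueEntry struct {
	Address  string
	Finished bool
}

// shouldEnqueue reports whether a new entry may be created for address.
// Any persisting entry for the same address blocks enqueueing,
// whether it has finished or not.
func shouldEnqueue(queue []QueueEntry, address string) bool {
	for _, e := range queue {
		if e.Address == address {
			return false
		}
	}
	return true
}

func main() {
	queue := []QueueEntry{
		{Address: "10.0.0.1", Finished: true},
		{Address: "10.0.0.2", Finished: false},
	}
	for _, addr := range []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"} {
		fmt.Println(addr, shouldEnqueue(queue, addr))
	}
}
```

Deleting the finished entry for a machine is what makes that machine eligible for automatic repair again.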
| 66 | + |
| 67 | +### Limiter for a cluster |
| 68 | + |
| 69 | +Sabakan may occasionally report false-positive non-healthy machines. |
| 70 | +If CKE trusted every failure report and initiated many repair operations at once, the Kubernetes cluster would get stuck or, worse, become corrupted.
| 71 | + |
| 72 | +Even when the failure reports are correct, CKE should refrain from repairing too many machines at once.
| 73 | +For example, the failure of many servers might be caused by the temporary power failure of a whole server rack. |
| 74 | +In that case, CKE should not mark the machines unrepairable as a result of pointless repair operations. |
| 75 | +Once the machines are marked unrepairable, sabakan will delete all data on those machines. |
| 76 | + |
| 77 | +In order not to initiate too many repair operations, CKE adds the number of recent repair queue entries to the number of new failure reports before enqueueing.
| 78 | +If the total exceeds the configured maximum, it refrains from creating additional entries, no matter whether the existing entries have finished or not.
| 79 | + |
| 80 | +The maximum number of recent repair queue entries and new failure reports is [configurable](ckecli.md#ckecli-constraints-set-name-value) as a [constraint `maximum-repair-queue-entries`](constraints.md). |
| 81 | + |
| 82 | +As stated above, CKE considers all persisting queue entries as "recent" for simplicity. |
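
The cluster-wide check can be sketched as a simple sum against the configured maximum. This is only an illustration: `allowEnqueue` is a hypothetical name, and treating the limit as inclusive (a total equal to the maximum is still allowed) is an assumption that the text above does not settle.

```go
package main

import "fmt"

// allowEnqueue reports whether new entries may be created.
// persisting is the number of repair queue entries, finished or not;
// newReports is the number of newly reported non-healthy machines;
// max stands for the value of the maximum-repair-queue-entries constraint.
// Assumption: a total equal to max is still allowed.
func allowEnqueue(persisting, newReports, max int) bool {
	return persisting+newReports <= max
}

func main() {
	const max = 3 // placeholder value, not a documented default
	fmt.Println(allowEnqueue(1, 2, max))
	fmt.Println(allowEnqueue(2, 2, max))
}
```

With this shape, a burst of reports such as a whole rack losing power pushes the total over the limit, so nothing is enqueued until an operator intervenes.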
| 83 | + |
| 84 | + |
| 85 | +[sabakan]: https://github.com/cybozu-go/sabakan |
| 86 | +[schema]: https://github.com/cybozu-go/sabakan/blob/main/gql/graph/schema.graphqls |