Wait for attempts, attempt automatic recovery #39

alexandruavadanii · 2018-09-23T14:26:16Z

This is a collection of MaaS commissioning/deploy workarounds that we collected in OPNFV over the last year on a pool of diverse servers (x86_64, AArch64, different vendors/boards etc.).

I know maas.py is slowly being obsoleted, but this PR would help Fuel@OPNFV drop some ugly workarounds (e.g. [1, 2]) until maasng supports something similar to the wait_for_* functions.

cc @mpolenchuk

[1] https://github.com/opnfv/fuel/blob/stable/fraser/mcp/config/states/maas#L20-L66
[2] https://gerrit.opnfv.org/gerrit/gitweb?p=fuel.git;a=commitdiff;h=33ac2d8a4e1cb5383f794c88754edc0492004992#patch1

epcim · 2018-09-24T13:23:02Z

Next time, please squash the commits; for the internal Gerrit review I have to squash them myself.

alexandruavadanii · 2018-09-24T13:24:00Z

Sure thing, I thought I was making things easier by splitting them.

alexandruavadanii · 2018-11-07T16:46:50Z

Hi,
I just saw the comments in Gerrit [1], but I don't have an account there (nor do I know how to create one) to address them.
[1] https://gerrit.mcp.mirantis.com/#/c/26625/

alexandruavadanii · 2018-11-08T14:54:29Z

I tried to address the comments in Gerrit in this PR, at least till I find a way to create an account in MCP Gerrit.

1. maas.py: Extend wait_for states with timeout param

Extend the wait_for states with a timeout parameter. The timeout value is taken from reclass pillar data if defined. Otherwise, the states use the default value.

Based on Ting's PR [1], slightly refactored.

2. maas.py: Extend `req_status` support to multiple values

Previously, req_status could be one of the MaaS status strings, e.g. 'Ready'. Extend matching to '|'-separated statuses (e.g. 'Ready|Deployed') to allow idempotency in MaaS machine commissioning and deployment cycles.

Also provide a `maas.machines.wait_for_ready_or_deployed` sls.

3. maas.py: wait_for_*: Add attempts arg

Introduce a new parameter that sets the maximum number of automatic recovery attempts for the common failures w/ machine operations. If not present in pillar data, it defaults to 0 (OFF).

Common error states, possible causes and automatic recovery pattern:

* New
  - usually indicates issues with BMC connectivity (no network route, but on rare occasions it happens due to the MaaS API being flaky);
  - fix: delete the machine, (re)process machine definitions;
* Failed commissioning
  - various causes, usually a simple retry works;
  - fix: delete the machine, (re)process machine definitions;
* Failed testing
  - incompatible hardware, missing drivers etc.
  - usually consistent and board-specific;
  - fix: override failed testing;
* Allocated
  - on rare occasions nodes get stuck in this state instead of 'Deploy';
  - fix: mark-broken, mark-fixed, if it failed at least once before perform a fio test (fixes another unrelated spurious issue with encrypted disks from previous deployments), (re)deploy machines;
* Failed deployment
  - various causes, usually a simple retry works;
  - fix: same as for nodes stuck in 'Allocated';

[1] salt-formulas#34

Change-Id: Ifb7dd9f8fcfbbed557e47d8fdffb1f963604fb15
Signed-off-by: ting wu <[email protected]>
Signed-off-by: Alexandru Avadanii <[email protected]>
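For readers who have not looked at the diff, here is a minimal sketch of how the three changes described in the commit message could fit together. It is not the code from maas.py: `status_matches`, `wait_for`, `RECOVERY_ACTIONS`, the action names, and the `get_statuses`/`recover` callables are all hypothetical names chosen for illustration, and the default values are placeholders. Unlike the real recovery logic, it does not distinguish a node that is genuinely stuck in 'Allocated' from one that is merely passing through that state.

```python
import time

# Hypothetical mapping of error states to recovery actions, paraphrasing
# the list in the commit message; the action names are illustrative only.
RECOVERY_ACTIONS = {
    'New': 'delete_and_reprocess',
    'Failed commissioning': 'delete_and_reprocess',
    'Failed testing': 'override_failed_testing',
    'Allocated': 'mark_broken_fixed_and_redeploy',
    'Failed deployment': 'mark_broken_fixed_and_redeploy',
}


def status_matches(status, req_status):
    """Return True if status equals any '|'-separated entry of req_status."""
    return status in req_status.split('|')


def wait_for(get_statuses, recover, req_status='Ready', timeout=60,
             attempts=0, poll_interval=1, clock=time.time, sleep=time.sleep):
    """Poll until every machine matches req_status or the timeout expires.

    get_statuses: callable returning {hostname: status_name} (assumed helper)
    recover:      callable(hostname, action) triggering a recovery (assumed)
    attempts:     max recovery attempts per machine; 0 disables recovery
    """
    tries = {}
    deadline = clock() + timeout
    while True:
        pending = {host: status for host, status in get_statuses().items()
                   if not status_matches(status, req_status)}
        if not pending:
            return True
        if clock() >= deadline:
            return False
        for host, status in pending.items():
            action = RECOVERY_ACTIONS.get(status)
            # Only recover known error states, and only while attempts remain.
            if action and tries.get(host, 0) < attempts:
                tries[host] = tries.get(host, 0) + 1
                recover(host, action)
        sleep(poll_interval)
```

With `attempts=0` the loop only waits, which matches the "defaults to 0 (OFF)" behaviour described above; passing `req_status='Ready|Deployed'` lets a rerun succeed on machines that were already deployed.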

alexandruavadanii force-pushed the wait_for-attempts branch 2 times, most recently from bde9ad6 to 7bc152c Compare November 8, 2018 14:47

alexandruavadanii force-pushed the wait_for-attempts branch from 7bc152c to 3a918eb Compare December 13, 2018 15:23

alexandruavadanii force-pushed the wait_for-attempts branch from 3a918eb to b8c10df Compare December 14, 2018 19:17
