Wait for attempts, attempt automatic recovery #39
Next time, please squash the commits. For internal Gerrit review, I have to squash them on my own.
Sure thing, I thought I was making things easier by splitting them.
Hi,
I tried to address the Gerrit comments in this PR, at least until I find a way to create an account in MCP Gerrit.
1. maas.py: Extend wait_for states with timeout param

   Extend the wait_for states with a timeout parameter. The timeout value is taken from reclass pillar data if defined. Otherwise, the states use the default value.
   Based on Ting's PR [1], slightly refactored.

2. maas.py: Extend `req_status` support to multiple values

   Previously, req_status could be one of the MaaS status strings, e.g. 'Ready'. Extend matching to '|'-separated statuses (e.g. 'Ready|Deployed') to allow idempotency in MaaS machine commissioning and deployment cycles.
   Also provide a `maas.machines.wait_for_ready_or_deployed` sls.

3. maas.py: wait_for_*: Add attempts arg

   Introduce a new parameter that sets the maximum number of automatic recovery attempts for the common failures w/ machine operations. If not present in pillar data, it defaults to 0 (OFF).

   Common error states, possible cause and automatic recovery pattern:
   * New
     - usually indicates issues with BMC connectivity (no network route, but on rare occasions it happens due to MaaS API being flaky);
     - fix: delete the machine, (re)process machine definitions;
   * Failed commissioning
     - various causes, usually a simple retry works;
     - fix: delete the machine, (re)process machine definitions;
   * Failed testing
     - incompatible hardware, missing drivers etc.;
     - usually consistent and board-specific;
     - fix: override failed testing;
   * Allocated
     - on rare occasions nodes get stuck in this state instead of 'Deploy';
     - fix: mark-broken, mark-fixed, if it failed at least once before perform a fio test (fixes another unrelated spurious issue with encrypted disks from previous deployments), (re)deploy machines;
   * Failed deployment
     - various causes, usually a simple retry works;
     - fix: same as for nodes stuck in 'Allocated';

[1] salt-formulas#34

Change-Id: Ifb7dd9f8fcfbbed557e47d8fdffb1f963604fb15
Signed-off-by: ting wu <[email protected]>
Signed-off-by: Alexandru Avadanii <[email protected]>
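For readers unfamiliar with the pattern, the '|'-separated `req_status` matching and the timeout-bounded wait described in the commit message can be sketched roughly as follows. This is only an illustration, not the code from this PR: `status_matches`, `wait_for`, `get_status`, `poll`, `sleep` and `clock` are hypothetical names, and the real maas.py reads its timeout from pillar data rather than from a function argument.

```python
import time


def status_matches(current, req_status):
    """Return True if `current` equals any of the '|'-separated statuses."""
    wanted = [s.strip() for s in req_status.split('|')]
    return current in wanted


def wait_for(get_status, req_status, timeout=60, poll=1.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll `get_status()` until it matches `req_status` or `timeout` expires.

    `get_status` is a hypothetical callable returning the machine's current
    status string; in a real module this would query the MaaS API.
    """
    deadline = clock() + timeout
    while True:
        current = get_status()
        if status_matches(current, req_status):
            return current
        if clock() >= deadline:
            raise TimeoutError(
                "status %r did not reach %r within %s seconds"
                % (current, req_status, timeout))
        sleep(poll)
```

Accepting 'Ready|Deployed' is what makes a re-run idempotent in this sketch: a machine that was already deployed by a previous run satisfies the wait immediately instead of timing out.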
This is a collection of MaaS commissioning/deploy workarounds that we collected in OPNFV over the last year on a pool of diverse servers (x86_64, AArch64, different vendors/boards etc.).
I know maas.py is slowly being obsoleted, but this PR would help Fuel@OPNFV drop some ugly workarounds (e.g. [1, 2]) until maasng supports something similar to the wait_for_* functions.
cc @mpolenchuk
[1] https://github.com/opnfv/fuel/blob/stable/fraser/mcp/config/states/maas#L20-L66
[2] https://gerrit.opnfv.org/gerrit/gitweb?p=fuel.git;a=commitdiff;h=33ac2d8a4e1cb5383f794c88754edc0492004992#patch1