Wait for attempts, attempt automatic recovery #39

Open
wants to merge 1 commit into
base: master
from


alexandruavadanii
Contributor

This is a collection of MaaS commissioning/deployment workarounds that we collected in OPNFV over the last year on a pool of diverse servers (x86_64, AArch64, different vendors/boards, etc.).

I know maas.py is slowly being obsoleted, but this PR would help Fuel@OPNFV drop some ugly workarounds (e.g. [1, 2]) until maasng supports something similar to the wait_for_* functions.

cc @mpolenchuk

[1] https://github.com/opnfv/fuel/blob/stable/fraser/mcp/config/states/maas#L20-L66
[2] https://gerrit.opnfv.org/gerrit/gitweb?p=fuel.git;a=commitdiff;h=33ac2d8a4e1cb5383f794c88754edc0492004992#patch1

@epcim
Member

epcim commented Sep 24, 2018

Next time, please squash the commits. For internal Gerrit review, I have to squash them on my own.

@alexandruavadanii
Contributor Author

Sure thing, I thought I was making things easier by splitting them.

@alexandruavadanii
Contributor Author

Hi,
I just saw the comments in Gerrit [1], but I don't have an account there (nor do I know how to create one) to address them.
[1] https://gerrit.mcp.mirantis.com/#/c/26625/

@alexandruavadanii alexandruavadanii force-pushed the wait_for-attempts branch 2 times, most recently from bde9ad6 to 7bc152c Compare November 8, 2018 14:47
@alexandruavadanii
Contributor Author

I tried to address the Gerrit comments in this PR, at least until I find a way to create an account in MCP Gerrit.

1. maas.py: Extend wait_for states with timeout param

Extend the wait_for states with a timeout parameter.
The timeout value is taken from reclass pillar data if
defined. Otherwise, the states use the default value.
Based on Ting's PR [1], slightly refactored.
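
A minimal sketch of the timeout behaviour described above, not the code from this PR. The names `DEFAULT_TIMEOUT`, `resolve_timeout`, `wait_for` and the `get_statuses` callable are hypothetical, and the default of 90 seconds is an assumed value chosen only for illustration:

```python
import time

DEFAULT_TIMEOUT = 90  # hypothetical default, in seconds


def resolve_timeout(pillar_value, default=DEFAULT_TIMEOUT):
    """Use the pillar-provided timeout when defined, else the default."""
    return default if pillar_value is None else int(pillar_value)


def wait_for(get_statuses, req_status, timeout, poll_interval=5,
             clock=time.monotonic, sleep=time.sleep):
    """Poll until every machine reports req_status or the timeout expires.

    get_statuses is a hypothetical callable returning {hostname: status}.
    Returns True on success, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        statuses = get_statuses()
        if all(status == req_status for status in statuses.values()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

Passing `clock` and `sleep` as arguments is only a convenience so the loop can be exercised without real delays.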

2. maas.py: Extend `req_status` support to multiple values

Previously, req_status could be one of the MaaS status strings, e.g.
'Ready'. Extend matching to '|'-separated statuses (e.g.
'Ready|Deployed') to allow idempotency in MaaS machine commissioning
and deployment cycles.

Also provide a `maas.machines.wait_for_ready_or_deployed` sls.
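
One way the '|'-separated matching could look, as a sketch only (`status_matches` is a hypothetical helper, and the actual change in maas.py may match differently):

```python
def status_matches(status, req_status):
    """Return True if status equals one of the '|'-separated values.

    Exact comparison is used on purpose: a substring check would also
    accept unrelated states that merely contain an accepted name.
    """
    accepted = {value.strip() for value in req_status.split("|")}
    return status in accepted
```

With this, `status_matches("Deployed", "Ready|Deployed")` holds for a machine that was already deployed by an earlier run, which is what makes re-running the state idempotent, while a plain `"Ready"` requirement keeps its previous meaning.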

3. maas.py: wait_for_*: Add attempts arg

Introduce a new parameter that allows a maximum number of automatic
recovery attempts for the common failures w/ machine operations.
If not present in pillar data, it defaults to 0 (OFF).

Common error states, possible cause and automatic recovery pattern:
* New
  - usually indicates issues with BMC connectivity (no network route),
    but on rare occasions it happens due to the MaaS API being flaky;
  - fix: delete the machine, (re)process machine definitions;
* Failed commissioning
  - various causes, usually a simple retry works;
  - fix: delete the machine, (re)process machine definitions;
* Failed testing
  - incompatible hardware, missing drivers, etc.;
  - usually consistent and board-specific;
  - fix: override failed testing;
* Allocated
  - on rare occasions nodes get stuck in this state instead of 'Deploy';
  - fix: mark-broken, mark-fixed; if it failed at least once before,
    perform a fio test (fixes another unrelated spurious issue with
    encrypted disks from previous deployments); then (re)deploy machines;
* Failed deployment
  - various causes, usually a simple retry works;
  - fix: same as for nodes stuck in 'Allocated';
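
The recovery pattern above can be summarised as a lookup from error state to an ordered list of steps. This is an illustrative sketch, not the implementation in this PR: the action names and `plan_recovery` are hypothetical, and "failed at least once before" is assumed to mean that at least one recovery attempt was already used for that machine:

```python
# Hypothetical action names; each would map to a MaaS operation in practice.
RECOVERY_ACTIONS = {
    "New": ["delete", "process_machines"],
    "Failed commissioning": ["delete", "process_machines"],
    "Failed testing": ["override_failed_testing"],
    "Allocated": ["mark_broken", "mark_fixed", "deploy"],
    "Failed deployment": ["mark_broken", "mark_fixed", "deploy"],
}


def plan_recovery(status, attempts_used, attempts=0):
    """Return the ordered recovery steps for a machine, or [] to give up.

    attempts defaults to 0, i.e. automatic recovery is OFF.
    """
    if attempts_used >= attempts:
        return []  # recovery disabled or attempts exhausted
    actions = list(RECOVERY_ACTIONS.get(status, []))
    if "deploy" in actions and attempts_used >= 1:
        # After a previous failed attempt, run a fio test before redeploying.
        actions.insert(actions.index("deploy"), "fio_test")
    return actions
```

For example, `plan_recovery("Allocated", 1, attempts=3)` would yield mark-broken, mark-fixed, the fio test and then the redeploy, while any status outside the table yields no action.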

[1] salt-formulas#34

Change-Id: Ifb7dd9f8fcfbbed557e47d8fdffb1f963604fb15
Signed-off-by: ting wu <[email protected]>
Signed-off-by: Alexandru Avadanii <[email protected]>