Resources staying in STARTING/DELETING state #134

Open
praiskup opened this issue Jan 2, 2024 · 3 comments

@praiskup
Owner

praiskup commented Jan 2, 2024

268847 - aws_aarch64_spot_prod_00268847_20231218_005905 pool=aws_aarch64_spot_prod tags= status=STARTING releases=0 ticket=NULL
401511 - aws_aarch64_normalreserved_prod_00401511_20231228_232914 pool=aws_aarch64_normalreserved_prod tags= status=STARTING releases=0 ticket=NULL

These machines have been in the STARTING state for multiple days. Resalloc should detect that the allocator failed.
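
A minimal sketch of how such stale resources could be flagged. It assumes that the trailing _YYYYMMDD_HHMMSS part of the resource name is the creation time, which the names above suggest but which is not confirmed here. The function names and the one-hour limit are hypothetical, not part of resalloc.

```python
import re
from datetime import datetime, timedelta

# Assumption: resource names end in _YYYYMMDD_HHMMSS (creation time).
NAME_TS = re.compile(r"_(\d{8}_\d{6})$")


def started_at(name):
    """Parse the creation time from a resource name, or return None."""
    match = NAME_TS.search(name)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")


def is_stuck_starting(name, status, now, limit=timedelta(hours=1)):
    """Return True if a resource has been STARTING for longer than limit."""
    if status != "STARTING":
        return False
    created = started_at(name)
    return created is not None and now - created > limit
```

For the first machine listed above, is_stuck_starting(name, "STARTING", datetime(2024, 1, 2)) would report it as stuck. The same idea does not carry over directly to DELETING, because the name does not record when deletion began.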

praiskup added the bug label Jan 2, 2024
@praiskup
Owner Author

This may happen in two situations:

  • the allocator script fails and disappears from the ps aux output; I'm not sure why or how this can happen
  • the allocator script hangs, e.g. on an ansible command that runs indefinitely (subscription manager or similar)
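
Both situations could in principle be told apart by running the allocator command under a timeout. The following is only a sketch, not how resalloc runs its commands; run_allocator and its return values are hypothetical names.

```python
import subprocess


def run_allocator(command, timeout_seconds):
    """Run a shell command and return 'ok', 'failed', or 'timeout'."""
    try:
        subprocess.run(
            command,
            shell=True,
            check=True,                # non-zero exit raises CalledProcessError
            stdin=subprocess.DEVNULL,
            timeout=timeout_seconds,   # a hang raises TimeoutExpired
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except subprocess.CalledProcessError:
        return "failed"
    return "ok"
```

Note that with shell=True the timeout kills only the shell process; a grandchild such as a hanging ansible-playbook may survive unless the command is started in its own process group and the whole group is signalled.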

praiskup changed the title from "Resources staying in STARTING state" to "Resources staying in STARTING/DELETING state" Mar 1, 2024
@praiskup
Owner Author

praiskup commented Mar 1, 2024

A similar thing happens from time to time when deleting OpenStack instances, after the following traceback (I'm not 100% sure this is what triggers the problem):

Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/resalloc_openstack/helpers.py", line 74, in best_effort_delete
    self.delete()
  File "/usr/lib/python3.12/site-packages/resalloc_openstack/helpers.py", line 184, in delete
    self.nova_o.detach()
  File "/usr/lib/python3.12/site-packages/cinderclient/v3/volumes_base.py", line 69, in detach
    return self.manager.detach(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/cinderclient/v3/volumes_base.py", line 285, in detach
    return self._action('os-detach', volume,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/cinderclient/v3/volumes_base.py", line 257, in _action
    resp, body = self.api.client.post(url, body=body)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/cinderclient/client.py", line 223, in post
    return self._cs_request(url, 'POST', **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/cinderclient/client.py", line 211, in _cs_request
    return self.request(url, method, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/cinderclient/client.py", line 197, in request
    raise exceptions.from_response(resp, body)
cinderclient.exceptions.ClientException: The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-1e419934-999b-4256-a07e-a6d5e369b9c5)
failed to delete in #1 attempt

@praiskup
Owner Author

praiskup commented Mar 1, 2024

No, that would be a different problem. I'm not sure what happened, but starting the instance that is now in DELETING state failed:

+ ansible-playbook init.yml -i 10.0.150.201,
ERROR! the playbook: init.yml could not be found
running cleanup
cleaning 05_copr_vm_production_psi_os_00544952_20240229_172705_1
cleaning 10_server
deleting server 9e95963e-642b-41f3-b771-82411eba2386
Traceback (most recent call last):
  File "/usr/bin/resalloc-openstack-new", line 22, in <module>
    main() 
  File "/usr/lib/python3.12/site-packages/resalloc_openstack/new/main.py", line 131, in main
    check_call(args.command, env=env, shell=True, stdin=DEVNULL)
  File "/usr/lib64/python3.12/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -x ; ansible-playbook init.yml -i "$RESALLOC_OS_IP," >&2 </dev/null' returned non-zero exit status 1.

... it probably stayed in STARTING because of this bug. Then I restarted resalloc, and afterwards the resource stayed in DELETING state:

=== /var/log/resallocserver/hooks/544952_terminate ===
initializing <class 'resalloc_openstack.helpers.Server'>
vm copr_vm_production_psi_os_00544952_20240229_172705 not found
initializing <class 'resalloc_openstack.helpers.Server'>
vm copr_vm_production_psi_os_00544952_20240229_172705 not found
initializing <class 'resalloc_openstack.helpers.Server'>
vm copr_vm_production_psi_os_00544952_20240229_172705 not found
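
Why the resource stays in DELETING is not established here. One hypothesis, and only that, is that "not found" is not treated as a successful deletion. A sketch of an idempotent delete under that assumption follows; NotFound and delete_idempotent are hypothetical names, not the resalloc_openstack API.

```python
class NotFound(Exception):
    """Hypothetical stand-in for a cloud client's 'no such server' error."""


def delete_idempotent(delete_fn, name):
    """Call delete_fn(name); treat an already missing resource as deleted.

    Returns 'deleted' or 'already-gone'. Any other exception propagates,
    so real failures (e.g. an HTTP 500) are still visible to the caller.
    """
    try:
        delete_fn(name)
    except NotFound:
        return "already-gone"
    return "deleted"
```

With this shape, a terminate hook could exit successfully in both cases, and only genuine errors would keep the resource in DELETING for a retry.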
