cluster-manager node commissioning ansible task should timeout gracefully #176
Comments
The node was stuck on pvcreate because there was a DOS partition on the device, and pvcreate was hanging waiting for user input on whether to wipe it or not. So I am assuming this would resolve the original issue of node commissioning getting stuck. But we do have 2 issues here: the commissioning task should still time out gracefully instead of hanging indefinitely, and the pvcreate step should not block waiting for interactive input when the device already carries a partition or signature.
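For reference, a non-interactive version of the pvcreate step could look something like the sketch below. This is only an illustration: the task name is made up, docker_device is the variable seen in the extra-vars in the log further down, and pvcreate's --yes flag simply answers its wipe prompts instead of waiting on stdin.

```yaml
# Sketch only: hypothetical rewrite of the "pvcreate /dev/sdb" task so it
# cannot block on a confirmation prompt when the device has an old signature.
- name: pvcreate {{ docker_device }} without prompting
  command: "pvcreate --yes {{ docker_device }}"
```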
cc @erikh to comment on this.
Hm. If we can make pvcreate non-interactive, that seems like the way to go. If not, I'm not sure we should account for this problem. We could always wipe the block device first. I'm torn between asking the user to ensure their environment is sane and correcting this for them with potentially dramatic consequences (by wiping the disk for them).
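For illustration, the "wipe it for them" option could be a task along these lines. This is a sketch only, and destructive: wipefs --all removes every filesystem and partition-table signature on the device, so anything already stored there is lost. The docker_device variable is again assumed from the extra-vars in the log.

```yaml
# Sketch only: clear old signatures before pvcreate. This is the destructive
# path discussed above; it silently destroys whatever was on the device.
- name: wipe existing signatures on {{ docker_device }} (DATA LOSS)
  command: "wipefs --all {{ docker_device }}"
```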
@mapuri @erikh
Either way SGTM, but it's a trade-off. I think doing it automatically for the user is fine too; I don't have a preference.
While commissioning a node, the ansible task is stuck due to a host-level LVM issue. After a reasonable time, this task should time out and the node provisioning status should be updated accordingly. Currently it has been waiting for more than 12 hours and still expects it can go through, even though all LVM commands are hanging on the node...
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : setup iptables for docker] _"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="changed: [Docker-2-FLM19379EU8] => (item=2385)"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg=
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : copy systemd units for docker(enable cluster store) (debian)] "
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="skipping: [Docker-2-FLM19379EU8]"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg=
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : copy systemd units for docker(enable cluster store) (redhat)] *"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="changed: [Docker-2-FLM19379EU8]"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg=
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : check docker-tcp socket state] _"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="changed: [Docker-2-FLM19379EU8]"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg=
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : include] ***"
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg="included: /home/cluster-admin/ansible/roles/docker/tasks/create_docker_device.yml for Docker-2-FLM19379EU8"
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg=
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : pvcreate check for /dev/sdb] **********************************"
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg="fatal: [Docker-2-FLM19379EU8]: FAILED! => {"changed": true, "cmd": "pvdisplay /dev/sdb", "delta": "0:00:00.
085031", "end": "2016-04-24 22:53:36.311642", "failed": true, "rc": 5, "start": "2016-04-24 22:53:36.226611", "stderr": " Failed to find physical volume \"/dev/sdb
".", "stdout": "", "stdout_lines": [], "warnings": []}"
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg=...ignoring
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg=
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : pvcreate /dev/sdb] ********************************_"
Apr 24 22:53:51 Docker-1.cisco.com clusterm[8599]: level=debug msg="[ansible-playbook -i /tmp/hosts505218005 --user cluster-admin --private-key /home/cluster-admin/.ssh/id_rsa --extra
-vars {"contiv_network_mode":"standalone","contiv_network_src_file":"http://sam-tools01/share/netplugin-pv0.1-04-21-2016.15-14-32.UTC.tar.bz2","contiv_network_version":"v0.
1-04-21-2016.15-14-32.UTC","control_interface":"enp6s0","docker_device":"/dev/sdb","docker_device_size":"100000MB","docker_version":"1.10.3","env":{"HTTPS_PROXY":
"http://64.102.255.40:8080","HTTP_PROXY":"http://64.102.255.40:8080","NO_PROXY":"","http_proxy":"http://64.102.255.40:8080","https_proxy":"http://64.102.255.40:8080"
,"no_proxy":""},"fwd_mode":"routing","netplugin_if":"enp7s0","scheduler_provider":"ucp-swarm","service_vip":"10.65.122.79","ucp_bootstrap_node_name":"Docker-2-F
LM19379EU8","ucp_license_file":"/home/cluster-admin/docker_subscription.lic","ucp_version":"1.0.4","validate_certs":"false"} /home/cluster-admin/ansible//site.yml](/usr/
bin/ansible-playbook) (pid: 12903) has been running for 7m0.006053611s"
....
...
Apr 25 13:45:52 Docker-1.cisco.com clusterm[8599]: level=debug msg="[ansible-playbook -i /tmp/hosts505218005 --user cluster-admin --private-key /home/cluster-admin/.ssh/id_rsa --extra
-vars {"contiv_network_mode":"standalone","contiv_network_src_file":"http://sam-tools01/share/netplugin-pv0.1-04-21-2016.15-14-32.UTC.tar.bz2","contiv_network_version":"v0.
1-04-21-2016.15-14-32.UTC","control_interface":"enp6s0","docker_device":"/dev/sdb","docker_device_size":"100000MB","docker_version":"1.10.3","env":{"HTTPS_PROXY":
"http://64.102.255.40:8080","HTTP_PROXY":"http://64.102.255.40:8080","NO_PROXY":"","http_proxy":"http://64.102.255.40:8080","https_proxy":"http://64.102.255.40:8080"
,"no_proxy":""},"fwd_mode":"routing","netplugin_if":"enp7s0","scheduler_provider":"ucp-swarm","service_vip":"10.65.122.79","ucp_bootstrap_node_name":"Docker-2-F
LM19379EU8","ucp_license_file":"/home/cluster-admin/docker_subscription.lic","ucp_version":"1.0.4","validate_certs":"false"} /home/cluster-admin/ansible//site.yml](/usr/
bin/ansible-playbook) (pid: 12903) has been running for _14h59m0.389982285s"
On the node being commissioned, LVM status is -
Apr 24 22:53:36 Docker-2 python: ansible-command Invoked with warn=True executable=None chdir=None _raw_params=pvdisplay /dev/sdb removes=None creates=None _uses_shell=True
Apr 24 22:53:36 Docker-2 python: ansible-command Invoked with warn=True executable=None chdir=None _raw_params=pvcreate /dev/sdb removes=None creates=None _uses_shell=True
Apr 24 22:53:36 Docker-2 kernel: sdb:
'pvs -a' is stuck and had to be interrupted -
[root@Docker-2 log]# pvs -a
^C Interrupted...
Giving up waiting for lock.
/run/lock/lvm/P_orphans: flock failed: Interrupted system call
Can't get lock for #orphans_lvm1
Cannot process volume group #orphans_lvm1
Interrupted...
Interrupted...
PV VG Fmt Attr PSize PFree
/dev/sda2 rhel lvm2 a-- 278.91g 60.00m
/dev/sdb is not listed in the output.
So ideally, the commissioning task should time out and fail with an appropriate error message instead of hanging.
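One possible way to bound the step at the playbook level, independent of any timeout in clusterm itself, is Ansible's async/poll. The sketch below uses arbitrary values and the assumed docker_device variable; it is not the role's actual task:

```yaml
# Sketch only: if the command has not finished within "async" seconds,
# Ansible fails the task instead of letting it hang indefinitely.
- name: pvcreate {{ docker_device }}
  command: "pvcreate --yes {{ docker_device }}"
  async: 300   # give up after 5 minutes
  poll: 10     # check for completion every 10 seconds
```

clusterm could then surface that failure as a commissioning error rather than waiting on the playbook forever.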