
cluster-manager node commissioning ansible task should timeout gracefully #176

Open
rkharya opened this issue Apr 25, 2016 · 5 comments

rkharya commented Apr 25, 2016

While commissioning a node, an ansible task got stuck due to a host-level LVM issue. After a reasonable time, this task should time out and the node's provisioning status should be updated accordingly. Currently it has been waiting for more than 12 hours and still expects the task to go through, even though all LVM commands on the node are hanging...

Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : setup iptables for docker] ***"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="changed: [Docker-2-FLM19379EU8] => (item=2385)"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg=
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : copy systemd units for docker(enable cluster store) (debian)] ***"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="skipping: [Docker-2-FLM19379EU8]"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg=
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : copy systemd units for docker(enable cluster store) (redhat)] ***"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="changed: [Docker-2-FLM19379EU8]"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg=
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : check docker-tcp socket state] ***"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg="changed: [Docker-2-FLM19379EU8]"
Apr 24 22:53:35 Docker-1.cisco.com clusterm[8599]: level=info msg=
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : include] ***"
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg="included: /home/cluster-admin/ansible/roles/docker/tasks/create_docker_device.yml for Docker-2-FLM19379EU8"
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg=
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : pvcreate check for /dev/sdb] ***"
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg="fatal: [Docker-2-FLM19379EU8]: FAILED! => {"changed": true, "cmd": "pvdisplay /dev/sdb", "delta": "0:00:00.085031", "end": "2016-04-24 22:53:36.311642", "failed": true, "rc": 5, "start": "2016-04-24 22:53:36.226611", "stderr": "  Failed to find physical volume \"/dev/sdb\".", "stdout": "", "stdout_lines": [], "warnings": []}"
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg=...ignoring
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg=
Apr 24 22:53:36 Docker-1.cisco.com clusterm[8599]: level=info msg="TASK [docker : pvcreate /dev/sdb] ***"
Apr 24 22:53:51 Docker-1.cisco.com clusterm[8599]: level=debug msg="[ansible-playbook -i /tmp/hosts505218005 --user cluster-admin --private-key /home/cluster-admin/.ssh/id_rsa --extra-vars {"contiv_network_mode":"standalone","contiv_network_src_file":"http://sam-tools01/share/netplugin-pv0.1-04-21-2016.15-14-32.UTC.tar.bz2","contiv_network_version":"v0.1-04-21-2016.15-14-32.UTC","control_interface":"enp6s0","docker_device":"/dev/sdb","docker_device_size":"100000MB","docker_version":"1.10.3","env":{"HTTPS_PROXY":"http://64.102.255.40:8080","HTTP_PROXY":"http://64.102.255.40:8080","NO_PROXY":"","http_proxy":"http://64.102.255.40:8080","https_proxy":"http://64.102.255.40:8080","no_proxy":""},"fwd_mode":"routing","netplugin_if":"enp7s0","scheduler_provider":"ucp-swarm","service_vip":"10.65.122.79","ucp_bootstrap_node_name":"Docker-2-FLM19379EU8","ucp_license_file":"/home/cluster-admin/docker_subscription.lic","ucp_version":"1.0.4","validate_certs":"false"} /home/cluster-admin/ansible//site.yml](/usr/bin/ansible-playbook) (pid: 12903) has been running for 7m0.006053611s"



....
...
Apr 25 13:45:52 Docker-1.cisco.com clusterm[8599]: level=debug msg="[ansible-playbook -i /tmp/hosts505218005 --user cluster-admin --private-key /home/cluster-admin/.ssh/id_rsa --extra-vars {"contiv_network_mode":"standalone","contiv_network_src_file":"http://sam-tools01/share/netplugin-pv0.1-04-21-2016.15-14-32.UTC.tar.bz2","contiv_network_version":"v0.1-04-21-2016.15-14-32.UTC","control_interface":"enp6s0","docker_device":"/dev/sdb","docker_device_size":"100000MB","docker_version":"1.10.3","env":{"HTTPS_PROXY":"http://64.102.255.40:8080","HTTP_PROXY":"http://64.102.255.40:8080","NO_PROXY":"","http_proxy":"http://64.102.255.40:8080","https_proxy":"http://64.102.255.40:8080","no_proxy":""},"fwd_mode":"routing","netplugin_if":"enp7s0","scheduler_provider":"ucp-swarm","service_vip":"10.65.122.79","ucp_bootstrap_node_name":"Docker-2-FLM19379EU8","ucp_license_file":"/home/cluster-admin/docker_subscription.lic","ucp_version":"1.0.4","validate_certs":"false"} /home/cluster-admin/ansible//site.yml](/usr/bin/ansible-playbook) (pid: 12903) has been running for 14h59m0.389982285s"

On the node being commissioned, the LVM status is:

Apr 24 22:53:36 Docker-2 python: ansible-command Invoked with warn=True executable=None chdir=None _raw_params=pvdisplay /dev/sdb removes=None creates=None _uses_shell=True
Apr 24 22:53:36 Docker-2 python: ansible-command Invoked with warn=True executable=None chdir=None _raw_params=pvcreate /dev/sdb removes=None creates=None _uses_shell=True
Apr 24 22:53:36 Docker-2 kernel: sdb:

'pvs -a' was stuck and had to be interrupted:

[root@Docker-2 log]# pvs -a
^C Interrupted...
Giving up waiting for lock.
/run/lock/lvm/P_orphans: flock failed: Interrupted system call
Can't get lock for #orphans_lvm1
Cannot process volume group #orphans_lvm1
Interrupted...
Interrupted...
PV VG Fmt Attr PSize PFree
/dev/sda2 rhel lvm2 a-- 278.91g 60.00m

/dev/sdb is not listed.

So ideally, the commissioning task should fail with an appropriate error message.
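
For reference, a minimal sketch of how the pvcreate step could be bounded, assuming it stays a shell task in create_docker_device.yml (the 300s/10s values below are arbitrary, not what the role uses today):

    # Hypothetical bounded variant of the pvcreate task.
    # async/poll makes Ansible poll the job and mark the task failed once the
    # async window expires, instead of blocking on a hung pvcreate forever.
    # The hung process on the node is not killed, but the play stops with an
    # error instead of running for 12+ hours.
    - name: pvcreate {{ docker_device }}
      shell: "pvcreate {{ docker_device }}"
      async: 300   # give up waiting after 5 minutes (illustrative value)
      poll: 10     # check job status every 10 seconds

With something like that in place, clusterm would see a failed ansible run and could set the node's provisioning status accordingly.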

rkharya commented Apr 25, 2016

The node was stuck on pvcreate because there was a DOS partition on the device, and pvcreate was waiting for user input ('wipe it or not?'). Clearing that up should resolve the original issue of node commissioning getting stuck. But we do have 2 issues here -

  1. In case of any such error, our node commissioning tasks should error out gracefully after a sufficient time allowance.
  2. Our ansible task has to handle the case where the device already has a partition and pvcreate asks for user input on whether to wipe it (see the sketch after this list).
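
A hedged sketch of what such a guard could look like (the blkid check and variable names are illustrative, not the current role code; on the util-linux versions I've seen, blkid exits non-zero when it finds no signature on the device):

    # Hypothetical pre-check before pvcreate: refuse to touch a device that
    # already carries a partition table or filesystem signature, and say why,
    # instead of letting pvcreate sit on an interactive "wipe it?" prompt.
    - name: check {{ docker_device }} for existing signatures
      shell: "blkid {{ docker_device }}"
      register: docker_device_blkid
      failed_when: false
      changed_when: false

    - name: abort commissioning if the device is not clean
      fail:
        msg: "{{ docker_device }} already has a signature ({{ docker_device_blkid.stdout }}); wipe it or point docker_device at a clean disk"
      when: docker_device_blkid.rc == 0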

mapuri commented Apr 25, 2016

cc @erikh to comment on 2); there might be a non-interactive mode of the pvcreate command that fails gracefully.

@rkharya

Regarding 1), we can set ansible task timeouts, but timeout-based handling is always tricky as you never know how long a command is going to take. I think a reasonable fix might be to address 2) above, then give the user the ability to check the ansible logs (i.e. fix contiv-experimental/cluster#103), and if the user sees a task getting stuck they should be able to interrupt the ansible run (i.e. fix contiv-experimental/cluster#63) and then just rerun ansible with the issue fixed. Does that sound reasonable?

erikh commented Apr 25, 2016

Hm. If we can -y it somehow, great.

If not, I'm not sure we should account for this problem. We could always wipe the block device with dd or the like, but that seems like something that has only a little benefit for this use case specifically and may cause other trouble.

I'm torn between asking the user to ensure their environment is sane and correcting this for them with potentially dramatic consequences (by wiping the disk for them).

rkharya commented Apr 26, 2016

@mapuri
Yes, this approach is good to go.

@erikh
So I guess this should be a documented scenario - asking the user to have a clean disk as a prerequisite before ansible takes it through the docker_device creation routine, since they know best which disk is supposed to be used.
On a side note, it also sounds reasonable that if the user has already identified a device for docker_device use, they are aware that anything on it will be purged and a new physical volume will be created. So passing '-y' to pvcreate should not be of much concern here. Thoughts?
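
For completeness, a hedged sketch of the non-interactive variant (pvcreate's -y/--yes flag answers its own prompts; the task shape here is illustrative, not the current role code):

    # Hypothetical non-interactive pvcreate task: --yes auto-answers prompts
    # such as the "wipe existing dos signature?" question, so the play can
    # never hang waiting for stdin on the commissioned node.
    - name: pvcreate {{ docker_device }}
      shell: "pvcreate --yes {{ docker_device }}"

Combined with documenting the clean-disk prerequisite, that keeps the behaviour explicit while removing the hang.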

erikh commented Apr 26, 2016

Either way SGTM, but it's a trade-off. I think doing it automatically for them can cause great sadness if they get it wrong.

I don't have a preference.


mapuri modified the milestones: 0.1, 0.2 on May 11, 2016