@@ -12,38 +12,36 @@ In summary, the way this functionality works is as follows:
 
 1. The image reference(s) are manually updated in the OpenTofu configuration
    in the normal way.
-2. ```ansible-playbook lock_unlock_instances.yml
-   --limit control,login -e "appliances_server_action=unlock"
-   ```
+2. `ansible-playbook lock_unlock_instances.yml --limit control,login -e "appliances_server_action=unlock"`
    is run to unlock the control and login nodes for reimaging.
-2. `tofu apply` is run which rebuilds the login and control nodes to the new
+3. `tofu apply` is run which rebuilds the login and control nodes to the new
    image(s). The new image reference for compute nodes is ignored, but is
    written into the hosts inventory file (and is therefore available as an
    Ansible hostvar).
-3. The `site.yml` playbook is run which locks the instances again and reconfigures
+4. The `site.yml` playbook is run which locks the instances again and reconfigures
    the cluster as normal. At this point the cluster is functional, but using a new
    image for the login and control nodes and the old image for the compute nodes.
    This playbook also:
    - Writes cluster configuration to the control node, using the
      [compute_init](../../ansible/roles/compute_init/README.md) role.
    - Configures an application credential and helper programs on the control
      node, using the [rebuild](../../ansible/roles/rebuild/README.md) role.
-4. An admin submits Slurm jobs, one for each node, to a special "rebuild"
+5. An admin submits Slurm jobs, one for each node, to a special "rebuild"
    partition using an Ansible playbook. Because this partition has higher
    priority than the partitions normal users can use, these rebuild jobs become
    the next job in the queue for every node (although any jobs currently
    running will complete as normal).
-5. Because these rebuild jobs have the `--reboot` flag set, before launching them
+6. Because these rebuild jobs have the `--reboot` flag set, before launching them
    the Slurm control node runs a [RebootProgram](https://slurm.schedmd.com/slurm.conf.html#OPT_RebootProgram)
    which compares the current image for the node to the one in the cluster
    configuration, and if it does not match, uses OpenStack to rebuild the
    node to the desired (updated) image.
    TODO: Describe the logic if they DO match
-6. After a rebuild, the compute node runs various Ansible tasks during boot,
+7. After a rebuild, the compute node runs various Ansible tasks during boot,
    controlled by the [compute_init](../../ansible/roles/compute_init/README.md)
    role, to fully configure the node again. It retrieves the required cluster
    configuration information from the control node via an NFS mount.
-7. Once the `slurmd` daemon starts on a compute node, the slurm controller
+8. Once the `slurmd` daemon starts on a compute node, the slurm controller
    registers the node as having finished rebooting. It then launches the actual
    job, which does not do anything.
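For context on the new steps 5 and 6: each rebuild job is just a high-priority, do-nothing job with the reboot flag set. A minimal hand-rolled sketch of such a submission is shown below; the partition name `rebuild` and node name `compute-0` are assumptions for illustration, and in the appliance the submission is actually driven by an Ansible playbook.

```shell
# Sketch only: one job per node, queued on the high-priority "rebuild"
# partition with --reboot so Slurm triggers the RebootProgram first.
# Partition, node and job names here are illustrative assumptions.
sbatch --partition=rebuild \
       --nodelist=compute-0 \
       --exclusive \
       --reboot \
       --job-name=rebuild-compute-0 \
       --wrap="true"   # the job body does nothing; the reboot is the point
```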
4947
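Step 6's compare-and-rebuild behaviour is implemented by the rebuild role linked above; the outline below is only a rough sketch of the logic as described, not the appliance's actual RebootProgram. How the script is invoked and how the desired image is looked up (`get_desired_image`) are assumptions here.

```shell
#!/usr/bin/env bash
# Rough sketch of the compare-and-rebuild logic described in step 6.
set -euo pipefail

node="$1"                                      # node to reboot (invocation details glossed over)
desired_image="$(get_desired_image "$node")"   # hypothetical lookup in the cluster configuration
current_image="$(openstack server show "$node" -f value -c image)"

if [[ "$current_image" != *"$desired_image"* ]]; then
  # Image mismatch: rebuild the instance to the desired image via OpenStack.
  openstack server rebuild --image "$desired_image" "$node"
fi
# (The doc's TODO covers what happens when the images already match.)
```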
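While steps 7 and 8 are in progress, the rollover can be followed from the control node with standard Slurm commands, for example (the `rebuild` partition name is again an assumption):

```shell
# Watch the queued rebuild jobs and the nodes cycling through reboot states.
squeue --partition=rebuild --format="%.18i %.14j %.10T %R"
sinfo --Node --format="%N %.14T %E"   # node state and reason while rebooting
```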