
Commit 6119c22

fix
1 parent 5192f60 commit 6119c22

File tree

1 file changed (+7, -9 lines)


docs/experimental/slurm-controlled-rebuild.md

Lines changed: 7 additions & 9 deletions
@@ -12,38 +12,36 @@ In summary, the way this functionality works is as follows:
 
 1. The image references(s) are manually updated in the OpenTofu configuration
    in the normal way.
-2. ``` ansible-playbook lock_unlock_instances.yml
-   --limit control,login -e "appliances_server_action=unlock"
-   ```
+2. `ansible-playbook lock_unlock_instances.yml --limit control,login -e "appliances_server_action=unlock"`
    is run to unlock the control and login nodes for reimaging.
-2. `tofu apply` is run which rebuilds the login and control nodes to the new
+3. `tofu apply` is run which rebuilds the login and control nodes to the new
    image(s). The new image reference for compute nodes is ignored, but is
    written into the hosts inventory file (and is therefore available as an
    Ansible hostvar).
-3. The `site.yml` playbook is run which locks the instances again and reconfigures
+4. The `site.yml` playbook is run which locks the instances again and reconfigures
    the cluster as normal. At this point the cluster is functional, but using a new
    image for the login and control nodes and the old image for the compute nodes.
    This playbook also:
    - Writes cluster configuration to the control node, using the
      [compute_init](../../ansible/roles/compute_init/README.md) role.
    - Configures an application credential and helper programs on the control
      node, using the [rebuild](../../ansible/roles/rebuild/README.md) role.
-4. An admin submits Slurm jobs, one for each node, to a special "rebuild"
+5. An admin submits Slurm jobs, one for each node, to a special "rebuild"
    partition using an Ansible playbook. Because this partition has higher
    priority than the partitions normal users can use, these rebuild jobs become
    the next job in the queue for every node (although any jobs currently
    running will complete as normal).
-5. Because these rebuild jobs have the `--reboot` flag set, before launching them
+6. Because these rebuild jobs have the `--reboot` flag set, before launching them
    the Slurm control node runs a [RebootProgram](https://slurm.schedmd.com/slurm.conf.html#OPT_RebootProgram)
    which compares the current image for the node to the one in the cluster
    configuration, and if it does not match, uses OpenStack to rebuild the
    node to the desired (updated) image.
    TODO: Describe the logic if they DO match
-6. After a rebuild, the compute node runs various Ansible tasks during boot,
+7. After a rebuild, the compute node runs various Ansible tasks during boot,
    controlled by the [compute_init](../../ansible/roles/compute_init/README.md)
    role, to fully configure the node again. It retrieves the required cluster
    configuration information from the control node via an NFS mount.
-7. Once the `slurmd` daemon starts on a compute node, the slurm controller
+8. Once the `slurmd` daemon starts on a compute node, the slurm controller
    registers the node as having finished rebooting. It then launches the actual
    job, which does not do anything.
 
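Steps 2–4 of the revised list amount to a short command sequence. A minimal sketch is shown below; it assumes the playbooks are run from the appliance environment with the appropriate inventory selected, and the exact `site.yml` invocation may differ.

```
# Illustrative sequence for steps 2-4 (inventory and paths are assumptions):
ansible-playbook lock_unlock_instances.yml --limit control,login \
    -e "appliances_server_action=unlock"   # unlock control/login for reimaging
tofu apply                                 # rebuild control/login to the new image(s)
ansible-playbook site.yml                  # re-lock instances and reconfigure the cluster
```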

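The locking and unlocking in steps 2 and 4 plausibly maps onto the OpenStack instance lock, which blocks destructive actions such as rebuilds until the instance is unlocked; the playbook name suggests it toggles this flag. For orientation only, the manual equivalents would look like this (server names are examples):

```
openstack server unlock example-control example-login-0   # allow rebuild
openstack server lock example-control example-login-0     # protect again
```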
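Step 5 is driven by an Ansible playbook, but the underlying Slurm mechanics can be sketched with standard `sbatch` options. Everything below is a hypothetical illustration: the partition names, the node-listing command and the no-op job body are assumptions, not the playbook's actual implementation.

```
# Hypothetical sketch of step 5: one no-op job per node, submitted with --reboot
# to a high-priority "rebuild" partition (partition names assumed):
for node in $(sinfo --noheader --Node --format="%N" --partition=standard | sort -u); do
    sbatch --partition=rebuild --nodelist="$node" --reboot --wrap="true"
done

# The rebuild partition only works as described if it outranks the normal ones:
scontrol show partition rebuild | grep -o 'PriorityTier=[0-9]*'
```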
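Step 6 has the RebootProgram compare a node's current image with the one in the cluster configuration and rebuild it if they differ. The fragment below is a rough sketch of that comparison using the OpenStack CLI; it is not the rebuild role's actual script, and the node and image names are placeholders.

```
# Rough sketch of the image comparison in step 6 (names are placeholders):
node="example-compute-0"
desired="example-new-image"
current=$(openstack server show "$node" -f value -c image)

if [[ "$current" != *"$desired"* ]]; then
    openstack server rebuild --image "$desired" --wait "$node"
fi
# The behaviour when the images already match is the TODO noted in the document.
```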
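For steps 7 and 8, generic checks can confirm that a rebuilt compute node has come back cleanly: the NFS mount the compute_init role reads from, the `slurmd` service, and the node state reported by the controller. These are ordinary Slurm/Linux commands, not part of the appliance; the partition and node names are examples.

```
# On the rebuilt compute node:
mount | grep -i nfs        # cluster configuration is pulled from the control node over NFS
systemctl status slurmd    # once slurmd is up, the controller marks the reboot complete

# From anywhere with Slurm client tools:
squeue --partition=rebuild                       # rebuild jobs should drain away
scontrol show node example-compute-0 | grep -i state
```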