Commit 9799e3b
[Release-3.14.1][Test] Add integration test for the fixes of issues caused by cluster update and rollback failure (#7150)
Add integration test to verify the following fixes work correctly:
- [F1] clustermgtd remains running after both update and rollback fail
(expected when failure occurs after slurm reconfiguration, which is the safe section)
- [F2] cfn-hup does not enter an endless loop after rollback to a state older than 24h
- [F3] dna.json files are cleaned up after update and rollback failure
Test scenario:
1. Create cluster with 3 static compute nodes
2. Inject cfn-signal failure on head node (simulating expired wait condition)
3. Disable cfn-hup on CN1 before update (causes update to fail)
4. Trigger cluster update (add new queue)
5. Wait for CN2 to apply update, then disable its cfn-hup
6. Update fails (CN1 didn't update), rollback fails (CN2 won't roll back)
7. Verify: clustermgtd running, dna.json cleaned up, CN3 has correct config version, metadata_db.json updated, no cfn-hup endless loop

1 parent 7d38133 commit 9799e3b
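
The checks in step 7 of the test scenario can be sketched as a single function that collects every violated expectation. This is only an illustration, not the test added by this commit: `NodeState`, `verify_post_failure_state`, the parameter names, and the idea of detecting a cfn-hup loop by counting update-hook runs against a threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NodeState:
    """Observed state of one node after the failed update and rollback (hypothetical type)."""
    name: str
    config_version: str
    leftover_dna_files: list = field(default_factory=list)


def verify_post_failure_state(
    clustermgtd_running: bool,
    nodes: list,
    expected_cn3_version: str,
    metadata_db_updated: bool,
    cfn_hup_update_runs: int,
    max_cfn_hup_runs: int = 1,
) -> list:
    """Return a list of problems; an empty list means every step-7 check passed."""
    problems = []

    # [F1] clustermgtd must still be running after update and rollback both fail.
    if not clustermgtd_running:
        problems.append("clustermgtd is not running")

    # [F3] no dna.json files may be left behind on any node.
    for node in nodes:
        if node.leftover_dna_files:
            problems.append(f"{node.name} still has dna.json files: {node.leftover_dna_files}")

    # CN3 (the node that was never tampered with) must report the expected config version.
    cn3 = next((n for n in nodes if n.name == "CN3"), None)
    if cn3 is None:
        problems.append("CN3 state was not collected")
    elif cn3.config_version != expected_cn3_version:
        problems.append(f"CN3 has config version {cn3.config_version}, expected {expected_cn3_version}")

    # metadata_db.json must have been updated.
    if not metadata_db_updated:
        problems.append("metadata_db.json was not updated")

    # [F2] assumed heuristic: more hook runs than the threshold suggests an endless loop.
    if cfn_hup_update_runs > max_cfn_hup_runs:
        problems.append(f"cfn-hup ran the update hook {cfn_hup_update_runs} times")

    return problems
```

Returning a list of problems instead of asserting on the first failure means one run reports every expectation that was violated, which is useful when a single scenario is meant to cover three separate fixes.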
File tree
5 files changed, +583 -0 lines changed
- tests/integration-tests
- configs
- tests/update
- test_update_rollback_failure/test_update_rollback_failure