Commit 9799e3b

[Release-3.14.1][Test] Add integration test for the fixes of issues caused by cluster update and rollback failure (#7150)
Add integration test to verify the following fixes work correctly:
- [F1] clustermgtd remains running after both update and rollback fail (expected when failure occurs after slurm reconfiguration, which is the safe section)
- [F2] cfn-hup does not enter an endless loop after rollback to a state older than 24h
- [F3] dna.json files are cleaned up after update and rollback failure

Test scenario:
1. Create cluster with 3 static compute nodes
2. Inject cfn-signal failure on head node (simulating expired wait condition)
3. Disable cfn-hup on CN1 before update (causes update to fail)
4. Trigger cluster update (add new queue)
5. Wait for CN2 to apply update, then disable its cfn-hup
6. Update fails (CN1 didn't update), rollback fails (CN2 won't rollback)
7. Verify: clustermgtd running, dna.json cleaned up, CN3 has correct config version, metadata_db.json updated, no cfn-hup endless loop
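The failure sequence in steps 3 to 6 can be illustrated with a small toy model. This is only a sketch under stated assumptions: it does not call ParallelCluster or CloudFormation, the `Node` class and the `"v1"`/`"v2"` version labels are hypothetical, and it assumes a node picks up a new config only while its cfn-hup is enabled. The cfn-signal injection on the head node (step 2) is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a static compute node (not a real API)."""
    name: str
    config_version: str = "v1"      # assumed label for the original config
    cfn_hup_enabled: bool = True

    def converge(self, target: str) -> bool:
        """Return True if the node ends up on the target config version."""
        if self.config_version == target:
            return True             # nothing to do, already on target
        if not self.cfn_hup_enabled:
            return False            # assumption: no cfn-hup, no config change
        self.config_version = target
        return True


def run_scenario():
    nodes = {name: Node(name) for name in ("CN1", "CN2", "CN3")}

    # Step 3: disable cfn-hup on CN1 before the update.
    nodes["CN1"].cfn_hup_enabled = False

    # Step 4: trigger the update. A list (not a generator) is used so every
    # node is attempted even after the first failure.
    update_ok = all([node.converge("v2") for node in nodes.values()])

    # Step 5: CN2 has applied the update, now disable its cfn-hup.
    nodes["CN2"].cfn_hup_enabled = False

    # Step 6: the failed update triggers a rollback to the original config.
    rollback_ok = all([node.converge("v1") for node in nodes.values()])

    versions = {name: node.config_version for name, node in nodes.items()}
    return update_ok, rollback_ok, versions


if __name__ == "__main__":
    print(run_scenario())
```

In this model the update fails because CN1 never leaves its original version, and the rollback fails because CN2 stays on the updated version, while CN3 follows both transitions. Which version the real test treats as "correct" for CN3 is not stated in the commit message.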
1 parent 7d38133 commit 9799e3b

5 files changed: +583 -0 lines changed

tests/integration-tests/configs/develop.yaml

Lines changed: 6 additions & 0 deletions
@@ -825,6 +825,12 @@ test-suites:
           instances: {{ common.INSTANCES_DEFAULT_X86 }}
           oss: [{{ OS_X86_3 }}]
           schedulers: ["slurm"]
+    test_update_rollback_failure.py::test_update_rollback_failure:
+      dimensions:
+        - regions: ["us-east-1"]
+          instances: {{ common.INSTANCES_DEFAULT_X86 }}
+          oss: [{{ OS_X86_2 }}]
+          schedulers: ["slurm"]
   users:
     test_default_user_home.py::test_default_user_local_home:
       dimensions:
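
The new `dimensions` entry lists candidate values per axis. As a rough sketch, assuming the test runner expands each entry into the cartesian product of its lists (this page does not show the runner), the expansion could look like the following. The function name `expand_dimensions` and the sample values `"instance-a"`, `"instance-b"` and `"os-a"` are hypothetical; the rendered values of `common.INSTANCES_DEFAULT_X86` and `OS_X86_2` are not visible in the diff.

```python
from itertools import product

# Axis names as they appear in the config entry.
AXES = ("regions", "instances", "oss", "schedulers")


def expand_dimensions(entry):
    """Expand one dimensions entry into one dict per combination."""
    combos = product(*(entry[axis] for axis in AXES))
    return [dict(zip(AXES, combo)) for combo in combos]


if __name__ == "__main__":
    # Hypothetical rendered entry; only the region and scheduler come from the diff.
    entry = {
        "regions": ["us-east-1"],
        "instances": ["instance-a", "instance-b"],
        "oss": ["os-a"],
        "schedulers": ["slurm"],
    }
    for combo in expand_dimensions(entry):
        print(combo)
```

With one region, one OS and one scheduler, the number of test runs under this assumption equals the number of instance types the template renders.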
