Commit 7d38133

[Release-3.14.1][Bug] Fix cfn-hup endless loop after rollback to cluster state older than 24 hours (#7146)
When a cluster update fails and triggers a rollback to a state older than 24 hours, cfn-hup enters an endless loop on the head node. This happens because:

1. The rollback restores the launch template metadata to reference an expired wait condition handle (wait conditions expire after 24h).
2. cfn-signal fails to signal the expired handle and returns non-zero.
3. cfn-hup sees the non-zero exit code and does not update its local metadata cache (metadata_db.json).
4. On the next polling interval, cfn-hup detects the same "change" and re-triggers the update recipe, creating an infinite loop.

This fix appends `|| exit 0` to the update command, ensuring cfn-hup always updates its metadata cache regardless of whether cfn-signal succeeds or fails. This prevents the endless loop while still allowing CloudFormation to handle timeouts appropriately.
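
The loop described above can be illustrated with a small simulation. This is a simplified model and not cfn-hup's actual implementation: the names `poll_until_stable`, `failing_signal`, and `with_exit_zero_fallback` are hypothetical, and the only behavior modeled is the one stated in this commit message (the cache is refreshed only when the hook command exits with 0).

```python
from typing import Callable


def poll_until_stable(
    remote_metadata: str,
    cached_metadata: str,
    run_hook: Callable[[], int],
    max_polls: int = 5,
) -> int:
    """Return how many times the hook ran within max_polls polling intervals."""
    runs = 0
    for _ in range(max_polls):
        if remote_metadata == cached_metadata:
            break  # no change detected, so the hook is not triggered
        runs += 1
        exit_code = run_hook()
        if exit_code == 0:
            # Modeled behavior: the local cache is updated only on success.
            cached_metadata = remote_metadata
    return runs


def failing_signal() -> int:
    """Stand-in for an update command whose cfn-signal call fails."""
    return 1


def with_exit_zero_fallback(hook: Callable[[], int]) -> Callable[[], int]:
    """Model of appending `|| exit 0`: any non-zero exit code becomes 0."""
    def wrapped() -> int:
        code = hook()
        return 0 if code != 0 else code
    return wrapped


# Without the fallback the hook is re-triggered on every polling interval.
runs_without_fix = poll_until_stable("rolled-back", "stale", failing_signal)
# With the fallback the cache catches up after a single run.
runs_with_fix = poll_until_stable(
    "rolled-back", "stale", with_exit_zero_fallback(failing_signal)
)
print(runs_without_fix, runs_with_fix)
```

In this model the unfixed hook runs once per polling interval until the cap is reached, while the wrapped hook runs once and then stops because the cache matches the remote metadata.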
1 parent ac83e79 commit 7d38133

2 files changed: 8 additions, 0 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ CHANGELOG
 - Reduce EFA installation time for Ubuntu by ~20 minutes by only holding kernel packages for the installed kernel.
 - Add GetFunction and GetPolicy permissions to PClusterBuildImageCleanupRole to prevent AccessDenied errors during build image stack deletion.
 - Fix validation error messages when `DevSettings` is null or `DevSettings/InstanceTypesData` is missing required fields.
+- Fix an issue where cfn-hup enters an endless loop on the head node after a rollback to a cluster state older than 24 hours, caused by cfn-signal failing to signal an expired wait condition handle.
 
 3.14.0
 ------

cli/src/pcluster/templates/cluster_stack.py

Lines changed: 7 additions & 0 deletions
@@ -1496,6 +1496,12 @@ def _add_head_node(self):
     "chefUpdate": {
         "commands": {
             "chef": {
+                # This command runs the update recipe and signals CloudFormation with the result.
+                # The trailing "|| exit 0" ensures cfn-hup always updates its local metadata cache
+                # (metadata_db.json) regardless of whether cfn-signal succeeds or fails.
+                # Without this, if cfn-signal fails (e.g., due to an expired wait condition handle
+                # after a rollback to a state older than 24h), cfn-hup would not update its cache
+                # and would enter an endless loop, re-triggering the update recipe every minute.
                 "command": (
                     ". /etc/parallelcluster/pcluster_cookbook_environment.sh; "
                     "cinc-client --local-mode --config /etc/chef/client.rb --log_level info"
@@ -1509,6 +1515,7 @@ def _add_head_node(self):
                     f" $CFN_BOOTSTRAP_VIRTUALENV_PATH/cfn-signal --exit-code=1 --reason='Update failed'"
                     f" --region {self.stack.region} --url {cloudformation_url}"
                     f" '{self.wait_condition_handle.ref}'"
+                    " || exit 0"
                 ),
                 "cwd": "/etc/chef",
             }
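
The effect of the trailing `|| exit 0` on the shell exit status can be checked in isolation. The sketch below assumes a POSIX `sh` is available and uses the standard `false` and `true` utilities as stand-ins for a failing and a succeeding command; it does not run cinc-client or cfn-signal, and the helper name `shell_exit_code` is hypothetical.

```python
import subprocess


def shell_exit_code(command: str) -> int:
    """Run a command line through `sh -c` and return its exit status."""
    return subprocess.run(["sh", "-c", command]).returncode


# A failing command on its own propagates a non-zero status.
without_fallback = shell_exit_code("false")

# With `|| exit 0` appended, the failure is converted into status 0.
with_fallback = shell_exit_code("false || exit 0")

# A succeeding command is unaffected: the right-hand side never runs.
success_path = shell_exit_code("true || exit 0")

# Two failing steps in a row (stand-ins only) still end with status 0.
chained_failures = shell_exit_code("false || false || exit 0")

print(without_fallback, with_fallback, success_path, chained_failures)
```

Because `||` only evaluates its right-hand side when the left-hand side fails, the appended clause changes the exit status on the failure path alone.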