nogueiraanderson
Contributor

Summary

Fixes a critical logic bug in the removeUntaggedEc2 Lambda that caused EKS clusters to be deleted even when they had valid billing tags.

Problem

The Lambda was deleting EKS clusters that didn't match the skip pattern (pe-.*) without checking if they had valid billing tags.

Old (broken) logic:

  • EKS cluster doesn't match pe-.* → Delete entire cluster (ignored billing tag)

New (fixed) logic:

  • EKS instance has valid iit-billing-tag → Keep it (billing tag trumps everything)
  • EKS cluster matches pe-.* → Keep it (protected cluster)
  • EKS cluster has neither → Delete cluster

Changes

1. New Function: has_valid_billing_tag()

Validates iit-billing-tag values with support for:

  • Category strings: "pmm-staging", "jenkins-pmm-slave" → Valid
  • Unix timestamps: "1759837138" → Valid only if timestamp is in future (UTC)
  • Expired timestamps: Treated as invalid → Instance/cluster will be deleted
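
The validation rules above can be sketched as follows. This is a minimal illustration, not the code from removeUntaggedEc2.py: it assumes the instance tags have already been flattened into a plain dict, that any all-digit value is meant as a Unix timestamp, and the optional `now` parameter is a hypothetical addition to make the check testable.

```python
from datetime import datetime, timezone
from typing import Optional


def has_valid_billing_tag(tags: dict, now: Optional[datetime] = None) -> bool:
    """Return True for a category string or a Unix timestamp still in the future."""
    value = (tags.get("iit-billing-tag") or "").strip()
    if not value:
        return False  # missing or empty tag
    if value.isascii() and value.isdigit():
        # Assumption: all-digit values are Unix timestamps, compared in UTC.
        now = now or datetime.now(timezone.utc)
        return int(value) > now.timestamp()
    return True  # any other non-empty string is treated as a category


if __name__ == "__main__":
    print(has_valid_billing_tag({"iit-billing-tag": "pmm-staging"}))
    print(has_valid_billing_tag({"iit-billing-tag": "1759837138"}))
```

Comparing against a timezone-aware UTC `datetime` avoids the local-time skew that a naive `datetime.now()` could introduce on a machine not running in UTC.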

2. Updated: is_eks_managed_instance()

New priority order:

  1. Check billing tag first - If valid, skip (keep instance)
  2. Then check skip pattern - If matches pe-.*, skip (protected cluster)
  3. Neither - Mark cluster for deletion
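
The priority order above reduces to a small decision function. The sketch below is illustrative only: `eks_action` and its return values are hypothetical names, the billing-tag result is passed in as a boolean, and it assumes the pattern is matched from the start of the cluster name with `re.match` (the actual Lambda may match differently).

```python
import re

# Skip pattern for protected Platform Engineering clusters.
SKIP_PATTERN = re.compile(r"pe-.*")


def eks_action(cluster_name: str, billing_tag_valid: bool) -> str:
    """Return 'keep' or 'delete-cluster' for an instance in an EKS cluster."""
    if billing_tag_valid:
        return "keep"  # 1. a valid billing tag wins over everything else
    if SKIP_PATTERN.match(cluster_name):
        return "keep"  # 2. protected cluster, no tag required
    return "delete-cluster"  # 3. neither condition holds


if __name__ == "__main__":
    for name, tagged in [("pe-crossplane", False), ("pmm-test", True), ("pmm-ha", False)]:
        print(name, tagged, eks_action(name, tagged))
```

Checking the billing tag first is what fixes the original bug: a tagged cluster outside the pe-.* pattern never reaches the deletion branch.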

3. Updated: is_instance_to_terminate()

Uses has_valid_billing_tag() for consistent validation across EKS and regular EC2.

Test Coverage

Comprehensive test suite with 12 scenarios in test_removeUntaggedEc2_logic.py:

Regular EC2:

  • ✓ Category tag → Kept
  • ✓ Future timestamp → Kept
  • ✗ Expired timestamp → Terminated
  • ✗ No tag, >10min → Terminated
  • ○ No tag, <10min → Kept (grace period)

EKS - Protected (pe-*):

  • pe-crossplane no tag → Kept (skip pattern)
  • pe-infra with tag → Kept (billing tag)

EKS - Non-Protected:

  • pmm-test with tag → Kept (billing tag)
  • pmm-temp with future timestamp → Kept (billing tag)
  • pmm-ha no tag → Cluster deleted
  • pmm-expired with expired timestamp → Cluster deleted

Run tests: python3 cloud/aws-functions/test_removeUntaggedEc2_logic.py

Behavior Summary

Regular EC2:

  • Has valid billing tag → Skip
  • No tag + >10min → Terminate
  • No tag + <10min → Skip (grace period)
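
A rough sketch of the regular EC2 rule, including the grace period. The function name `should_terminate` and its parameters are hypothetical, and the sketch assumes "older than 10 minutes" means strictly more than 10 minutes since launch; the description does not say how an instance aged exactly 10 minutes is handled.

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(minutes=10)


def should_terminate(billing_tag_valid: bool, launch_time: datetime, now: datetime) -> bool:
    """Decide whether a regular (non-EKS) instance should be terminated."""
    if billing_tag_valid:
        return False  # valid billing tag: skip
    # No valid tag: terminate only once the grace period has passed.
    return now - launch_time > GRACE_PERIOD


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(should_terminate(False, now - timedelta(minutes=30), now))
    print(should_terminate(False, now - timedelta(minutes=2), now))
```

Both datetimes should be timezone-aware; subtracting a naive datetime from an aware one raises `TypeError` in Python.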

EKS Instances:

  • Has valid billing tag → Skip (instance is legitimate)
  • No billing tag BUT matches pe-.* → Skip (protected cluster)
  • No billing tag AND doesn't match pattern → Delete entire cluster

Key Points:

  • Billing tag can be category string OR Unix timestamp
  • Timestamps validated for expiration (UTC)
  • EKS instances are never terminated individually; the whole cluster is deleted via CloudFormation
  • Pattern pe-.* whitelists Platform Engineering clusters

Files Changed

  • cloud/aws-functions/removeUntaggedEc2.py - Fixed Lambda function
  • cloud/aws-functions/test_removeUntaggedEc2_logic.py - Test suite (no AWS connection needed)
  • cloud/aws-functions/README_removeUntaggedEc2.md - Documentation

Deployment

Before deploying, run validation:

python3 cloud/aws-functions/test_removeUntaggedEc2_logic.py

Deploy:

cd cloud/aws-functions
zip removeUntaggedEc2.zip removeUntaggedEc2.py

aws lambda update-function-code \
  --function-name removeUntaggedEc2 \
  --zip-file fileb://removeUntaggedEc2.zip \
  --region eu-west-1 \
  --profile percona-dev-admin

nogueiraanderson force-pushed the feature/fix-removeUntaggedEc2-billing-tag-check branch 3 times, most recently from 44324ab to 8c542ef, on October 6, 2025 12:09
nogueiraanderson changed the title from "Fix removeUntaggedEc2: validate billing tags before deleting EKS clusters" to "Fix removeUntaggedEc2: check iit-billing-tag before deleting EKS clusters" on Oct 6, 2025
Previously, EKS clusters that didn't match the skip pattern were deleted
even if they had valid billing tags. This fix adds proper billing tag
validation for EKS instances.

Changes:
- Add has_valid_billing_tag() to validate category strings and timestamps
- Check billing tags BEFORE skip pattern for EKS instances
- Support Unix timestamp expiration in iit-billing-tag
- Add comprehensive test suite with 12 test scenarios
- All tests pass with proper timezone handling (UTC)
nogueiraanderson force-pushed the feature/fix-removeUntaggedEc2-billing-tag-check branch from 8c542ef to e49ba68 on October 6, 2025 12:29
- Create IaC/RemoveUntaggedEc2Stack.yml with Lambda, IAM role, EventBridge rule
- Add justfile with deployment, health check, and development commands
- Lambda scans all AWS regions for untagged instances
- EventBridge triggers Lambda every 4 minutes
- DRY_RUN parameter controls whether deletions actually occur (defaults to true)
- Remove test file (test_removeUntaggedEc2_logic.py)
- Kebab-case command names, no emojis