Open
dbolotin
requested changes
Feb 20, 2026
44a2050 to
1411628
Split README.md into two guides:
- Part A (README.md): CloudFormation point-and-click path with mermaid diagrams, 11-step walkthrough, verification checklist
- Part B (advanced-installation.md): Manual CLI path with configuration block for all variables, inline IAM policy generation, scripted ACM validation; works for both human operators and AI agents
CloudFormation changes:
- Route53 (HostedZoneId, DomainName) mandatory, because Desktop App requires TLS
- S3 bucket always created (removed CreateS3Bucket toggle)
- All IRSA roles always created (removed CreateAutoscalerRole, CreateALBControllerRole, CreateExternalDNSRole toggles)
- Fix PrivateSubnetIds default to ',,' (CloudFormation !Select workaround)
Other:
- Add External DNS and Auto Scaling permissions to permissions.md
- Remove orphaned cluster-autoscaler-policy.json
1411628 to
2409034
- Remove restrictive S3 bucket policy deny statement that locked out account administrators after stack deletion (IRSA role deleted with stack, leaving only root with access)
- Add DependsOn: VpcGatewayAttachment to NatGateway (prevents intermittent creation failure)
- Change EFS security group from CIDR-based to cluster SG reference (fixes existing-VPC path, tightens security)
- Add DeletionPolicy: Retain to EFS filesystem (consistent with S3, prevents workspace data loss)
- Add CloudWatch log group with 30-day retention (prevents orphaned log group after stack deletion)
- Add VpcId and NodeGroupRoleArn to Outputs
- Improve cleanup docs with deletion order warning, EFS and log group cleanup steps
59c2cee to
1fd8c39
dbolotin
requested changes
Feb 23, 2026
- Simplify architecture diagram: remove Cluster Autoscaler box, move EBS inside EKS boundary, add Desktop App → S3 data access link
- Replace three-phases mermaid diagram with plain text
- Add UiNodeAlwaysOn CloudFormation parameter: keeps 1 t3.xlarge running permanently (~$200/month) to eliminate the ~2-3 min cold start
- Expand DNS/TLS section: explain what a hosted zone is, link to new domain-guide.md
- Add AWS CLI authentication instructions to Step 2 (kubectl configure)
- Add domain-guide.md: how to register a domain via Route53, find hosted zone ID, use existing domains
Now that CA, ALB Controller, and External DNS are bundled as sub-charts in the platforma chart, collapse 11 install steps down to 8:
- Remove Step 3 (StorageClasses): now created by helm via storageClasses.*
- Remove Steps 4-6 (CA, ALB, ExternalDNS separate helm installs): now sub-charts enabled via --set in a single helm install command
- Step 10 becomes Step 7, includes all sub-chart --set flags and storageClasses.efs.fileSystemId
Add ASG tagging step (new Step 3): CloudFormation cannot set dynamic tag keys, so node group ASGs still need the k8s.io/cluster-autoscaler/<name> tag added post-deploy.
Update values-aws-s3.yaml: add storageClasses + sub-chart sections with static AWS defaults; dynamic values (ARNs, IDs, names) remain as --set flags.
Update cleanup: platforma uninstall covers sub-charts (no separate releases).
…ace IRSA
CloudFormation:
- Add NodeGroupLaunchTemplate with TagSpecifications targeting ResourceType: auto-scaling-group; this sets k8s.io/cluster-autoscaler/<name>=owned on the ASG itself at stack creation time, eliminating the post-deploy bash loop
- Fix IRSA trust policies for cluster-autoscaler, aws-load-balancer-controller, and external-dns: change namespace from kube-system to PlatformaNamespace so sub-charts deployed in the platforma namespace can assume their roles
Helm install (README):
- 11 steps → 6 steps: ASG tagging removed (CF), SA creation removed (helm), Steps 3-4 eliminated, namespace created via --create-namespace
- Pass PlatformaRoleArn via serviceAccount.annotations (chart creates SA)
- Add aws-load-balancer-controller.region to the install command
values-aws-s3.yaml:
- Add explicit serviceAccount names for sub-charts (must match CF trust policies)
- Add serviceAccount.annotations for Platforma SA
- Add aws-load-balancer-controller.region
advanced-installation.md:
- Clarify this is the "operator manages everything" path
- Update helm install to disable sub-charts and StorageClasses (managed manually in previous steps)
- README: separate CF outputs from user-supplied values in Step 5 preamble
- README: add txtOwnerId explanation
- README: fix verification checklist: controllers deploy to platforma namespace (not kube-system)
- README: fix troubleshooting CA logs namespace (kube-system → platforma)
- README: add working directory note for Step 3 (Kueue)
- README: cleanup section: define variables, replace hardcoded stack name
- CF: fix EfsFileSystemId output description (wrong values path)
- CF: update PostDeploySteps to match 6-step README structure
- advanced-installation.md: add Python 3 to prerequisites
README:
- Step 1: clarify Outputs table covers only install-relevant outputs; note remaining are infrastructure-only
- Step 1: warn subnet ID fields show ',,' by default; do not clear, it is a CF workaround
- Step 4: explain why namespace must be pre-created (license secret must exist before Platforma starts)
- Step 5: add note that chart creates Kueue queue resources (ClusterQueues, LocalQueues, ResourceFlavors)
- Step 5: move ALB wait note before verify commands; note empty ADDRESS is normal while ALB provisions
CloudFormation:
- PostDeploySteps output: move step summary into Value field (visible in Outputs tab); Description kept as pointer to README
- Step 5: remove contradicting "namespace created automatically": namespace must be pre-created (Step 4); service accounts are still auto-created
- Step 5: clarify DOMAIN_FILTER is the hosted zone domain, not always the root; add sub-zone case and warn that wrong value causes silent ExternalDNS failure
- Cleanup: drain AppWrapper CRs before deleting the controller to prevent CRD finalizer hang
- CloudFormation PostDeploySteps: note "(step 1 = this stack)" so the value is self-contained when read in the Outputs tab
- Phase 2 summary: fix stale "namespace created automatically"; clarify namespace and license secret are created manually in Steps 3-4
- Step 6: port-forward note: Desktop App supports non-TLS to localhost, no certificate needed for this mode
- Cleanup: add data loss warning before S3/EFS delete commands
- Troubleshooting: add CloudFormation stack stuck in CREATE_IN_PROGRESS section covering ACM certificate validation failure (most common cause)
- CloudFormation PlatformaNamespace output: clarify it is a reference only, created via kubectl in Step 4, not by the stack
- Step 4: rename heading to "Create namespace and license secret"
- Step 5: remove --create-namespace flag: namespace is guaranteed by Step 4; if missing, helm fails with a clear error rather than silently proceeding without the license secret
- Troubleshooting: add "kubectl get nodes returns no nodes or NotReady" entry covering wrong region and IAM permission issues
- PostDeploySteps CF output: rename step 4 to "create namespace and license secret" to match README Step 4 heading
- Step 2: align placeholder style to use $REGION (consistent with Step 5); add note to set it early for reuse
- Step 2: add timing caveat: nodes may still be initializing after stack completes; wait ~1 min if NotReady
- Step 5: clarify values-aws-s3.yaml ships in the repo at that path
- Step 6: explain port 6345 is Platforma's gRPC port
- Step 2: add explicit REGION= assignment before kubeconfig command; remove the "set it now if you want" suggestion (now a directive)
- Step 2: fix node count assertion to "the number you configured (2 by default)" so it's accurate when SystemNodeCount was changed
- Step 3: add explicit `cd infrastructure/aws` command; note that kueue-values.yaml is pre-configured, no edits needed
- Step 5: replace "Run from..." with cd reminder; extract DOMAIN_FILTER explanation from inside the code block into prose callout before it; simplify inline DOMAIN_FILTER comment to reference the note above
- Cleanup: explain that S3 and EFS use DeletionPolicy: Retain in the CF template, so manual deletion is required (not just "for data safety")
- values-aws-s3.yaml: add ingress section with placeholder comments; document that tls.secretName="" means ALB/ACM manages TLS
…h Lambda
EC2 launch template TagSpecifications do not support 'auto-scaling-group' as a ResourceType; the API rejects it. EKS nodegroup Tags also do not propagate to the underlying ASG. A Lambda-backed custom resource is the only pure-CloudFormation way to set dynamic tag keys containing the cluster name on ASGs for Cluster Autoscaler autodiscovery.
Changes:
- Remove NodeGroupLaunchTemplate (its sole content was the broken TagSpecifications block; nodegroups use EKS-managed default template)
- Remove LaunchTemplate references from all 5 nodegroups
- Add ASGTaggingRole (IAM role with eks:DescribeNodegroup and autoscaling:CreateOrUpdateTags permissions)
- Add ASGTaggingFunction (Python 3.12 Lambda, inline ZipFile code) that tags each nodegroup's ASG with k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<ClusterName> after nodegroup creation
- Add 5 custom resources (TagSystem/Ui/BatchMedium/BatchLarge/BatchXlarge NodeGroup), each of which DependsOn its respective nodegroup
- Prerequisites: add repo clone instruction (git clone milaboratory/platforma-helm) so users know where infrastructure/aws/ comes from before Step 3
- Step 1: note to record the stack name, which is needed in Cleanup section
- Step 4: explain that platforma-license secret name and MI_LICENSE key are required by the chart and must not be changed
- Step 5: make directory instruction unconditional (explicit cd code block) rather than conditional "if you changed directories"
- Step 5: add callout after helm install explaining why --set-json is used for listen-ports (JSON array) and why tls.secretName="" is intentionally empty
- Cleanup: add inline fallback command to retrieve the stack name if forgotten
…mand
secretName defaults to empty in values-aws-s3.yaml; explicit --set is redundant and raised shell quoting concerns.
- namespace parameter: warn that all CLI commands assume default 'platforma'
- Step 5 REGION variable: remove 'same value as Step 2' implication
- cd commands: clarify path is relative to git clone location
- port-forward: add kubectl get svc fallback for service name discovery
- cleanup: clarify EFS mount targets vs filesystem in deletion warning
- prerequisites: add Unix shell requirement (bash/zsh; WSL/CloudShell for Windows)
- files table: add eksctl-cluster.yaml entry
Lambda tagger now sets Name=<ClusterName>/<NodegroupName> on each ASG with PropagateAtLaunch=true, so instances appear with descriptive names in the EC2 console (e.g. platforma-cluster/system, platforma-cluster/batch-medium).
…ilaboratory/platforma-helm into vladimir_antropov/MILAB-5645-aws
Remove manual ASG tagging loop from the infra deployer. EKS automatically tags managed nodegroup ASGs with eks:cluster-name, k8s.io/cluster-autoscaler/enabled, etc. — no need to replicate this in CodeBuild. Switch autoscaler autodiscovery from key-based matching (tag key contains cluster name) to value-based matching (tag=eks:cluster-name=<cluster-name>), which relies entirely on EKS-managed tags. Also removes the now-unnecessary IAM permissions (eks:ListNodegroups, eks:DescribeNodegroup, autoscaling:CreateOrUpdateTags) from the deployer role and cleans up redundant autoscaler tags from nodegroup resources and the launch template.
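The difference between the two autodiscovery modes described above can be sketched in plain Python. This is only an illustration of the matching logic, not Cluster Autoscaler's implementation; the tag dictionaries and the `platforma-cluster` name are hypothetical examples built from the tag names mentioned in this thread.

```python
def matches_key_based(asg_tags: dict[str, str], cluster_name: str) -> bool:
    """Key-based: the cluster name is embedded in a tag key; values are ignored."""
    required_keys = (
        "k8s.io/cluster-autoscaler/enabled",
        f"k8s.io/cluster-autoscaler/{cluster_name}",
    )
    return all(key in asg_tags for key in required_keys)


def matches_value_based(asg_tags: dict[str, str], cluster_name: str) -> bool:
    """Value-based: a fixed tag key whose value must equal the cluster name."""
    return asg_tags.get("eks:cluster-name") == cluster_name


# Hypothetical tag sets, for illustration only.
custom_tagged_asg = {
    "k8s.io/cluster-autoscaler/enabled": "true",
    "k8s.io/cluster-autoscaler/platforma-cluster": "owned",
}
eks_tagged_asg = {
    "eks:cluster-name": "platforma-cluster",
    "k8s.io/cluster-autoscaler/enabled": "true",
}

print(matches_key_based(custom_tagged_asg, "platforma-cluster"))
print(matches_value_based(eks_tagged_asg, "platforma-cluster"))
```

With value-based matching the tag key is static, so nothing has to generate a per-cluster key after the node groups exist.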
…ilaboratory/platforma-helm into vladimir_antropov/MILAB-5645-aws
Chart defaults to 8 CPU request, but m6a.2xlarge nodes only have ~7910m allocatable after kubelet reservation. Set request to 4 CPU (limit 8) so the pod can be scheduled on system nodes.
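The scheduling arithmetic behind this change can be checked with a short sketch. It assumes the ~7910m allocatable figure quoted above and ignores CPU already requested by other pods on the node, so it is an optimistic upper bound rather than a scheduler simulation; the helper names are hypothetical.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("8", "7910m", "0.5") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def fits_on_node(request: str, allocatable: str) -> bool:
    """A pod can only be scheduled if its CPU request fits within allocatable CPU."""
    return parse_cpu_millicores(request) <= parse_cpu_millicores(allocatable)


ALLOCATABLE = "7910m"  # approximate value quoted in the commit message

print(fits_on_node("8", ALLOCATABLE))  # chart default request: 8000m > 7910m
print(fits_on_node("4", ALLOCATABLE))  # overridden request: 4000m <= 7910m
```

Only the request counts toward scheduling; the limit of 8 CPU can exceed allocatable without blocking placement.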
…ilaboratory/platforma-helm into vladimir_antropov/MILAB-5645-aws
- Add vpc-cni, kube-proxy, coredns as AWS::EKS::Addon for version governance and upgrade safety
- Rewrite LDAP config to use YAML values file instead of --set-string flags, fixing comma-splitting (LDAP DNs) and word-splitting (passwords)
- Move image override to YAML values file for consistency
- Switch Platforma helm deploy from --wait to --atomic
- Improve DataLibrary region parameter descriptions for cross-region clarity
Add AllowedPattern regex validation to 13 CF parameters:
- ClusterName: alphanumeric, hyphens, underscores
- LdapServer: require ldap:// or ldaps:// scheme
- S3BucketName + DataLibrary{1,2,3}Bucket: S3 naming rules
- DataLibrary{1,2,3}Region: AWS region format (incl. GovCloud)
- DataLibrary{1,2,3}AccessKey: AKIA/ASIA prefix + 16 chars
- VpcId: vpc- prefix + hex
All patterns allow empty strings where the parameter is optional.
Catches typos and format errors at stack creation time instead of
failing 20+ minutes later in CodeBuild.
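A rough sketch of what such validation looks like, written as Python regular expressions. The patterns below are hypothetical approximations of the rules listed above, not the strings in the template; CloudFormation requires an AllowedPattern to match the entire parameter value, which `re.fullmatch` mirrors.

```python
import re

# Hypothetical patterns in the spirit of the commit message; the real
# AllowedPattern values in the template may differ.
PATTERNS = {
    "ClusterName": r"[A-Za-z0-9_-]+",                   # required parameter
    "LdapServer": r"(ldaps?://.+)?",                    # optional: empty allowed
    "DataLibraryAccessKey": r"((AKIA|ASIA)[A-Z0-9]{16})?",
    "VpcId": r"(vpc-[0-9a-f]+)?",
}


def is_valid(parameter: str, value: str) -> bool:
    """Return True if the whole value matches the parameter's pattern."""
    return re.fullmatch(PATTERNS[parameter], value) is not None


print(is_valid("LdapServer", "ldaps://ldap.example.com:636"))
print(is_valid("LdapServer", "ldap.example.com"))  # missing scheme
print(is_valid("VpcId", ""))                       # optional, so empty passes
```

Wrapping the optional patterns in `(...)?` is one way to allow empty strings while still rejecting malformed non-empty values.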
…ilaboratory/platforma-helm into vladimir_antropov/MILAB-5645-aws
- Add optimized screenshots (resized to 1200px, ~50% smaller)
- Replace single cf-parameters.png with 4 section-specific screenshots
- Add cf-outputs.png and desktop-app.png
- Sync advanced-installation instance types: m5→m6a, r5→r6a
- Sync pool names: batch-medium/large/xlarge → batch-16c-64g/32c-128g/etc
- Sync autoscaler tags: k8s.io/cluster-autoscaler → eks:cluster-name
- Bump default PlatformaVersion to 3.0.0-rc.19
eksctl-cluster.yaml:
- Instance types: m5→m6a, r5→r6a (match CF)
- Pool names: batch-medium/large/xlarge → batch-16c-64g/32c-128g/etc
- Max sizes: match CF "small" tier (4/2/1/2/1)
- Remove k8s.io/cluster-autoscaler/* tags (EKS auto-tags managed node groups)
- Add platforma tag, pool-specific nodegroup-type tags
values-aws-s3.yaml:
- Add kueue.maxJobResources (62 CPU / 500Gi)
- Add app.resources (requests 4/16Gi, limits 8/16Gi) matching CF
- Add jobServiceAccount section
- Update Kueue quotas to match CF "small" tier
- CF defaults: DeployPlatforma true, PlatformaVersion 3.0.0
- Advanced guide: admin→platforma username, add --atomic --timeout 15m
- Advanced guide: fix "Helm sub-charts" → "installed via CodeBuild"
- Advanced guide: update Step 11 connect instructions
- values-aws-s3.yaml: admin→platforma in comment
- README: reorder sections, sync defaults with CF
Without Version: !Ref KubernetesVersion on node groups, CloudFormation skips node group updates when KubernetesVersion changes. Also add KubernetesVersion to TriggerHelmDeploy properties so version changes auto-trigger the helm deployer. Tested on PlatformaProd35: EKS 1.34→1.35 upgrade succeeded with control plane, all node groups, and autoscaler v1.35.0.
- External DNS: zoneIdFilters is silently ignored by the chart. Use extraArgs[0]=--zone-id-filter instead (both CF and guide).
- ALB Controller: IAM policy URL was v2.11.0 but chart is v3.0.0. Updated to download v3.0.0 policy.
- Swapped Steps 4/5 in advanced guide: install External DNS before ALB Controller to avoid mutating webhook race condition.
- Added --atomic to autoscaler and External DNS helm installs.
- Added --set region and vpcId to ALB Controller install.
- Removed unused DOMAIN_FILTER variable.
Clear-writing pass on README.md and advanced-installation.md: active voice, cut redundancy, sharpen vague phrasing.
Bug fixes from blind agent review:
- S3 bucket creation fails in us-east-1 (LocationConstraint must be omitted)
- Autoscaler image tag v1.34.0 → v1.34.2 (matches chart 9.53.0 default)
- Add password placeholder warning in Step 10
- Add session variable recovery block before Step 10
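The us-east-1 bucket-creation fix mentioned above comes down to one conditional. The sketch below builds keyword arguments in the shape accepted by boto3's `s3.create_bucket`; it does not call AWS, and the helper name and bucket name are hypothetical. The guide itself uses the AWS CLI, so treat this as an illustration of the rule rather than the guide's actual command.

```python
def create_bucket_kwargs(bucket: str, region: str) -> dict:
    """Build arguments for an S3 CreateBucket call.

    us-east-1 is the default location, and the API rejects an explicit
    LocationConstraint for it, so the configuration block is omitted there.
    """
    kwargs: dict = {"Bucket": bucket}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# Hypothetical bucket name, for illustration only.
print(create_bucket_kwargs("platforma-example-1a2b3c4d", "us-east-1"))
print(create_bucket_kwargs("platforma-example-1a2b3c4d", "eu-central-1"))
```

The result could then be passed as `s3_client.create_bucket(**kwargs)` with a client created for the same region.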
blackcat
requested changes
Mar 6, 2026
dbolotin
reviewed
Mar 6, 2026
… validation
- HostedZoneId: use AWS::Route53::HostedZone::Id type for console dropdown
- Remove NoEcho from LicenseKey and DataLibrary AccessKey params (not secrets)
- Add Rules to validate DataLibrary AccessKey/SecretKey must be set together
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
S3 buckets are retained on stack deletion for data safety. With deterministic names (platforma-<cluster>-<account>), retrying after a failed deployment collides with the retained bucket. CF template: derive 8-char hex suffix from AWS::StackId UUID. Advanced guide: use openssl rand -hex 4. Both produce unique names per deployment attempt. Also adds S3BucketOutput to CF Outputs and bucket name recovery to both the Step 10 and Cleanup sections in the advanced guide.
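Both suffix strategies can be sketched in a few lines of Python. This assumes the template takes the first 8-hex-character group of the UUID that ends the stack ARN, which the commit message does not spell out; the function names and the example ARN are hypothetical.

```python
import re
import secrets


def suffix_from_stack_id(stack_id: str) -> str:
    """Return the first 8 hex characters of the UUID that ends a stack ARN.

    Assumes the usual arn:...:stack/<name>/<uuid> layout.
    """
    uuid_part = stack_id.rsplit("/", 1)[-1]
    suffix = uuid_part.split("-", 1)[0]
    if not re.fullmatch(r"[0-9a-f]{8}", suffix):
        raise ValueError(f"unexpected stack id format: {stack_id!r}")
    return suffix


def random_suffix() -> str:
    """Python counterpart of `openssl rand -hex 4`: 4 random bytes as 8 hex chars."""
    return secrets.token_hex(4)


example_stack_id = (
    "arn:aws:cloudformation:us-east-1:123456789012:"
    "stack/example-stack/1c2fa620-982a-11e3-aff7-50e2416294e0"
)
print(suffix_from_stack_id(example_stack_id))
print(len(random_suffix()))
```

Because a new stack gets a new StackId, a retried deployment yields a different suffix and no longer collides with the retained bucket.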
All node groups updated across CF template, eksctl template, README, and advanced installation guide.
…tputs
- UsersPasswordSSMConsole: direct link to SSM parameter page
- DefaultUsername: show the auto-generated username (platforma)
- All three password/username outputs conditional on AuthMethod=htpasswd
maxIO performance mode cannot be used with Elastic throughput mode (our default). Hardcode generalPurpose and remove the parameter.
…' into vladimir_antropov/MILAB-5645-aws
# Conflicts:
# infrastructure/aws/README.md
Diagram: remove Kueue intermediate node (show pools directly), define S3 node before desktop link so cylinder shape renders, keep EBS inside EKS subgraph. Outputs: add DefaultUsername, UsersPasswordSSMConsole. Fix stale S3 bucket description (AccountId → random).
LDAP distinguished names contain commas (e.g. cn=users,ou=groups,dc=example,dc=com). The buildspec split search rules on commas, destroying DNs. Switch to semicolons so multi-rule configs work without breaking DN syntax. Also update parameter description with group filter example.
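The delimiter problem is easy to reproduce. The sketch below is in Python rather than the buildspec's shell, and the rule strings are hypothetical DNs modelled on the example in the commit message.

```python
def split_rules(raw: str, separator: str = ";") -> list[str]:
    """Split a multi-rule parameter value, dropping empty entries."""
    return [rule.strip() for rule in raw.split(separator) if rule.strip()]


# Two hypothetical search rules, each a full distinguished name.
raw_value = "cn=users,ou=groups,dc=example,dc=com;cn=admins,ou=groups,dc=example,dc=com"

print(split_rules(raw_value, ","))  # old behaviour: DNs torn into fragments
print(split_rules(raw_value, ";"))  # new behaviour: one entry per rule
```

Semicolons work as a separator here because the DNs themselves are left intact, whereas commas are part of DN syntax.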
No description provided.