feat: aws deployment docs #1

Open
mike-ainsel wants to merge 124 commits into main from vladimir_antropov/MILAB-5645-aws

Conversation

@mike-ainsel

No description provided.


mike-ainsel force-pushed the vladimir_antropov/MILAB-5645-aws branch 4 times, most recently from 44a2050 to 1411628 on February 20, 2026 14:59
Split README.md into two guides:
- Part A (README.md): CloudFormation point-and-click path with mermaid
  diagrams, 11-step walkthrough, verification checklist
- Part B (advanced-installation.md): Manual CLI path with configuration
  block for all variables, inline IAM policy generation, scripted ACM
  validation — works for both human operators and AI agents

CloudFormation changes:
- Route53 (HostedZoneId, DomainName) mandatory — Desktop App requires TLS
- S3 bucket always created (removed CreateS3Bucket toggle)
- All IRSA roles always created (removed CreateAutoscalerRole,
  CreateALBControllerRole, CreateExternalDNSRole toggles)
- Fix PrivateSubnetIds default to ',,' (CloudFormation !Select workaround)

Other:
- Add External DNS and Auto Scaling permissions to permissions.md
- Remove orphaned cluster-autoscaler-policy.json
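The `',,'` default mentioned above works because a `CommaDelimitedList` default of two commas parses into three empty items, so `!Select` on any of the three indices resolves even when the user supplies no subnets. A minimal sketch (parameter name from this PR; the surrounding template is assumed):

```yaml
Parameters:
  PrivateSubnetIds:
    Type: CommaDelimitedList
    # ',,' -> ['', '', ''] so !Select [0..2, !Ref PrivateSubnetIds]
    # never fails with "index out of bounds" when the field is left as-is
    Default: ',,'
```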
mike-ainsel force-pushed the vladimir_antropov/MILAB-5645-aws branch from 1411628 to 2409034 on February 20, 2026 15:01
- Remove restrictive S3 bucket policy deny statement that locked out
  account administrators after stack deletion (IRSA role deleted with
  stack, leaving only root with access)
- Add DependsOn: VpcGatewayAttachment to NatGateway (prevents
  intermittent creation failure)
- Change EFS security group from CIDR-based to cluster SG reference
  (fixes existing-VPC path, tightens security)
- Add DeletionPolicy: Retain to EFS filesystem (consistent with S3,
  prevents workspace data loss)
- Add CloudWatch log group with 30-day retention (prevents orphaned
  log group after stack deletion)
- Add VpcId and NodeGroupRoleArn to Outputs
- Improve cleanup docs with deletion order warning, EFS and log group
  cleanup steps
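The `DependsOn` and `DeletionPolicy` changes above might look like this in the template (resource names here are illustrative, not necessarily the ones in the actual stack):

```yaml
Resources:
  NatGateway:
    Type: AWS::EC2::NatGateway
    # Without this, CloudFormation may create the NAT gateway before the
    # internet gateway is attached, causing intermittent failures.
    DependsOn: VpcGatewayAttachment
    Properties:
      AllocationId: !GetAtt NatEip.AllocationId
      SubnetId: !Ref PublicSubnetA

  EfsFileSystem:
    Type: AWS::EFS::FileSystem
    # Survives stack deletion, consistent with the S3 bucket — workspace
    # data must be removed manually (see Cleanup).
    DeletionPolicy: Retain
```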
mike-ainsel force-pushed the vladimir_antropov/MILAB-5645-aws branch from 59c2cee to 1fd8c39 on February 20, 2026 17:36
- Simplify architecture diagram: remove Cluster Autoscaler box, move EBS
  inside EKS boundary, add Desktop App → S3 data access link
- Replace three-phases mermaid diagram with plain text
- Add UiNodeAlwaysOn CloudFormation parameter: keeps 1 t3.xlarge running
  permanently (~$200/month) to eliminate the ~2-3 min cold start
- Expand DNS/TLS section: explain what a hosted zone is, link to new
  domain-guide.md
- Add AWS CLI authentication instructions to Step 2 (kubectl configure)
- Add domain-guide.md: how to register a domain via Route53, find hosted
  zone ID, use existing domains
Now that CA, ALB Controller, and External DNS are bundled as sub-charts
in the platforma chart, collapse 11 install steps down to 8:

- Remove Step 3 (StorageClasses): now created by helm via storageClasses.*
- Remove Steps 4-6 (CA, ALB, ExternalDNS separate helm installs): now
  sub-charts enabled via --set in a single helm install command
- Step 10 becomes Step 7, includes all sub-chart --set flags and
  storageClasses.efs.fileSystemId

Add ASG tagging step (new Step 3): CloudFormation cannot set dynamic
tag keys, so node group ASGs still need the k8s.io/cluster-autoscaler/<name>
tag added post-deploy.

Update values-aws-s3.yaml: add storageClasses + sub-chart sections with
static AWS defaults; dynamic values (ARNs, IDs, names) remain as --set flags.

Update cleanup: platforma uninstall covers sub-charts (no separate releases).
…ace IRSA

CloudFormation:
- Add NodeGroupLaunchTemplate with TagSpecifications targeting
  ResourceType: auto-scaling-group — sets k8s.io/cluster-autoscaler/<name>=owned
  on the ASG itself at stack creation time, eliminating the post-deploy bash loop
- Fix IRSA trust policies for cluster-autoscaler, aws-load-balancer-controller,
  and external-dns: change namespace from kube-system to PlatformaNamespace so
  sub-charts deployed in the platforma namespace can assume their roles
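The IRSA trust-policy fix boils down to the `sub` condition: it must name the namespace the service account actually lives in. A sketch of the corrected condition (account ID, OIDC issuer, and service-account name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::<account-id>:oidc-provider/<oidc-issuer>"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "<oidc-issuer>:sub": "system:serviceaccount:platforma:cluster-autoscaler"
      }
    }
  }]
}
```

With `kube-system` in the `sub` value, a pod in the `platforma` namespace gets an `AccessDenied` on `AssumeRoleWithWebIdentity` even though everything else is wired correctly.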

Helm install (README):
- 11 steps → 6 steps: ASG tagging removed (CF), SA creation removed (helm),
  Steps 3-4 eliminated, namespace created via --create-namespace
- Pass PlatformaRoleArn via serviceAccount.annotations (chart creates SA)
- Add aws-load-balancer-controller.region to the install command

values-aws-s3.yaml:
- Add explicit serviceAccount names for sub-charts (must match CF trust policies)
- Add serviceAccount.annotations for Platforma SA
- Add aws-load-balancer-controller.region

advanced-installation.md:
- Clarify this is the "operator manages everything" path
- Update helm install to disable sub-charts and StorageClasses
  (managed manually in previous steps)
- README: separate CF outputs from user-supplied values in Step 5 preamble
- README: add txtOwnerId explanation
- README: fix verification checklist — controllers deploy to platforma namespace (not kube-system)
- README: fix troubleshooting CA logs namespace (kube-system → platforma)
- README: add working directory note for Step 3 (Kueue)
- README: cleanup section — define variables, replace hardcoded stack name
- CF: fix EfsFileSystemId output description (wrong values path)
- CF: update PostDeploySteps to match 6-step README structure
- advanced-installation.md: add Python 3 to prerequisites
README:
- Step 1: clarify Outputs table covers only install-relevant outputs; note remaining are infrastructure-only
- Step 1: warn subnet ID fields show ',,' by default — do not clear, it is a CF workaround
- Step 4: explain why namespace must be pre-created (license secret must exist before Platforma starts)
- Step 5: add note that chart creates Kueue queue resources (ClusterQueues, LocalQueues, ResourceFlavors)
- Step 5: move ALB wait note before verify commands; note empty ADDRESS is normal while ALB provisions

CloudFormation:
- PostDeploySteps output: move step summary into Value field (visible in Outputs tab); Description kept as pointer to README
- Step 5: remove contradicting "namespace created automatically" — namespace
  must be pre-created (Step 4); service accounts are still auto-created
- Step 5: clarify DOMAIN_FILTER is the hosted zone domain, not always the root;
  add sub-zone case and warn that wrong value causes silent ExternalDNS failure
- Cleanup: drain AppWrapper CRs before deleting the controller to prevent
  CRD finalizer hang
- CloudFormation PostDeploySteps: note "(step 1 = this stack)" so the value
  is self-contained when read in the Outputs tab
- Phase 2 summary: fix stale "namespace created automatically" — clarify
  namespace and license secret are created manually in Steps 3-4
- Step 6: port-forward note — Desktop App supports non-TLS to localhost,
  no certificate needed for this mode
- Cleanup: add data loss warning before S3/EFS delete commands
- Troubleshooting: add CloudFormation stack stuck in CREATE_IN_PROGRESS
  section covering ACM certificate validation failure (most common cause)
- CloudFormation PlatformaNamespace output: clarify it is a reference only,
  created via kubectl in Step 4, not by the stack
- Step 4: rename heading to "Create namespace and license secret"
- Step 5: remove --create-namespace flag — namespace is guaranteed by Step 4;
  if missing, helm fails with a clear error rather than silently proceeding
  without the license secret
- Troubleshooting: add "kubectl get nodes returns no nodes or NotReady" entry
  covering wrong region and IAM permission issues
- PostDeploySteps CF output: rename step 4 to "create namespace and
  license secret" to match README Step 4 heading
- Step 2: align placeholder style to use $REGION (consistent with Step 5);
  add note to set it early for reuse
- Step 2: add timing caveat — nodes may still be initializing after stack
  completes; wait ~1 min if NotReady
- Step 5: clarify values-aws-s3.yaml ships in the repo at that path
- Step 6: explain port 6345 is Platforma's gRPC port
- Step 2: add explicit REGION= assignment before kubeconfig command;
  remove the "set it now if you want" suggestion — now a directive
- Step 2: fix node count assertion to "the number you configured (2 by
  default)" so it's accurate when SystemNodeCount was changed
- Step 3: add explicit `cd infrastructure/aws` command; note that
  kueue-values.yaml is pre-configured, no edits needed
- Step 5: replace "Run from..." with cd reminder; extract DOMAIN_FILTER
  explanation from inside the code block into prose callout before it;
  simplify inline DOMAIN_FILTER comment to reference the note above
- Cleanup: explain that S3 and EFS use DeletionPolicy: Retain in the CF
  template, so manual deletion is required (not just "for data safety")
- values-aws-s3.yaml: add ingress section with placeholder comments;
  document that tls.secretName="" means ALB/ACM manages TLS
…h Lambda

EC2 launch template TagSpecifications do not support 'auto-scaling-group'
as a ResourceType — the API rejects it. EKS nodegroup Tags also do not
propagate to the underlying ASG. A Lambda-backed custom resource is the
only pure-CloudFormation way to set dynamic tag keys containing the
cluster name on ASGs for Cluster Autoscaler autodiscovery.

Changes:
- Remove NodeGroupLaunchTemplate (its sole content was the broken
  TagSpecifications block; nodegroups use EKS-managed default template)
- Remove LaunchTemplate references from all 5 nodegroups
- Add ASGTaggingRole (IAM role with eks:DescribeNodegroup and
  autoscaling:CreateOrUpdateTags permissions)
- Add ASGTaggingFunction (Python 3.12 Lambda, inline ZipFile code)
  that tags each nodegroup's ASG with k8s.io/cluster-autoscaler/enabled
  and k8s.io/cluster-autoscaler/<ClusterName> after nodegroup creation
- Add 5 custom resources (TagSystem/Ui/BatchMedium/BatchLarge/BatchXlarge
  NodeGroup) — each DependsOn its respective nodegroup
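The tag payload the Lambda sends might look like this pure-Python sketch; the real handler wraps it in a boto3 `autoscaling:CreateOrUpdateTags` call plus a CloudFormation custom-resource response (function and argument names here are hypothetical):

```python
def asg_autoscaler_tags(cluster_name: str, asg_name: str) -> list[dict]:
    """Build tags in the shape CreateOrUpdateTags expects for one ASG."""
    def tag(key: str, value: str) -> dict:
        return {
            "ResourceId": asg_name,
            "ResourceType": "auto-scaling-group",
            "Key": key,
            "Value": value,
            # Autodiscovery tags only need to live on the ASG itself,
            # not on launched instances.
            "PropagateAtLaunch": False,
        }

    return [
        tag("k8s.io/cluster-autoscaler/enabled", "true"),
        # Dynamic key containing the cluster name — the part CloudFormation
        # cannot express natively, hence the Lambda.
        tag(f"k8s.io/cluster-autoscaler/{cluster_name}", "owned"),
    ]
```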
- Prerequisites: add repo clone instruction (git clone milaboratory/platforma-helm)
  so users know where infrastructure/aws/ comes from before Step 3
- Step 1: note to record the stack name — needed in Cleanup section
- Step 4: explain that platforma-license secret name and MI_LICENSE key are
  required by the chart and must not be changed
- Step 5: make directory instruction unconditional (explicit cd code block)
  rather than conditional "if you changed directories"
- Step 5: add callout after helm install explaining why --set-json is used
  for listen-ports (JSON array) and why tls.secretName="" is intentionally empty
- Cleanup: add inline fallback command to retrieve the stack name if forgotten
…mand

secretName defaults to empty in values-aws-s3.yaml; explicit --set is
redundant and raised shell quoting concerns.
- namespace parameter: warn that all CLI commands assume default 'platforma'
- Step 5 REGION variable: remove 'same value as Step 2' implication
- cd commands: clarify path is relative to git clone location
- port-forward: add kubectl get svc fallback for service name discovery
- cleanup: clarify EFS mount targets vs filesystem in deletion warning
- prerequisites: add Unix shell requirement (bash/zsh; WSL/CloudShell for Windows)
- files table: add eksctl-cluster.yaml entry
Lambda tagger now sets Name=<ClusterName>/<NodegroupName> on each ASG
with PropagateAtLaunch=true, so instances appear with descriptive names
in the EC2 console (e.g. platforma-cluster/system, platforma-cluster/batch-medium).
mike-ainsel and others added 15 commits March 4, 2026 00:37
…ilaboratory/platforma-helm into vladimir_antropov/MILAB-5645-aws
Remove manual ASG tagging loop from the infra deployer. EKS
automatically tags managed nodegroup ASGs with eks:cluster-name,
k8s.io/cluster-autoscaler/enabled, etc. — no need to replicate this
in CodeBuild.

Switch autoscaler autodiscovery from key-based matching
(tag key contains cluster name) to value-based matching
(tag=eks:cluster-name=<cluster-name>), which relies entirely on
EKS-managed tags.

Also removes the now-unnecessary IAM permissions
(eks:ListNodegroups, eks:DescribeNodegroup, autoscaling:CreateOrUpdateTags)
from the deployer role and cleans up redundant autoscaler tags from
nodegroup resources and the launch template.
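In values terms, the switch from key-based to value-based autodiscovery might look like this (assuming the upstream cluster-autoscaler chart's `autoDiscovery.tags` key; the cluster name is a placeholder):

```yaml
cluster-autoscaler:
  autoDiscovery:
    # Value-based match on the tag EKS applies to every managed
    # nodegroup ASG automatically — no custom tagging needed.
    tags:
      - eks:cluster-name=platforma-cluster
```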
…ilaboratory/platforma-helm into vladimir_antropov/MILAB-5645-aws
Chart defaults to 8 CPU request, but m6a.2xlarge nodes only have
~7910m allocatable after kubelet reservation. Set request to 4 CPU
(limit 8) so the pod can be scheduled on system nodes.
…ilaboratory/platforma-helm into vladimir_antropov/MILAB-5645-aws
- Add vpc-cni, kube-proxy, coredns as AWS::EKS::Addon for version
  governance and upgrade safety
- Rewrite LDAP config to use YAML values file instead of --set-string
  flags, fixing comma-splitting (LDAP DNs) and word-splitting (passwords)
- Move image override to YAML values file for consistency
- Switch Platforma helm deploy from --wait to --atomic
- Improve DataLibrary region parameter descriptions for cross-region clarity
Add AllowedPattern regex validation to 13 CF parameters:
- ClusterName: alphanumeric, hyphens, underscores
- LdapServer: require ldap:// or ldaps:// scheme
- S3BucketName + DataLibrary{1,2,3}Bucket: S3 naming rules
- DataLibrary{1,2,3}Region: AWS region format (incl. GovCloud)
- DataLibrary{1,2,3}AccessKey: AKIA/ASIA prefix + 16 chars
- VpcId: vpc- prefix + hex

All patterns allow empty strings where the parameter is optional.
Catches typos and format errors at stack creation time instead of
failing 20+ minutes later in CodeBuild.
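As a sketch, two of the patterns described above could be expressed like this (illustrative regexes, not necessarily the exact ones in the template):

```yaml
Parameters:
  ClusterName:
    Type: String
    AllowedPattern: '^[A-Za-z0-9][A-Za-z0-9_-]*$'
    ConstraintDescription: Alphanumeric, hyphens, and underscores only.

  S3BucketName:
    Type: String
    # Leading ^$| alternative allows the empty string, since the
    # parameter is optional.
    AllowedPattern: '^$|^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
    ConstraintDescription: Must follow S3 bucket naming rules, or be empty.
```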
…ilaboratory/platforma-helm into vladimir_antropov/MILAB-5645-aws
- Add optimized screenshots (resized to 1200px, ~50% smaller)
- Replace single cf-parameters.png with 4 section-specific screenshots
- Add cf-outputs.png and desktop-app.png
- Sync advanced-installation instance types: m5→m6a, r5→r6a
- Sync pool names: batch-medium/large/xlarge → batch-16c-64g/32c-128g/etc
- Sync autoscaler tags: k8s.io/cluster-autoscaler → eks:cluster-name
- Bump default PlatformaVersion to 3.0.0-rc.19
eksctl-cluster.yaml:
- Instance types: m5→m6a, r5→r6a (match CF)
- Pool names: batch-medium/large/xlarge → batch-16c-64g/32c-128g/etc
- Max sizes: match CF "small" tier (4/2/1/2/1)
- Remove k8s.io/cluster-autoscaler/* tags (EKS auto-tags managed node groups)
- Add platforma tag, pool-specific nodegroup-type tags

values-aws-s3.yaml:
- Add kueue.maxJobResources (62 CPU / 500Gi)
- Add app.resources (requests 4/16Gi, limits 8/16Gi) matching CF
- Add jobServiceAccount section
- Update Kueue quotas to match CF "small" tier
- CF defaults: DeployPlatforma true, PlatformaVersion 3.0.0
- Advanced guide: admin→platforma username, add --atomic --timeout 15m
- Advanced guide: fix "Helm sub-charts" → "installed via CodeBuild"
- Advanced guide: update Step 11 connect instructions
- values-aws-s3.yaml: admin→platforma in comment
- README: reorder sections, sync defaults with CF
Without Version: !Ref KubernetesVersion on node groups,
CloudFormation skips node group updates when KubernetesVersion
changes. Also add KubernetesVersion to TriggerHelmDeploy
properties so version changes auto-trigger the helm deployer.

Tested on PlatformaProd35: EKS 1.34→1.35 upgrade succeeded
with control plane, all node groups, and autoscaler v1.35.0.
- External DNS: zoneIdFilters is silently ignored by the chart.
  Use extraArgs[0]=--zone-id-filter instead (both CF and guide).
- ALB Controller: IAM policy URL was v2.11.0 but chart is v3.0.0.
  Updated to download v3.0.0 policy.
- Swapped Steps 4/5 in advanced guide: install External DNS before
  ALB Controller to avoid mutating webhook race condition.
- Added --atomic to autoscaler and External DNS helm installs.
- Added --set region and vpcId to ALB Controller install.
- Removed unused DOMAIN_FILTER variable.
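The External DNS workaround amounts to bypassing the ignored values key and passing the controller flag directly (zone ID below is a placeholder):

```yaml
external-dns:
  # zoneIdFilters: [...] is silently ignored by this chart version;
  # pass the controller flag via extraArgs instead.
  extraArgs:
    - --zone-id-filter=Z0123456789ABCDEFGHIJ
```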
Clear-writing pass on README.md and advanced-installation.md:
active voice, cut redundancy, sharpen vague phrasing.

Bug fixes from blind agent review:
- S3 bucket creation fails in us-east-1 (LocationConstraint must be omitted)
- Autoscaler image tag v1.34.0 → v1.34.2 (matches chart 9.53.0 default)
- Add password placeholder warning in Step 10
- Add session variable recovery block before Step 10
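The us-east-1 fix follows from an S3 API quirk: `CreateBucket` rejects an explicit `LocationConstraint` of `us-east-1`, so the configuration block must be omitted entirely in that region. A testable sketch of the logic (helper name is hypothetical; the dict mirrors boto3 `create_bucket` kwargs):

```python
def create_bucket_args(bucket: str, region: str) -> dict:
    """Build CreateBucket arguments; us-east-1 must omit LocationConstraint."""
    args = {"Bucket": bucket}
    if region != "us-east-1":
        # Any other region must be named explicitly.
        args["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return args
```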
dbolotin and others added 12 commits March 6, 2026 17:46
… validation

- HostedZoneId: use AWS::Route53::HostedZone::Id type for console dropdown
- Remove NoEcho from LicenseKey and DataLibrary AccessKey params (not secrets)
- Add Rules to validate DataLibrary AccessKey/SecretKey must be set together

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
S3 buckets are retained on stack deletion for data safety.
With deterministic names (platforma-<cluster>-<account>), retrying
after a failed deployment collides with the retained bucket.

CF template: derive 8-char hex suffix from AWS::StackId UUID.
Advanced guide: use openssl rand -hex 4.
Both produce unique names per deployment attempt.

Also adds S3BucketOutput to CF Outputs and bucket name recovery
to both the Step 10 and Cleanup sections in the advanced guide.
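Deriving the suffix from `AWS::StackId` can be sketched with nested intrinsics — the stack ID's trailing UUID starts with an 8-char hex segment (resource and parameter names here are illustrative):

```yaml
Resources:
  PlatformaBucket:
    Type: AWS::S3::Bucket
    Properties:
      # AWS::StackId = arn:aws:cloudformation:<region>:<acct>:stack/<name>/<uuid>
      # Split on '/' -> take the UUID, split on '-' -> take its first
      # 8-hex-char segment, unique per stack creation attempt.
      BucketName: !Join
        - '-'
        - - !Sub 'platforma-${ClusterName}'
          - !Select [0, !Split ['-', !Select [2, !Split ['/', !Ref 'AWS::StackId']]]]
```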
All node groups updated across CF template, eksctl template,
README, and advanced installation guide.
…tputs

- UsersPasswordSSMConsole: direct link to SSM parameter page
- DefaultUsername: show the auto-generated username (platforma)
- All three password/username outputs conditional on AuthMethod=htpasswd
maxIO performance mode cannot be used with Elastic throughput mode
(our default). Hardcode generalPurpose and remove the parameter.
…' into vladimir_antropov/MILAB-5645-aws

# Conflicts:
#	infrastructure/aws/README.md
Diagram: remove Kueue intermediate node (show pools directly),
define S3 node before desktop link so cylinder shape renders,
keep EBS inside EKS subgraph.

Outputs: add DefaultUsername, UsersPasswordSSMConsole.
Fix stale S3 bucket description (AccountId → random).
LDAP distinguished names contain commas (e.g. cn=users,ou=groups,dc=example,dc=com).
The buildspec split search rules on commas, destroying DNs. Switch to semicolons
so multi-rule configs work without breaking DN syntax.

Also update parameter description with group filter example.
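The semicolon split preserves commas inside each DN, which the old comma split destroyed. A minimal sketch of the buildspec's splitting logic (function name is hypothetical):

```python
def split_search_rules(raw: str) -> list[str]:
    """Split multi-rule LDAP config on ';' so DN-internal commas survive."""
    return [rule.strip() for rule in raw.split(";") if rule.strip()]
```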
mike-ainsel changed the base branch from feat/new-chart to main on March 7, 2026 12:17