feat: aws deployment docs #1

Open
mike-ainsel wants to merge 124 commits into main from vladimir_antropov/MILAB-5645-aws

Conversation

@mike-ainsel

No description provided.


mike-ainsel force-pushed the vladimir_antropov/MILAB-5645-aws branch 4 times, most recently from 44a2050 to 1411628 on February 20, 2026 14:59
Split README.md into two guides:
- Part A (README.md): CloudFormation point-and-click path with mermaid
  diagrams, 11-step walkthrough, verification checklist
- Part B (advanced-installation.md): Manual CLI path with configuration
  block for all variables, inline IAM policy generation, scripted ACM
  validation — works for both human operators and AI agents

CloudFormation changes:
- Route53 (HostedZoneId, DomainName) mandatory — Desktop App requires TLS
- S3 bucket always created (removed CreateS3Bucket toggle)
- All IRSA roles always created (removed CreateAutoscalerRole,
  CreateALBControllerRole, CreateExternalDNSRole toggles)
- Fix PrivateSubnetIds default to ',,' (CloudFormation !Select workaround)

Other:
- Add External DNS and Auto Scaling permissions to permissions.md
- Remove orphaned cluster-autoscaler-policy.json
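The `',,'` default mentioned above works because a `CommaDelimitedList` default of two commas parses into three empty items, so `!Select` on any of the three indices resolves even when the user supplies no subnets. A minimal sketch (parameter name from this PR; the surrounding template is assumed):

```yaml
Parameters:
  PrivateSubnetIds:
    Type: CommaDelimitedList
    # ',,' -> ['', '', ''] so !Select [0..2, !Ref PrivateSubnetIds]
    # never fails with "index out of bounds" when the field is left as-is
    Default: ',,'
```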
mike-ainsel force-pushed the vladimir_antropov/MILAB-5645-aws branch from 1411628 to 2409034 on February 20, 2026 15:01
- Remove restrictive S3 bucket policy deny statement that locked out
  account administrators after stack deletion (IRSA role deleted with
  stack, leaving only root with access)
- Add DependsOn: VpcGatewayAttachment to NatGateway (prevents
  intermittent creation failure)
- Change EFS security group from CIDR-based to cluster SG reference
  (fixes existing-VPC path, tightens security)
- Add DeletionPolicy: Retain to EFS filesystem (consistent with S3,
  prevents workspace data loss)
- Add CloudWatch log group with 30-day retention (prevents orphaned
  log group after stack deletion)
- Add VpcId and NodeGroupRoleArn to Outputs
- Improve cleanup docs with deletion order warning, EFS and log group
  cleanup steps
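The `DependsOn` and `DeletionPolicy` changes above might look like this in the template (resource names here are illustrative, not necessarily the ones in the actual stack):

```yaml
Resources:
  NatGateway:
    Type: AWS::EC2::NatGateway
    # Without this, CloudFormation may create the NAT gateway before the
    # internet gateway is attached, causing intermittent failures.
    DependsOn: VpcGatewayAttachment
    Properties:
      AllocationId: !GetAtt NatEip.AllocationId
      SubnetId: !Ref PublicSubnetA

  EfsFileSystem:
    Type: AWS::EFS::FileSystem
    # Survives stack deletion, consistent with the S3 bucket — workspace
    # data must be removed manually (see Cleanup).
    DeletionPolicy: Retain
```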
mike-ainsel force-pushed the vladimir_antropov/MILAB-5645-aws branch from 59c2cee to 1fd8c39 on February 20, 2026 17:36
- Simplify architecture diagram: remove Cluster Autoscaler box, move EBS
  inside EKS boundary, add Desktop App → S3 data access link
- Replace three-phases mermaid diagram with plain text
- Add UiNodeAlwaysOn CloudFormation parameter: keeps 1 t3.xlarge running
  permanently (~$200/month) to eliminate the ~2-3 min cold start
- Expand DNS/TLS section: explain what a hosted zone is, link to new
  domain-guide.md
- Add AWS CLI authentication instructions to Step 2 (kubectl configure)
- Add domain-guide.md: how to register a domain via Route53, find hosted
  zone ID, use existing domains
Now that CA, ALB Controller, and External DNS are bundled as sub-charts
in the platforma chart, collapse 11 install steps down to 8:

- Remove Step 3 (StorageClasses): now created by helm via storageClasses.*
- Remove Steps 4-6 (CA, ALB, ExternalDNS separate helm installs): now
  sub-charts enabled via --set in a single helm install command
- Step 10 becomes Step 7, includes all sub-chart --set flags and
  storageClasses.efs.fileSystemId

Add ASG tagging step (new Step 3): CloudFormation cannot set dynamic
tag keys, so node group ASGs still need the k8s.io/cluster-autoscaler/<name>
tag added post-deploy.

Update values-aws-s3.yaml: add storageClasses + sub-chart sections with
static AWS defaults; dynamic values (ARNs, IDs, names) remain as --set flags.

Update cleanup: platforma uninstall covers sub-charts (no separate releases).
…ace IRSA

CloudFormation:
- Add NodeGroupLaunchTemplate with TagSpecifications targeting
  ResourceType: auto-scaling-group — sets k8s.io/cluster-autoscaler/<name>=owned
  on the ASG itself at stack creation time, eliminating the post-deploy bash loop
- Fix IRSA trust policies for cluster-autoscaler, aws-load-balancer-controller,
  and external-dns: change namespace from kube-system to PlatformaNamespace so
  sub-charts deployed in the platforma namespace can assume their roles
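The IRSA trust-policy fix boils down to the `sub` condition: it must name the namespace the service account actually lives in. A sketch of the corrected condition (account ID, OIDC issuer, and service-account name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::<account-id>:oidc-provider/<oidc-issuer>"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "<oidc-issuer>:sub": "system:serviceaccount:platforma:cluster-autoscaler"
      }
    }
  }]
}
```

With `kube-system` in the `sub` value, a pod in the `platforma` namespace gets an `AccessDenied` on `AssumeRoleWithWebIdentity` even though everything else is wired correctly.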

Helm install (README):
- 11 steps → 6 steps: ASG tagging removed (CF), SA creation removed (helm),
  Steps 3-4 eliminated, namespace created via --create-namespace
- Pass PlatformaRoleArn via serviceAccount.annotations (chart creates SA)
- Add aws-load-balancer-controller.region to the install command

values-aws-s3.yaml:
- Add explicit serviceAccount names for sub-charts (must match CF trust policies)
- Add serviceAccount.annotations for Platforma SA
- Add aws-load-balancer-controller.region

advanced-installation.md:
- Clarify this is the "operator manages everything" path
- Update helm install to disable sub-charts and StorageClasses
  (managed manually in previous steps)
- README: separate CF outputs from user-supplied values in Step 5 preamble
- README: add txtOwnerId explanation
- README: fix verification checklist — controllers deploy to platforma namespace (not kube-system)
- README: fix troubleshooting CA logs namespace (kube-system → platforma)
- README: add working directory note for Step 3 (Kueue)
- README: cleanup section — define variables, replace hardcoded stack name
- CF: fix EfsFileSystemId output description (wrong values path)
- CF: update PostDeploySteps to match 6-step README structure
- advanced-installation.md: add Python 3 to prerequisites
README:
- Step 1: clarify Outputs table covers only install-relevant outputs; note remaining are infrastructure-only
- Step 1: warn subnet ID fields show ',,' by default — do not clear, it is a CF workaround
- Step 4: explain why namespace must be pre-created (license secret must exist before Platforma starts)
- Step 5: add note that chart creates Kueue queue resources (ClusterQueues, LocalQueues, ResourceFlavors)
- Step 5: move ALB wait note before verify commands; note empty ADDRESS is normal while ALB provisions

CloudFormation:
- PostDeploySteps output: move step summary into Value field (visible in Outputs tab); Description kept as pointer to README
- Step 5: remove contradicting "namespace created automatically" — namespace
  must be pre-created (Step 4); service accounts are still auto-created
- Step 5: clarify DOMAIN_FILTER is the hosted zone domain, not always the root;
  add sub-zone case and warn that wrong value causes silent ExternalDNS failure
- Cleanup: drain AppWrapper CRs before deleting the controller to prevent
  CRD finalizer hang
- CloudFormation PostDeploySteps: note "(step 1 = this stack)" so the value
  is self-contained when read in the Outputs tab
- Phase 2 summary: fix stale "namespace created automatically" — clarify
  namespace and license secret are created manually in Steps 3-4
- Step 6: port-forward note — Desktop App supports non-TLS to localhost,
  no certificate needed for this mode
- Cleanup: add data loss warning before S3/EFS delete commands
- Troubleshooting: add CloudFormation stack stuck in CREATE_IN_PROGRESS
  section covering ACM certificate validation failure (most common cause)
- CloudFormation PlatformaNamespace output: clarify it is a reference only,
  created via kubectl in Step 4, not by the stack
- Step 4: rename heading to "Create namespace and license secret"
- Step 5: remove --create-namespace flag — namespace is guaranteed by Step 4;
  if missing, helm fails with a clear error rather than silently proceeding
  without the license secret
- Troubleshooting: add "kubectl get nodes returns no nodes or NotReady" entry
  covering wrong region and IAM permission issues
- PostDeploySteps CF output: rename step 4 to "create namespace and
  license secret" to match README Step 4 heading
- Step 2: align placeholder style to use $REGION (consistent with Step 5);
  add note to set it early for reuse
- Step 2: add timing caveat — nodes may still be initializing after stack
  completes; wait ~1 min if NotReady
- Step 5: clarify values-aws-s3.yaml ships in the repo at that path
- Step 6: explain port 6345 is Platforma's gRPC port
- Step 2: add explicit REGION= assignment before kubeconfig command;
  remove the "set it now if you want" suggestion — now a directive
- Step 2: fix node count assertion to "the number you configured (2 by
  default)" so it's accurate when SystemNodeCount was changed
- Step 3: add explicit `cd infrastructure/aws` command; note that
  kueue-values.yaml is pre-configured, no edits needed
- Step 5: replace "Run from..." with cd reminder; extract DOMAIN_FILTER
  explanation from inside the code block into prose callout before it;
  simplify inline DOMAIN_FILTER comment to reference the note above
- Cleanup: explain that S3 and EFS use DeletionPolicy: Retain in the CF
  template, so manual deletion is required (not just "for data safety")
- values-aws-s3.yaml: add ingress section with placeholder comments;
  document that tls.secretName="" means ALB/ACM manages TLS
…h Lambda

EC2 launch template TagSpecifications do not support 'auto-scaling-group'
as a ResourceType — the API rejects it. EKS nodegroup Tags also do not
propagate to the underlying ASG. A Lambda-backed custom resource is the
only pure-CloudFormation way to set dynamic tag keys containing the
cluster name on ASGs for Cluster Autoscaler autodiscovery.

Changes:
- Remove NodeGroupLaunchTemplate (its sole content was the broken
  TagSpecifications block; nodegroups use EKS-managed default template)
- Remove LaunchTemplate references from all 5 nodegroups
- Add ASGTaggingRole (IAM role with eks:DescribeNodegroup and
  autoscaling:CreateOrUpdateTags permissions)
- Add ASGTaggingFunction (Python 3.12 Lambda, inline ZipFile code)
  that tags each nodegroup's ASG with k8s.io/cluster-autoscaler/enabled
  and k8s.io/cluster-autoscaler/<ClusterName> after nodegroup creation
- Add 5 custom resources (TagSystem/Ui/BatchMedium/BatchLarge/BatchXlarge
  NodeGroup) — each DependsOn its respective nodegroup
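The tag payload the Lambda sends might look like this pure-Python sketch; the real handler wraps it in a boto3 `autoscaling:CreateOrUpdateTags` call plus a CloudFormation custom-resource response (function and argument names here are hypothetical):

```python
def asg_autoscaler_tags(cluster_name: str, asg_name: str) -> list[dict]:
    """Build tags in the shape CreateOrUpdateTags expects for one ASG."""
    def tag(key: str, value: str) -> dict:
        return {
            "ResourceId": asg_name,
            "ResourceType": "auto-scaling-group",
            "Key": key,
            "Value": value,
            # Autodiscovery tags only need to live on the ASG itself,
            # not on launched instances.
            "PropagateAtLaunch": False,
        }

    return [
        tag("k8s.io/cluster-autoscaler/enabled", "true"),
        # Dynamic key containing the cluster name — the part CloudFormation
        # cannot express natively, hence the Lambda.
        tag(f"k8s.io/cluster-autoscaler/{cluster_name}", "owned"),
    ]
```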
- Prerequisites: add repo clone instruction (git clone milaboratory/platforma-helm)
  so users know where infrastructure/aws/ comes from before Step 3
- Step 1: note to record the stack name — needed in Cleanup section
- Step 4: explain that platforma-license secret name and MI_LICENSE key are
  required by the chart and must not be changed
- Step 5: make directory instruction unconditional (explicit cd code block)
  rather than conditional "if you changed directories"
- Step 5: add callout after helm install explaining why --set-json is used
  for listen-ports (JSON array) and why tls.secretName="" is intentionally empty
- Cleanup: add inline fallback command to retrieve the stack name if forgotten
…mand

secretName defaults to empty in values-aws-s3.yaml; explicit --set is
redundant and raised shell quoting concerns.
- namespace parameter: warn that all CLI commands assume default 'platforma'
- Step 5 REGION variable: remove 'same value as Step 2' implication
- cd commands: clarify path is relative to git clone location
- port-forward: add kubectl get svc fallback for service name discovery
- cleanup: clarify EFS mount targets vs filesystem in deletion warning
- prerequisites: add Unix shell requirement (bash/zsh; WSL/CloudShell for Windows)
- files table: add eksctl-cluster.yaml entry
Lambda tagger now sets Name=<ClusterName>/<NodegroupName> on each ASG
with PropagateAtLaunch=true, so instances appear with descriptive names
in the EC2 console (e.g. platforma-cluster/system, platforma-cluster/batch-medium).
mike-ainsel and others added 15 commits March 4, 2026 00:37
…ilaboratory/platforma-helm into vladimir_antropov/MILAB-5645-aws
Remove manual ASG tagging loop from the infra deployer. EKS
automatically tags managed nodegroup ASGs with eks:cluster-name,
k8s.io/cluster-autoscaler/enabled, etc. — no need to replicate this
in CodeBuild.

Switch autoscaler autodiscovery from key-based matching
(tag key contains cluster name) to value-based matching
(tag=eks:cluster-name=<cluster-name>), which relies entirely on
EKS-managed tags.

Also removes the now-unnecessary IAM permissions
(eks:ListNodegroups, eks:DescribeNodegroup, autoscaling:CreateOrUpdateTags)
from the deployer role and cleans up redundant autoscaler tags from
nodegroup resources and the launch template.
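In values terms, the switch from key-based to value-based autodiscovery might look like this (assuming the upstream cluster-autoscaler chart's `autoDiscovery.tags` key; the cluster name is a placeholder):

```yaml
cluster-autoscaler:
  autoDiscovery:
    # Value-based match on the tag EKS applies to every managed
    # nodegroup ASG automatically — no custom tagging needed.
    tags:
      - eks:cluster-name=platforma-cluster
```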
…ilaboratory/platforma-helm into vladimir_antropov/MILAB-5645-aws
Chart defaults to 8 CPU request, but m6a.2xlarge nodes only have
~7910m allocatable after kubelet reservation. Set request to 4 CPU
(limit 8) so the pod can be scheduled on system nodes.
…ilaboratory/platforma-helm into vladimir_antropov/MILAB-5645-aws
- Add vpc-cni, kube-proxy, coredns as AWS::EKS::Addon for version
  governance and upgrade safety
- Rewrite LDAP config to use YAML values file instead of --set-string
  flags, fixing comma-splitting (LDAP DNs) and word-splitting (passwords)
- Move image override to YAML values file for consistency
- Switch Platforma helm deploy from --wait to --atomic
- Improve DataLibrary region parameter descriptions for cross-region clarity
Add AllowedPattern regex validation to 13 CF parameters:
- ClusterName: alphanumeric, hyphens, underscores
- LdapServer: require ldap:// or ldaps:// scheme
- S3BucketName + DataLibrary{1,2,3}Bucket: S3 naming rules
- DataLibrary{1,2,3}Region: AWS region format (incl. GovCloud)
- DataLibrary{1,2,3}AccessKey: AKIA/ASIA prefix + 16 chars
- VpcId: vpc- prefix + hex

All patterns allow empty strings where the parameter is optional.
Catches typos and format errors at stack creation time instead of
failing 20+ minutes later in CodeBuild.
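As a sketch, two of the patterns described above could be expressed like this (illustrative regexes, not necessarily the exact ones in the template):

```yaml
Parameters:
  ClusterName:
    Type: String
    AllowedPattern: '^[A-Za-z0-9][A-Za-z0-9_-]*$'
    ConstraintDescription: Alphanumeric, hyphens, and underscores only.

  S3BucketName:
    Type: String
    # Leading ^$| alternative allows the empty string, since the
    # parameter is optional.
    AllowedPattern: '^$|^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
    ConstraintDescription: Must follow S3 bucket naming rules, or be empty.
```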
…ilaboratory/platforma-helm into vladimir_antropov/MILAB-5645-aws
- Add optimized screenshots (resized to 1200px, ~50% smaller)
- Replace single cf-parameters.png with 4 section-specific screenshots
- Add cf-outputs.png and desktop-app.png
- Sync advanced-installation instance types: m5→m6a, r5→r6a
- Sync pool names: batch-medium/large/xlarge → batch-16c-64g/32c-128g/etc
- Sync autoscaler tags: k8s.io/cluster-autoscaler → eks:cluster-name
- Bump default PlatformaVersion to 3.0.0-rc.19
eksctl-cluster.yaml:
- Instance types: m5→m6a, r5→r6a (match CF)
- Pool names: batch-medium/large/xlarge → batch-16c-64g/32c-128g/etc
- Max sizes: match CF "small" tier (4/2/1/2/1)
- Remove k8s.io/cluster-autoscaler/* tags (EKS auto-tags managed node groups)
- Add platforma tag, pool-specific nodegroup-type tags

values-aws-s3.yaml:
- Add kueue.maxJobResources (62 CPU / 500Gi)
- Add app.resources (requests 4/16Gi, limits 8/16Gi) matching CF
- Add jobServiceAccount section
- Update Kueue quotas to match CF "small" tier
- CF defaults: DeployPlatforma true, PlatformaVersion 3.0.0
- Advanced guide: admin→platforma username, add --atomic --timeout 15m
- Advanced guide: fix "Helm sub-charts" → "installed via CodeBuild"
- Advanced guide: update Step 11 connect instructions
- values-aws-s3.yaml: admin→platforma in comment
- README: reorder sections, sync defaults with CF
Without Version: !Ref KubernetesVersion on node groups,
CloudFormation skips node group updates when KubernetesVersion
changes. Also add KubernetesVersion to TriggerHelmDeploy
properties so version changes auto-trigger the helm deployer.

Tested on PlatformaProd35: EKS 1.34→1.35 upgrade succeeded
with control plane, all node groups, and autoscaler v1.35.0.
- External DNS: zoneIdFilters is silently ignored by the chart.
  Use extraArgs[0]=--zone-id-filter instead (both CF and guide).
- ALB Controller: IAM policy URL was v2.11.0 but chart is v3.0.0.
  Updated to download v3.0.0 policy.
- Swapped Steps 4/5 in advanced guide: install External DNS before
  ALB Controller to avoid mutating webhook race condition.
- Added --atomic to autoscaler and External DNS helm installs.
- Added --set region and vpcId to ALB Controller install.
- Removed unused DOMAIN_FILTER variable.
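The External DNS workaround amounts to bypassing the ignored values key and passing the controller flag directly (zone ID below is a placeholder):

```yaml
external-dns:
  # zoneIdFilters: [...] is silently ignored by this chart version;
  # pass the controller flag via extraArgs instead.
  extraArgs:
    - --zone-id-filter=Z0123456789ABCDEFGHIJ
```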
Clear-writing pass on README.md and advanced-installation.md:
active voice, cut redundancy, sharpen vague phrasing.

Bug fixes from blind agent review:
- S3 bucket creation fails in us-east-1 (LocationConstraint must be omitted)
- Autoscaler image tag v1.34.0 → v1.34.2 (matches chart 9.53.0 default)
- Add password placeholder warning in Step 10
- Add session variable recovery block before Step 10
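The us-east-1 fix follows from an S3 API quirk: `CreateBucket` rejects an explicit `LocationConstraint` of `us-east-1`, so the configuration block must be omitted entirely in that region. A testable sketch of the logic (helper name is hypothetical; the dict mirrors boto3 `create_bucket` kwargs):

```python
def create_bucket_args(bucket: str, region: str) -> dict:
    """Build CreateBucket arguments; us-east-1 must omit LocationConstraint."""
    args = {"Bucket": bucket}
    if region != "us-east-1":
        # Any other region must be named explicitly.
        args["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return args
```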
dbolotin and others added 12 commits March 6, 2026 17:46
… validation

- HostedZoneId: use AWS::Route53::HostedZone::Id type for console dropdown
- Remove NoEcho from LicenseKey and DataLibrary AccessKey params (not secrets)
- Add Rules to validate DataLibrary AccessKey/SecretKey must be set together

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
S3 buckets are retained on stack deletion for data safety.
With deterministic names (platforma-<cluster>-<account>), retrying
after a failed deployment collides with the retained bucket.

CF template: derive 8-char hex suffix from AWS::StackId UUID.
Advanced guide: use openssl rand -hex 4.
Both produce unique names per deployment attempt.

Also adds S3BucketOutput to CF Outputs and bucket name recovery
to both the Step 10 and Cleanup sections in the advanced guide.
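Deriving the suffix from `AWS::StackId` can be sketched with nested intrinsics — the stack ID's trailing UUID starts with an 8-char hex segment (resource and parameter names here are illustrative):

```yaml
Resources:
  PlatformaBucket:
    Type: AWS::S3::Bucket
    Properties:
      # AWS::StackId = arn:aws:cloudformation:<region>:<acct>:stack/<name>/<uuid>
      # Split on '/' -> take the UUID, split on '-' -> take its first
      # 8-hex-char segment, unique per stack creation attempt.
      BucketName: !Join
        - '-'
        - - !Sub 'platforma-${ClusterName}'
          - !Select [0, !Split ['-', !Select [2, !Split ['/', !Ref 'AWS::StackId']]]]
```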
All node groups updated across CF template, eksctl template,
README, and advanced installation guide.
…tputs

- UsersPasswordSSMConsole: direct link to SSM parameter page
- DefaultUsername: show the auto-generated username (platforma)
- All three password/username outputs conditional on AuthMethod=htpasswd
maxIO performance mode cannot be used with Elastic throughput mode
(our default). Hardcode generalPurpose and remove the parameter.
…' into vladimir_antropov/MILAB-5645-aws

# Conflicts:
#	infrastructure/aws/README.md
Diagram: remove Kueue intermediate node (show pools directly),
define S3 node before desktop link so cylinder shape renders,
keep EBS inside EKS subgraph.

Outputs: add DefaultUsername, UsersPasswordSSMConsole.
Fix stale S3 bucket description (AccountId → random).
LDAP distinguished names contain commas (e.g. cn=users,ou=groups,dc=example,dc=com).
The buildspec split search rules on commas, destroying DNs. Switch to semicolons
so multi-rule configs work without breaking DN syntax.

Also update parameter description with group filter example.
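The semicolon split preserves commas inside each DN, which the old comma split destroyed. A minimal sketch of the buildspec's splitting logic (function name is hypothetical):

```python
def split_search_rules(raw: str) -> list[str]:
    """Split multi-rule LDAP config on ';' so DN-internal commas survive."""
    return [rule.strip() for rule in raw.split(";") if rule.strip()]
```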
mike-ainsel changed the base branch from feat/new-chart to main on March 7, 2026 12:17