Support LNC with trn2 #159

Merged: 3 commits, merged Feb 4, 2025

movence (Contributor) commented Jan 9, 2025

LNC (Logical Neuron Cores), a concept that represents multiple physical Neuron cores as a single logical Neuron core, is a new feature supported on trn2. It requires a new volume mount so that neuron-monitor can read the LNC configuration.
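To make the grouping idea concrete, here is a minimal Go sketch of how a logical-core factor folds physical cores into logical ones. It is an illustration only: the helper `logicalCores`, the contiguous grouping (cores 0-1 become logical core 0, and so on), and the example core count of 8 are all assumptions, not the Neuron runtime's actual mapping or trn2's actual topology.

```go
package main

import "fmt"

// logicalCores is a hypothetical helper. It groups physical core indices
// into logical cores of size lnc, ASSUMING contiguous grouping
// (0-1 -> logical 0, 2-3 -> logical 1, ...). The real Neuron mapping may differ.
func logicalCores(physical, lnc int) ([][]int, error) {
	// Check lnc first so the modulo below never divides by zero.
	if lnc <= 0 || physical%lnc != 0 {
		return nil, fmt.Errorf("physical core count %d is not divisible by LNC %d", physical, lnc)
	}
	groups := make([][]int, 0, physical/lnc)
	for start := 0; start < physical; start += lnc {
		group := make([]int, lnc)
		for i := range group {
			group[i] = start + i
		}
		groups = append(groups, group)
	}
	return groups, nil
}

func main() {
	// Example values only: 8 physical cores grouped with an LNC factor of 2.
	groups, err := logicalCores(8, 2)
	if err != nil {
		panic(err)
	}
	for i, g := range groups {
		fmt.Printf("logical core %d -> physical cores %v\n", i, g)
	}
}
```

With these example values the program prints four lines, starting with `logical core 0 -> physical cores [0 1]`. A monitoring agent that reports per-core metrics needs to know this factor, which is why neuron-monitor has to read the LNC configuration.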

Description of changes:

  • Bump neuron-monitor image to 1.3.0
  • Add new volume mount for /opt
  • Update GOMEMLIMIT to 320MiB
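GOMEMLIMIT is the Go runtime's soft memory limit: it guides the garbage collector rather than capping memory, and the Go GC guide suggests leaving some headroom below a container's hard limit. The minimal sketch below converts the new GOMEMLIMIT value and the container's `memory: 256Mi` limit from the describe output into bytes so the two can be compared directly. It assumes the process reading GOMEMLIMIT is a Go binary. `parseMemLimit` and `parseQuantity` are hypothetical helpers that handle only the binary suffixes needed here (GOMEMLIMIT's `B`/`KiB`/`MiB`/`GiB`/`TiB` and Kubernetes' `Ki`/`Mi`/`Gi`/`Ti`), not the full grammar of either format. Inside a running Go process, `debug.SetMemoryLimit(-1)` from `runtime/debug` reports the effective limit without changing it.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// unit pairs a suffix with its multiplier in bytes.
type unit struct {
	suffix string
	mult   int64
}

// parseWithUnits strips the first matching suffix and scales the number.
// Suffixes are checked in order, so longer ones must come before shorter
// ones they end with (e.g. "MiB" before "B").
func parseWithUnits(s string, units []unit) (int64, error) {
	for _, u := range units {
		if strings.HasSuffix(s, u.suffix) {
			n, err := strconv.ParseInt(strings.TrimSuffix(s, u.suffix), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * u.mult, nil
		}
	}
	// No suffix: treat the value as plain bytes.
	return strconv.ParseInt(s, 10, 64)
}

// parseMemLimit (hypothetical) handles GOMEMLIMIT-style values such as "320MiB".
func parseMemLimit(s string) (int64, error) {
	return parseWithUnits(s, []unit{
		{"TiB", 1 << 40}, {"GiB", 1 << 30}, {"MiB", 1 << 20}, {"KiB", 1 << 10}, {"B", 1},
	})
}

// parseQuantity (hypothetical) handles binary Kubernetes quantities such as "256Mi".
func parseQuantity(s string) (int64, error) {
	return parseWithUnits(s, []unit{
		{"Ti", 1 << 40}, {"Gi", 1 << 30}, {"Mi", 1 << 20}, {"Ki", 1 << 10},
	})
}

func main() {
	goLimit, err := parseMemLimit("320MiB") // GOMEMLIMIT from this PR
	if err != nil {
		panic(err)
	}
	containerLimit, err := parseQuantity("256Mi") // limits.memory in the describe output
	if err != nil {
		panic(err)
	}
	// Prints: 335544320 268435456 false
	fmt.Println(goLimit, containerLimit, goLimit < containerLimit)
}
```

With the values in this PR, 320MiB is 335544320 bytes and 256Mi is 268435456 bytes, so the soft limit sits above the container limit shown in the test deployment.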

Tested by deploying the changes to a test cluster:

# describe neuron monitor (some truncated)

Containers:
  neuron-monitor:
    Image:         public.ecr.aws/neuron/neuron-monitor:1.3.0
    Image ID:      public.ecr.aws/neuron/neuron-monitor@sha256:[sha]
    Port:          8000/TCP
    Host Port:     0/TCP
    ...
    State:          Running
      Started:      Thu, 09 Jan 2025 13:03:44 -0500
    Ready:          True
    Limits:
      cpu:     500m
      memory:  256Mi
    Requests:
      cpu:     256m
      memory:  128Mi
    Environment:
      NODE_NAME:    (v1:spec.nodeName)
      PATH:        /usr/local/bin:/usr/bin:/bin:/opt/aws/neuron/bin
      GOMEMLIMIT:  320MiB
    Mounts:
      /etc/amazon-cloudwatch-observability-neuron-cert/ from neurontls (ro)
      /etc/neuron-monitor-config from neuron-monitor-config (rw)
      /opt-aws from aws-config (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hxmsx (ro)
    Volumes:
      aws-config:
        Type:          HostPath (bare host directory volume)
        Path:          /opt/aws
        HostPathType:


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

mounchin (Contributor) left a comment

movence merged commit 497e498 into main on Feb 4, 2025 (3 checks passed)
movence deleted the trn2-support branch on February 4, 2025 17:27
mitali-salvi mentioned this pull request on Feb 4, 2025

4 participants