
Parsing default IMEX info fails for legacy images #797

Open
astefanutti opened this issue Nov 14, 2024 · 7 comments
Labels
bug Issue/PR to expose/discuss/fix a bug

Comments

@astefanutti

Since the latest 1.17.x versions, containers with images considered "legacy" and that do not have the NVIDIA_IMEX_CHANNELS environment variable set fail to start with the following error:

Error: container create failed: time="2024-11-13T16:24:41Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: error parsing IMEX info: unsupported IMEX channel value: all\n" 

It seems the NVIDIA_IMEX_CHANNELS environment variable defaults to all here for "legacy" images:

return NewVisibleDevices("all")

This value cannot be parsed by https://github.com/NVIDIA/libnvidia-container/blob/63d366ee3b4183513c310ac557bf31b05b83328f/src/cli/common.c#L446.

An occurrence of that issue has been reported here for example: pytorch/test-infra#5852.

This case should ideally be handled more gracefully.
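For illustration, the failure mode can be reproduced with a minimal parser sketch. This is hypothetical Go code, not the actual libnvidia-container implementation (which is C): a parser that only accepts comma-separated numeric channel IDs will necessarily reject the literal all that the toolkit injects by default.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIMEXChannels mimics (in spirit) a parser that only accepts
// comma-separated numeric channel IDs. The default value "all" is
// not numeric, so parsing fails with an error.
func parseIMEXChannels(value string) ([]int, error) {
	if value == "" {
		return nil, nil // no channels requested
	}
	var channels []int
	for _, field := range strings.Split(value, ",") {
		id, err := strconv.Atoi(field)
		if err != nil {
			return nil, fmt.Errorf("unsupported IMEX channel value: %s", field)
		}
		channels = append(channels, id)
	}
	return channels, nil
}

func main() {
	// Reproduces the shape of the reported error.
	if _, err := parseIMEXChannels("all"); err != nil {
		fmt.Println("error parsing IMEX info:", err)
	}
	// Numeric channel lists parse fine.
	ids, _ := parseIMEXChannels("0,1")
	fmt.Println("parsed channels:", ids)
}
```

A graceful fix would be to either treat all as a recognized sentinel in the parser, or to stop injecting a default value the parser cannot handle.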

@higi

higi commented Nov 15, 2024

Can anyone help me with this?

2024-11-15T11:55:12Z create container gshaibi/gpu-burn:latest
2024-11-15T11:55:13Z latest Pulling from gshaibi/gpu-burn
2024-11-15T11:55:13Z Digest: sha256:ed07993b0581228c2bd7113fae0ed214549547f0fa91ba50165bc2473cfaf979
2024-11-15T11:55:13Z Status: Image is up to date for gshaibi/gpu-burn:latest
2024-11-15T11:55:14Z start container for gshaibi/gpu-burn:latest: begin
2024-11-15T11:55:14Z error starting container: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: error parsing IMEX info: unsupported IMEX channel value: all: unknown
2024-11-15T11:55:30Z start container for gshaibi/gpu-burn:latest: begin
2024-11-15T11:55:30Z error starting container: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: error parsing IMEX info: unsupported IMEX channel value: all: unknown

tested on NVIDIA Container Toolkit CLI version 1.17.1

@markjolah

Possible workaround (WAR): set NVIDIA_IMEX_CHANNELS to 0 or to an empty string.

docker run ... -e NVIDIA_IMEX_CHANNELS=0 ...

Or, in a Kubernetes Pod spec, set:

    env:
    - name: NVIDIA_IMEX_CHANNELS
      value: "0"

@elezar
Member

elezar commented Nov 15, 2024

We have just released v1.17.2 that should address this issue. Please let us know if the problem persists.

@elezar elezar added the bug Issue/PR to expose/discuss/fix a bug label Nov 15, 2024
@higi

higi commented Nov 16, 2024

> We have just released v1.17.2 that should address this issue. Please let us know if the problem persists.

I am now on v1.17.2, but I think this is a different problem. nvidia-smi shows all GPUs without any problem.

2024-11-16T09:26:40.108376033Z Failed to initialize NVML: Unknown Error
2024-11-16T09:26:40.207661945Z terminate called after throwing an instance of 'std::string'
2024-11-16T09:26:40.302441418Z No CUDA devices
2024-11-16T09:26:45.770657675Z Failed to initialize NVML: Unknown Error
2024-11-16T09:26:45.855912077Z terminate called after throwing an instance of 'std::string'
2024-11-16T09:26:45.957591526Z No CUDA devices

@higi

higi commented Nov 16, 2024

Fixed it by editing /etc/nvidia-container-runtime/config.toml (sudo vim /etc/nvidia-container-runtime/config.toml), setting no-cgroups = false, and saving.

I think version 1.17.2 fixed the problem with the IMEX channel. Many thanks for the quick fix!
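For reference, the setting in question lives under the [nvidia-container-cli] section of /etc/nvidia-container-runtime/config.toml. A minimal sketch of the relevant fragment (other keys omitted):

```toml
[nvidia-container-cli]
# When set to true, the CLI skips cgroup-based device access management,
# which can manifest as "Failed to initialize NVML: Unknown Error"
# inside containers. Setting it back to false restores the default behavior.
no-cgroups = false
```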

@elezar
Member

elezar commented Nov 29, 2024

@higi do you know why no-cgroups was set to true?

@higi

higi commented Nov 29, 2024

> @higi do you know why no-cgroups was set to true?

I think it was just for an NVML test, for AI tools. The NVML test doesn't work without setting this.

This script fixed it: wget https://raw.githubusercontent.com/jjziets/vasttools/main/nvml_fix.py. Anyway, the IMEX channel error was fixed by your fix.
