Add mi300 runner for toy_llama tests #19961

Open
wants to merge 4 commits into
base: main
Changes from 2 commits
12 changes: 11 additions & 1 deletion .github/workflows/pkgci_test_sharktank.yml
@@ -21,6 +21,8 @@ jobs:
   test_sharktank_models:
     name: "test_sharktank_models :: ${{ matrix.name }}"
     runs-on: ${{ matrix.runs-on }}
+    # Dynamically assign container for `linux-mi300-gpu-1`
+    container: ${{ fromJSON(toJSON(matrix.container || {})) }}
Member

There is a syntax error here: https://github.com/iree-org/iree/actions/runs/13272021471

Invalid workflow file: .github/workflows/pkgci.yml#L111
The workflow is not valid. In .github/workflows/pkgci.yml (Line: 111, Col: 11): Error from called workflow iree-org/iree/.github/workflows/pkgci_test_sharktank.yml@29df51be7e68c626460b68d7e9a80f654091adb8 (Line: 25, Col: 16): Unexpected symbol: '{}'. Located at position 37 within expression: fromJSON(toJSON(matrix.container || {}))
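
For context, GitHub Actions expressions accept only null, boolean, number, and string literals, so a bare `{}` cannot appear inside `${{ ... }}`. The sketch below is a rough local check for that pattern, not the parser GitHub uses. The helper name `find_brace_literals` and the regex-based approach are assumptions made for illustration, and it only handles expressions that sit on a single line.

```python
import re

# Captures the body of each `${{ ... }}` expression on a line (non-greedy).
EXPRESSION = re.compile(r"\$\{\{(.*?)\}\}")

# Single-quoted string literal, where '' is an escaped quote.
STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")


def find_brace_literals(workflow_text):
    """Return (line_number, expression) pairs whose body contains a brace
    outside of a quoted string."""
    hits = []
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        for match in EXPRESSION.finditer(line):
            body = match.group(1)
            # Ignore braces that only appear inside string literals.
            unquoted = STRING_LITERAL.sub("", body)
            if "{" in unquoted or "}" in unquoted:
                hits.append((number, body.strip()))
    return hits


if __name__ == "__main__":
    sample = "container: ${{ fromJSON(toJSON(matrix.container || {})) }}"
    for number, expression in find_brace_literals(sample):
        print(f"line {number}: brace outside a string in: {expression}")
```

Running it over the workflow file before pushing would flag the `container:` line from this diff while leaving expressions such as `${{ matrix.runs-on }}` alone.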

Author

Pushed what should be a fix

Member

So far so good 🤞

     strategy:
       fail-fast: false
       matrix:
@@ -30,11 +32,19 @@ jobs:
             gpu: none
             runs-on: ubuntu-24.04

-          - name: hip_task
+          - name: hip_task_w7900
             target: target_hip
             gpu: gfx1100
             runs-on: nodai-amdgpu-w7900-x86-64
+
+          - name: hip_task_mi300
+            target: target_hip
+            gpu: gfx942
+            runs-on: linux-mi300-gpu-1
Comment on lines +42 to +45
Member

Looks like these runners need GPU drivers, either preinstalled or via Docker?

https://github.com/iree-org/iree/actions/runs/13270996475/job/37050767466?pr=19961#step:8:330

ERROR iree-test-suites/sharktank_models/llama3.1/test_llama.py::test_prefill[hip] - RuntimeError: Error creating driver: iree/runtime/src/iree/hal/drivers/hip/dynamic_symbols.c:160: UNAVAILABLE; HIP runtime library 'amdhip64.dll'/'libamdhip64.so' not available: please ensure installed and in dynamic library search path: 
  Tried: libamdhip64.so
    iree/runtime/src/iree/base/internal/dynamic_library_posix.c:165: NOT_FOUND; failed to load dynamic library (possibly not found on any search path): libamdhip64.so: cannot open shared object file: No such file or directory

See what this other workflow does:

jobs:
  test_mi300:
    runs-on: linux-mi300-gpu-1
    container:
      image: rocm/dev-ubuntu-22.04:6.3
      options: --user root --device=/dev/kfd --device=/dev/dri --ipc=host --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined

cc @yamiyysu
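
The `libamdhip64.so` failure and the `--device` flags in the container options can both be checked from Python before the tests run. This is a minimal sketch, not part of the PR: the function names are hypothetical, and it assumes a Linux runner where `ctypes.CDLL` goes through the same dynamic loader search path that the IREE HIP driver relies on.

```python
import ctypes
import os
import shlex

# Container options copied from the workflow snippet quoted in this thread.
CONTAINER_OPTIONS = (
    "--user root --device=/dev/kfd --device=/dev/dri --ipc=host "
    "--group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined"
)


def can_load_library(name):
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


def device_paths(options):
    """Return the host paths passed as --device=PATH in a Docker options string."""
    return [
        token.split("=", 1)[1]
        for token in shlex.split(options)
        if token.startswith("--device=")
    ]


def missing_devices(options):
    """Return the --device paths that do not exist on this machine."""
    return [path for path in device_paths(options) if not os.path.exists(path)]


if __name__ == "__main__":
    print("libamdhip64.so loadable:", can_load_library("libamdhip64.so"))
    print("missing device nodes:", missing_devices(CONTAINER_OPTIONS))
```

`/dev/kfd` is the ROCm compute device and `/dev/dri` holds the GPU render nodes. If either shows up as missing on the host, passing it with `--device` will not help, and if the library still fails to load inside the container, the image likely lacks the ROCm runtime.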

Author

Added the lines to register a container for the mi300 runner.

Member

Still missing some deps, like these:

- name: "Install dependencies"
  run: |
    sudo apt-get update
    sudo apt-get install -y cmake ninja-build clang lld git

Trying to install those will run into #19955 though.
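
One way to see which of those tools a runner image already provides, without touching `apt-get`, is to look them up on `PATH`. A minimal sketch follows. The mapping from package names to executable names (`ninja-build` to `ninja`, `lld` to `ld.lld`) is an assumption about the Ubuntu packages, and `missing_tools` is a hypothetical helper.

```python
import shutil

# Executables assumed to come from the packages in the install step:
# cmake, ninja-build, clang, lld, git.
REQUIRED_TOOLS = ["cmake", "ninja", "clang", "ld.lld", "git"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found")
```

An empty result would mean the install step can be skipped on that image, which sidesteps the `apt-get` problem rather than fixing it.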

Member

Author

Yeah, latest run hit that error

+            container:
+              image: rocm/dev-ubuntu-22.04:6.3
+              options: --user root --device=/dev/kfd --device=/dev/dri --ipc=host --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined

     env:
       VENV_DIR: ${{ github.workspace }}/venv
     steps: