
Error attempting to trace iree-run-module #19904

Open
stbaione opened this issue Feb 4, 2025 · 2 comments
Labels
bug 🐞 Something isn't working · runtime/tools IREE's runtime tooling (iree-run-module, iree-benchmark-module, etc) · runtime Relating to the IREE runtime library

Comments


stbaione commented Feb 4, 2025

What happened?

I'm getting the following error when attempting to use iree-tracy-capture across multiple iree-run-module invocations with varying weights and inputs:

Connecting to 127.0.0.1:8086...iree-tracy-capture: /home/stbaione/repos/SHARK-Platform/iree/third_party/tracy/server/TracyWorker.cpp:4817: void tracy::Worker::ProcessZoneBeginCallstack(const tracy::QueueZoneBegin &): Assertion `it != m_nextCallstack.end()' failed.
Aborted (core dumped)

Steps to reproduce your issue

The weight files in /data/llama3.1/weights/8b/fp16/tp8/ on mi300x-3 or /shark_dev/data/llama3.1/weights/8b/fp16/tp8/ on mi300x can be used.

  1. Download the mlir
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_tracy_issue/llama3.1_8b_tp8.mlir
  2. Download inputs
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_tracy_issue/tokens.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_tracy_issue/seq_ids.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_tracy_issue/seq_block_ids.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_tracy_issue/cache_state_shard_0.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_tracy_issue/cache_state_shard_1.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_tracy_issue/cache_state_shard_2.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_tracy_issue/cache_state_shard_3.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_tracy_issue/cache_state_shard_4.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_tracy_issue/cache_state_shard_5.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_tracy_issue/cache_state_shard_6.npy
wget https://sharkpublic.blob.core.windows.net/sharkpublic/stephen/llama3.1_tracy_issue/cache_state_shard_7.npy
  3. Build IREE
cmake -G Ninja \
  -B ../iree-build/ \
  -S . \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DIREE_ENABLE_RUNTIME_TRACING=ON \
  -DIREE_ENABLE_ASSERTIONS=ON \
  -DIREE_ENABLE_SPLIT_DWARF=ON \
  -DIREE_ENABLE_THIN_ARCHIVES=ON \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DIREE_ENABLE_LLD=ON \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
  -DIREE_HAL_DRIVER_HIP=ON \
  -DIREE_TARGET_BACKEND_ROCM=ON \
  -DIREE_BUILD_PYTHON_BINDINGS=ON \
  -DPython3_EXECUTABLE="$(which python)" \
  -DIREE_BUILD_TRACY=ON \
  -DIREE_TRACING_MODE=4 \
  -DIREE_LINK_COMPILER_SHARED_LIBRARY=OFF
cmake --build ../iree-build
  4. Compile to VMFB
iree-compile llama3.1_8b_tp8.mlir -o llama3.1_8b_tp8.vmfb \
  --iree-hal-target-device=hip[0] \
  --iree-hal-target-device=hip[1] \
  --iree-hal-target-device=hip[2] \
  --iree-hal-target-device=hip[3] \
  --iree-hal-target-device=hip[4] \
  --iree-hal-target-device=hip[5] \
  --iree-hal-target-device=hip[6] \
  --iree-hal-target-device=hip[7] \
  --iree-hip-target=gfx942 \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions
  5. Start tracy-capture
../iree-build/tracy/iree-tracy-capture -o 8b_run_module.tracy
  6. Run iree-run-module
ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 iree-run-module --hip_use_streams=true --module=llama3.1_8b_tp8.vmfb \
 --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.irpa \
 --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank0.irpa \
 --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank1.irpa \
 --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank2.irpa \
 --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank3.irpa \
 --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank4.irpa \
 --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank5.irpa \
 --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank6.irpa \
 --parameters=model=/data/llama3.1/weights/8b/fp16/tp8/llama3.1_8b_fp16_tp8_parameters.rank7.irpa \
 --device=hip://0 --device=hip://1 --device=hip://2 --device=hip://3 --device=hip://4 --device=hip://5 --device=hip://6 --device=hip://7 \
 --function=prefill_bs4 \
 [email protected] \
 --input=@seq_ids.npy \
 --input=@seq_block_ids.npy \
 --input=@cache_state_shard_0.npy \
 --input=@cache_state_shard_1.npy \
 --input=@cache_state_shard_2.npy \
 --input=@cache_state_shard_3.npy \
 --input=@cache_state_shard_4.npy \
 --input=@cache_state_shard_5.npy \
 --input=@cache_state_shard_6.npy \
 --input=@cache_state_shard_7.npy
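For short-lived profilees like iree-run-module, a common pattern is to capture each invocation into its own trace file and keep the process alive until the capture connection has drained. This is a workaround sketch, not part of the original report: it assumes Tracy's standard `TRACY_NO_EXIT=1` client option and the capture tool's `-o`/`-f` flags; the remaining iree-run-module flags are the same as in the command above.

```shell
# Sketch: one trace file per run. TRACY_NO_EXIT=1 blocks process exit until
# the profiler has connected and received all data; -f overwrites an
# existing output file. Flags elided here are the same as in step 6 above.
for i in 1 2 3; do
  ../iree-build/tracy/iree-tracy-capture -f -o "8b_run_module_${i}.tracy" &
  capture_pid=$!
  TRACY_NO_EXIT=1 ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    iree-run-module --hip_use_streams=true --module=llama3.1_8b_tp8.vmfb \
    --function=prefill_bs4 "$@"  # plus the --parameters/--device/--input flags
  wait "$capture_pid"
done
```

Capturing per-run avoids the capture tool having to stitch zones across separate client connections, which is where the `m_nextCallstack` assertion fires.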

What component(s) does this issue relate to?

Other

Version information

eb19497

Additional context

No response

@stbaione stbaione added the bug 🐞 Something isn't working label Feb 4, 2025
@sogartar sogartar added runtime Relating to the IREE runtime library runtime/tools IREE's runtime tooling (iree-run-module, iree-benchmark-module, etc) labels Feb 4, 2025

stbaione commented Feb 4, 2025

Rebuilt with the following build config:

cmake -G Ninja \
  -B ../iree-build/ \
  -S . \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DIREE_ENABLE_RUNTIME_TRACING=ON \
  -DIREE_ENABLE_ASSERTIONS=ON \
  -DIREE_ENABLE_SPLIT_DWARF=ON \
  -DIREE_ENABLE_THIN_ARCHIVES=ON \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DIREE_ENABLE_LLD=ON \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
  -DIREE_HAL_DRIVER_HIP=ON \
  -DIREE_TARGET_BACKEND_ROCM=ON \
  -DIREE_BUILD_PYTHON_BINDINGS=ON \
  -DPython3_EXECUTABLE="$(which python)" \
  -DIREE_BUILD_TRACY=ON

And tracing seems to be working again. Maybe the failure had to do with -DIREE_TRACING_MODE=4?
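For anyone comparing the two builds: the feature set selected by each `IREE_TRACING_MODE` level (callstacks, allocation tracking, etc.) is derived in the runtime's tracing header. The path below is assumed from the IREE source layout; a quick grep from the repository root shows which features each mode level turns on:

```shell
# Assumed path (IREE source tree layout): inspect how IREE_TRACING_MODE maps
# to individual IREE_TRACING_FEATURE_* bits.
grep -n "IREE_TRACING_MODE" runtime/src/iree/base/tracing.h
```

Comparing the feature bits enabled at mode 4 versus the default mode may narrow down which feature (e.g. callstack capture) trips the Tracy worker assertion.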


stbaione commented Feb 4, 2025

Weirdly, -DIREE_TRACING_MODE=4 also seems to cause a compile issue:

llama_2_3.mlir:3:1: error: cannot find ROCM bitcode files. Check your installation consistency and in the worst case, set --iree-hip-bc-dir= to a path on your system.
module @module {
^
llama_2_3.mlir:3:1: note: see current operation: 

I started hitting that error, and when I rebuilt without that CMake option, compilation started working again.
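If the bitcode lookup is the actual failure, the error message itself suggests passing `--iree-hip-bc-dir=` explicitly. This is an assumed workaround, not something verified in this report, and `/opt/rocm/amdgcn/bitcode` is only the usual device-library location on a stock ROCm install:

```shell
# Assumed workaround: point iree-compile at the ROCm device-bitcode
# directory directly, as the error message suggests. Adjust the path to
# match your ROCm installation.
iree-compile llama3.1_8b_tp8.mlir -o llama3.1_8b_tp8.vmfb \
  --iree-hal-target-device=hip[0] --iree-hip-target=gfx942 \
  --iree-hip-bc-dir=/opt/rocm/amdgcn/bitcode
```

That said, since the error only appears with -DIREE_TRACING_MODE=4 set, the bitcode path may be a symptom of the build configuration rather than a missing ROCm install.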
