
Llama 3.1 Instruct 8B TP8 with python package not reporting anything with iree-run-module/iree-benchmark-module #19886

Open
aviator19941 opened this issue Feb 3, 2025 · 17 comments
Labels
bug 🐞 Something isn't working

Comments

@aviator19941
Contributor

aviator19941 commented Feb 3, 2025

What happened?

I'm trying to run Llama 3.1 Instruct 8B with a tensor parallelism size of 8 using the IREE Python package. I am able to compile and benchmark with a local build of IREE; however, the same command run with the iree-base-runtime Python package exits without reporting anything from `iree-run-module`/`iree-benchmark-module`. I'm not sure whether this is expected.

Steps to reproduce your issue

  1. pip install -f https://iree.dev/pip-release-links.html --upgrade --pre
    iree-base-compiler iree-base-runtime iree-turbine
  2. Command with python package iree-benchmark-module:
ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 iree-benchmark-module --hip_use_streams=true --module=f16_torch_128_tp8.vmfb --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank0.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank1.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank2.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank3.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank4.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank5.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank6.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank7.irpa --device=hip://0 --device=hip://1 --device=hip://2 --device=hip://3 --device=hip://4 --device=hip://5 --device=hip://6 --device=hip://7 --function=prefill_bs4 --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/tokens.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/seq_lens.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/seq_block_ids.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_0.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_1.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_2.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_3.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_4.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_5.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_6.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_7.npy --benchmark_repetitions=3
  3. The benchmark exits without any report:
2025-02-03T10:40:53-08:00
Running /home/nod/avsharma/SHARK-Platform/.venv_2_3/lib/python3.11/site-packages/iree/_runtime_libs/iree-benchmark-module
Run on (96 X 3810.79 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x96)
  L1 Instruction 32 KiB (x96)
  L2 Unified 1024 KiB (x96)
  L3 Unified 32768 KiB (x16)
Load Average: 2.52, 2.38, 2.23
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
  4. Command with the source-built iree-benchmark-module:
ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ../iree-build-no-trace/tools/iree-benchmark-module --hip_use_streams=true --module=f16_torch_128_tp8.vmfb --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank0.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank1.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank2.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank3.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank4.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank5.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank6.irpa --parameters=model=/shark-dev/8b/instruct/weights/tp8/llama3.1_8b_instruct_fp16_tp8.rank7.irpa --device=hip://0 --device=hip://1 --device=hip://2 --device=hip://3 --device=hip://4 --device=hip://5 --device=hip://6 --device=hip://7 --function=prefill_bs4 --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/tokens.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/seq_lens.npy --input=@/sharkdev/8b/prefill_args_bs4_128_stride_32_tp8/seq_block_ids.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_0.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_1.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_2.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_3.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_4.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_5.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_6.npy --input=@/shark-dev/8b/prefill_args_bs4_128_stride_32_tp8/cs_f16_shard_7.npy 
--benchmark_repetitions=3
  5. The benchmark reports output fine:
Running ../iree-build-no-trace/tools/iree-benchmark-module
Run on (96 X 3810.79 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x96)
  L1 Instruction 32 KiB (x96)
  L2 Unified 1024 KiB (x96)
  L3 Unified 32768 KiB (x16)
Load Average: 4.19, 6.13, 4.15
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_prefill_bs4/process_time/real_time              48.1 ms          454 ms           12 items_per_second=20.8012/s
BM_prefill_bs4/process_time/real_time              46.2 ms          453 ms           12 items_per_second=21.6276/s
BM_prefill_bs4/process_time/real_time              45.5 ms          463 ms           12 items_per_second=21.9751/s
BM_prefill_bs4/process_time/real_time_mean         46.6 ms          457 ms            3 items_per_second=21.468/s
BM_prefill_bs4/process_time/real_time_median       46.2 ms          454 ms            3 items_per_second=21.6276/s
BM_prefill_bs4/process_time/real_time_stddev       1.32 ms         5.66 ms            3 items_per_second=0.602985/s
BM_prefill_bs4/process_time/real_time_cv           2.84 %          1.24 %             3 items_per_second=2.81%

What component(s) does this issue relate to?

Runtime

Version information

iree-base-compiler-3.2.0rc20250203
iree-base-runtime-3.2.0rc20250203
iree-turbine-3.2.0rc20250203

Additional context

No response

@aviator19941 aviator19941 added the bug 🐞 Something isn't working label Feb 3, 2025
@aviator19941
Contributor Author

aviator19941 commented Feb 3, 2025

Directly running the iree-benchmark-module binary from the Python package (/home/nod/avsharma/SHARK-Platform/.venv_2_3/lib/python3.11/site-packages/iree/_runtime_libs/iree-benchmark-module) showed a Segmentation fault. This is the stack trace from it:

iree_vm_bytecode_dispatch (@iree_vm_bytecode_dispatch:497)
iree_vm_bytecode_module_begin_call (@iree_vm_bytecode_module_begin_call:259)
iree_vm_begin_invoke (@iree_vm_begin_invoke:287)
iree_vm_invoke (@iree_vm_invoke:29)
benchmark::internal::LambdaBenchmark<iree::(anonymous namespace)::RegisterGenericBenchmark(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, iree_hal_device_t*, iree_vm_context_t*, iree_vm_function_t, iree_vm_list_t*)::$_0>::Run(benchmark::State&) (@benchmark::internal::LambdaBenchmark<iree::(anonymous namespace)::RegisterGenericBenchmark(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, iree_hal_device_t*, iree_vm_context_t*, iree_vm_function_t, iree_vm_list_t*)::$_0>::Run(benchmark::State&):56)
benchmark::internal::BenchmarkInstance::Run(long, int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*) const (@benchmark::internal::BenchmarkInstance::Run(long, int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*) const:84)
benchmark::internal::(anonymous namespace)::RunInThread(benchmark::internal::BenchmarkInstance const*, long, int, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*) (@benchmark::internal::(anonymous namespace)::RunInThread(benchmark::internal::BenchmarkInstance const*, long, int, benchmark::internal::ThreadManager*, benchmark::internal::PerfCountersMeasurement*):23)
benchmark::internal::BenchmarkRunner::DoNIterations() (@benchmark::internal::BenchmarkRunner::DoNIterations():183)
benchmark::internal::BenchmarkRunner::DoOneRepetition() (@benchmark::internal::BenchmarkRunner::DoOneRepetition():89)
benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) (@benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>):566)
benchmark::RunSpecifiedBenchmarks() (@benchmark::RunSpecifiedBenchmarks():41)
main (@main:646)
__libc_start_call_main (@__libc_start_call_main:29)
__libc_start_main_impl (@__libc_start_main@@GLIBC_2.34:43)
_start (@_start:15)

It does not contain debug symbols, so we cannot debug the issue with this Python package. We need to be able to build the Python package exactly the same way the CI does, but with debug symbols, in order to debug this issue. The Python wrapper also prevents the segmentation fault from surfacing.

@aviator19941
Contributor Author

aviator19941 commented Feb 4, 2025

Specifying clang-17 and clang++-17 in the cmake build command results in a segfault:

cmake -G Ninja -B ../iree-build-test -S . -DCMAKE_BUILD_TYPE=Release -DIREE_RUNTIME_OPTIMIZATION_PROFILE=lto -DIREE_BUILD_PYTHON_BINDINGS=ON -DIREE_BUILD_COMPILER=OFF -DIREE_BUILD_SAMPLES=OFF -DIREE_BUILD_TESTS=OFF -DIREE_HAL_DRIVER_VULKAN=ON -DIREE_HAL_DRIVER_CUDA=ON -DIREE_HAL_DRIVER_HIP=ON -UIREE_EXTERNAL_HAL_DRIVERS -DIREE_ENABLE_CPUINFO=ON -DPython3_EXECUTABLE="$(which python3)" -DIREE_RUNTIME_BUILD_TRACY_TOOLS=ON -DIREE_RUNTIME_BUILD_TRACY=ON -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17 && cmake --build ../iree-build-test

@aviator19941
Contributor Author

aviator19941 commented Feb 4, 2025

clang-17, Release, with -DIREE_RUNTIME_OPTIMIZATION_PROFILE=lto: segfaults
clang-17, Release, without -DIREE_RUNTIME_OPTIMIZATION_PROFILE=lto: works
clang-17, RelWithDebInfo, with or without -DIREE_RUNTIME_OPTIMIZATION_PROFILE=lto: works

@aviator19941
Contributor Author

clang-14, Release, with -DIREE_RUNTIME_OPTIMIZATION_PROFILE=lto: works

@aviator19941
Contributor Author

@benvanik Do you have an idea of what could be causing this issue? I was thinking of running with the --trace_execution flag to see where the failure happens.

@benvanik
Collaborator

benvanik commented Feb 4, 2025

no clue, but that's where I'd start - if this were riscv I'd say something about unaligned data, but otherwise I haven't seen a stack like that before

@benvanik
Collaborator

benvanik commented Feb 4, 2025

(it's possible it's some corner case that has been lying in wait, if that input is complex enough - register exhaustion or something that's not getting caught properly - but --trace_execution is the easiest way to spot that too)

@aviator19941
Contributor Author

aviator19941 commented Feb 4, 2025

Ah yeah, I think the unfortunate thing is that it's only reproducible in Release mode, so I can't debug with --trace_execution (it's only available in Debug builds). I probably just need to add print statements along the stack trace above to track down the error.

@benvanik
Collaborator

benvanik commented Feb 4, 2025

runtime/src/iree/base/config.h line 272 - or add cflag -DIREE_VM_EXECUTION_TRACING_FORCE_ENABLE=1
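One way to apply that cflag suggestion without editing config.h is to pass the define through the compiler flags at configure time. A minimal sketch, assuming the same tree layout as the configure command earlier in the thread (the other -DIREE_* options from that command are omitted here for brevity and would still need to be appended):

```shell
# Sketch: force-enable VM execution tracing in a Release build by passing the
# define through CMAKE_C_FLAGS/CMAKE_CXX_FLAGS instead of editing config.h.
cmake -G Ninja -B ../iree-build-test -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=clang-17 \
  -DCMAKE_CXX_COMPILER=clang++-17 \
  -DCMAKE_C_FLAGS="-DIREE_VM_EXECUTION_TRACING_FORCE_ENABLE=1" \
  -DCMAKE_CXX_FLAGS="-DIREE_VM_EXECUTION_TRACING_FORCE_ENABLE=1"
cmake --build ../iree-build-test
```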

@aviator19941
Contributor Author

I have the trace_execution log here: https://gist.github.com/aviator19941/ec72f8c818a6f4cb4b056d44af3036f6

@benvanik
Collaborator

benvanik commented Feb 4, 2025

Interesting; this is likely a duplicate of #19795, which I was unable to figure out. In both this case and that one, an input argument is overwritten with null by something unidentifiable. I was unable to reproduce locally, and your compiler configurations may indicate why: I certainly wasn't building with Release/LTO.

@benvanik
Collaborator

benvanik commented Feb 5, 2025

(in that issue if I added printfs in random places or enabled ASAN it'd fix things - it definitely smells like undefined behavior or a compilation error to me)

@aviator19941
Contributor Author

Enabling UBSan also fixes things. I can see the hal.buffer_view output, although it does show runtime errors as well: https://gist.github.com/aviator19941/5e0fb6e8f51bb2ad60ffdead706aaa5a

@aviator19941
Contributor Author

aviator19941 commented Feb 6, 2025

@benvanik Is there a way to get debug symbols in Release mode? Or other cflags similar to -DIREE_VM_EXECUTION_TRACING_FORCE_ENABLE=1 that can show more debug info?

Smaller repro with Onnx test: https://github.com/iree-org/iree-test-suites/tree/main/onnx_ops/onnx/node/generated/test_eyelike_without_dtype

cmake -G Ninja -B ../iree-build-rel-opt -S . -DCMAKE_BUILD_TYPE=Release -DIREE_RUNTIME_OPTIMIZATION_PROFILE=lto -DIREE_BUILD_PYTHON_BINDINGS=ON -DIREE_BUILD_COMPILER=OFF -DIREE_BUILD_SAMPLES=OFF -DIREE_BUILD_TESTS=OFF -DIREE_HAL_DRIVER_VULKAN=ON -DIREE_HAL_DRIVER_CUDA=ON -DIREE_HAL_DRIVER_HIP=ON -UIREE_EXTERNAL_HAL_DRIVERS -DIREE_ENABLE_CPUINFO=ON -DPython3_EXECUTABLE="$(which python3)" -DIREE_RUNTIME_BUILD_TRACY_TOOLS=ON -DIREE_RUNTIME_BUILD_TRACY=ON -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17 && cmake --build ../iree-build-rel-opt

iree-compile model.mlir --iree-hip-target=gfx942 -o=model.vmfb --iree-hal-target-device=hip

/home/nod/avsharma/iree-build-rel-opt/tools/iree-run-module --module=model.vmfb --device=hip --function=test_eyelike_without_dtype --input=4x4xi32=@input_0.bin

@benvanik
Collaborator

benvanik commented Feb 6, 2025

RelWithDebInfo or adding -g to your compiler flags will add debug symbols

@aviator19941
Contributor Author

I wasn't able to repro the issue with RelWithDebInfo, only Release mode, so I'll try -g to see if that helps.

@aviator19941
Contributor Author

aviator19941 commented Feb 6, 2025

Another odd thing: using -g (-DCMAKE_CXX_FLAGS=-g -DCMAKE_C_FLAGS=-g) with Release, I wasn't able to repro the issue. Adding too many printfs also makes the segfault go away, as stated in Ben's issue above.

Using IREE_STATUS_MODE=3, I wasn't able to repro the issue either. It seems the LTO flags are specified here.

cmake -G Ninja -B ../iree-build-rel-opt -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DIREE_STATUS_MODE=3 \
  -DIREE_BUILD_PYTHON_BINDINGS=ON \
  -DIREE_BUILD_COMPILER=OFF \
  -DIREE_BUILD_SAMPLES=OFF \
  -DIREE_BUILD_TESTS=OFF \
  -DIREE_HAL_DRIVER_VULKAN=ON \
  -DIREE_HAL_DRIVER_CUDA=ON \
  -DIREE_HAL_DRIVER_HIP=ON \
  -UIREE_EXTERNAL_HAL_DRIVERS \
  -DIREE_ENABLE_CPUINFO=ON \
  -DPython3_EXECUTABLE="$(which python3)" \
  -DIREE_RUNTIME_BUILD_TRACY_TOOLS=ON \
  -DIREE_RUNTIME_BUILD_TRACY=ON \
  -DCMAKE_C_COMPILER=clang-17 \
  -DCMAKE_CXX_COMPILER=clang++-17

It seems like there might be a race condition somewhere? I'm running out of ideas on how to get debug symbols while still reproducing the segfault. Rob said there is some compiler magic that can dump symbols to a separate file in Release mode, but I'm not sure how to do it.
