Llama 3.1 Instruct 8B TP8 with python package not reporting anything with iree-run-module/iree-benchmark-module #19886
Comments
Directly specifying the iree-benchmark-module in the python package showed a Segmentation fault (
The package does not contain debug symbols, so we cannot debug the issue with it. To debug this, we need to be able to build the Python package exactly the way CI does, except with debug symbols. The Python wrapper also hides the segmentation fault. |
Specifying clang-17 and clang++-17 in the cmake build command results in a segfault:
|
@benvanik Do you have any idea what could be causing this issue? I was thinking of running with the --trace_execution flag to see where the failure happens. |
no clue, but that's where I'd start - if this were riscv I'd say something about unaligned data, but otherwise I haven't seen a stack like that before |
(it's possible it's some corner case that has been lying in wait, if that input is complex enough - register exhaustion or something that's not getting caught properly - but --trace_execution is the easiest way to spot that too) |
Ah yeah I think the unfortunate thing is it is only reproducible in |
runtime/src/iree/base/config.h line 272 - or add cflag -DIREE_VM_EXECUTION_TRACING_FORCE_ENABLE=1 |
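For anyone following along, here is a minimal sketch of a Release configure step with the tracing define from the comment above passed as a cflag. Only the define itself and `--trace_execution` come from this thread; the source and build directory names, the `RUN` dry-run switch, and the choice to pass the define through `CMAKE_C_FLAGS`/`CMAKE_CXX_FLAGS` are assumptions, and the CI's LTO settings are not reproduced here.

```shell
#!/bin/sh
set -eu

# Hypothetical locations; point these at your own checkout and build tree.
IREE_SRC="${IREE_SRC:-./iree}"
IREE_BUILD="${IREE_BUILD:-./iree-build-trace}"

# The define quoted in the comment above.
TRACE_DEFINE="-DIREE_VM_EXECUTION_TRACING_FORCE_ENABLE=1"

# Print a command, and execute it only when RUN=1 is set (dry run by default).
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Configure a Release build with the tracing define added to the compiler flags.
configure_with_tracing() {
  run cmake -S "$IREE_SRC" -B "$IREE_BUILD" \
    -DCMAKE_BUILD_TYPE=Release \
    "-DCMAKE_C_FLAGS=$TRACE_DEFINE" \
    "-DCMAKE_CXX_FLAGS=$TRACE_DEFINE"
}

configure_with_tracing
```

With `RUN=1` the configure step actually runs. After building, the tools would then be invoked with `--trace_execution` as discussed above.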
I have the trace_execution log here: https://gist.github.com/aviator19941/ec72f8c818a6f4cb4b056d44af3036f6 |
Interesting; this is likely a duplicate of #19795 which I was unable to figure out. In this case and that one it's an overwrite of an input argument with null doing something that is unidentifiable. I was unable to reproduce locally, and your compiler configurations may indicate why: I certainly wasn't building with release/lto. |
(in that issue if I added printfs in random places or enabled ASAN it'd fix things - it definitely smells like undefined behavior or a compilation error to me) |
Enabling UBSan also fixes things. I can see the hal.buffer_view output, although it does show runtime errors as well: https://gist.github.com/aviator19941/5e0fb6e8f51bb2ad60ffdead706aaa5a |
@benvanik Is there a way to get debug symbols in Release mode? Or other cflags similar to
Smaller repro with ONNX test: https://github.com/iree-org/iree-test-suites/tree/main/onnx_ops/onnx/node/generated/test_eyelike_without_dtype
|
RelWithDebInfo or adding -g to your compiler flags will add debug symbols |
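The two options in the comment above could be sketched as the following configure commands. This is a rough illustration, not the CI configuration: the directory names and the `RUN` dry-run switch are hypothetical, the clang-17 compilers are the ones mentioned earlier in the thread, and the CI's LTO flags are deliberately left out because their exact form is not given here.

```shell
#!/bin/sh
set -eu

# Hypothetical source location; adjust to your checkout.
IREE_SRC="${IREE_SRC:-./iree}"

# Print a command, and execute it only when RUN=1 is set (dry run by default).
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Option 1: CMake's built-in RelWithDebInfo configuration.
configure_relwithdebinfo() {
  run cmake -S "$IREE_SRC" -B ./build-relwithdebinfo \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DCMAKE_C_COMPILER=clang-17 \
    -DCMAKE_CXX_COMPILER=clang++-17
}

# Option 2: keep the Release configuration and only add -g.
configure_release_g() {
  run cmake -S "$IREE_SRC" -B ./build-release-g \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=clang-17 \
    -DCMAKE_CXX_COMPILER=clang++-17 \
    -DCMAKE_C_FLAGS=-g \
    -DCMAKE_CXX_FLAGS=-g
}

configure_relwithdebinfo
configure_release_g
```

Option 2 keeps the Release configuration and adds only debug info, which is why it may be the closer match when a bug shows up only in Release mode.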
I wasn't able to repro the issue with RelWithDebInfo, only Release mode, so I'll try -g to see if that helps. |
Other weird things, using -g (
Using IREE_STATUS_MODE=3, I wasn't able to repro the issue either.
Seems like the LTO flags are specified here.
Seems like there might be a race condition somewhere? I'm running out of ideas on how to get debug symbols while reproducing the segfault. Rob said there was some compiler magic that can dump symbols to a separate file in Release mode, but I'm not sure how to do it. |
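One common way to keep symbols in a separate file is GNU binutils' `objcopy`, sketched below. This may or may not be the approach Rob had in mind, and it still requires the binary to have been built with `-g`, so it does not help if adding `-g` alone changes the behavior. The binary path and the `RUN` dry-run switch are hypothetical.

```shell
#!/bin/sh
set -eu

# Hypothetical path to a Release binary that was built with -g.
BIN="${BIN:-./build-release-g/tools/iree-benchmark-module}"

# Print a command, and execute it only when RUN=1 is set (dry run by default).
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Move debug info into <binary>.debug and leave a debuglink in the binary.
split_debug_info() {
  bin="$1"
  dbg="$bin.debug"
  # 1. Copy only the debug sections into a side file.
  run objcopy --only-keep-debug "$bin" "$dbg"
  # 2. Remove the debug sections from the binary itself.
  run objcopy --strip-debug "$bin"
  # 3. Record the side file's name so debuggers such as gdb can locate it.
  run objcopy "--add-gnu-debuglink=$dbg" "$bin"
}

split_debug_info "$BIN"
```

The machine code in the stripped binary is unchanged by these steps; only the debug sections move to the `.debug` file.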
What happened?
I'm trying to run Llama 3.1 Instruct 8B with a tensor parallelism size of 8 using the IREE Python packages. I can compile and benchmark with a local build of IREE. However, the same command with the iree-runtime Python package exits without reporting anything when running `iree-run-module`/`iree-benchmark-module`. I'm not sure if this is expected or not.
Steps to reproduce your issue
iree-base-compiler iree-base-runtime iree-turbine
iree-benchmark-module
:iree-benchmark-module
What component(s) does this issue relate to?
Runtime
Version information
iree-base-compiler-3.2.0rc20250203
iree-base-runtime-3.2.0rc20250203
iree-turbine-3.2.0rc20250203
Additional context
No response