InferenceSession - Catastrophic Error or Unspecified Error is thrown #22815
Comments
A bit more context: I am the developer of Amuse.ai. Our app has been out for about a year running DirectML inference without issue. A few months back we upgraded from 1.18.1 to 1.19.0 and started getting a few error reports of "Catastrophic Error" when the user tried to load a model. However, it is now 10-20 reports a day, so it is somehow getting worse. A Windows update, perhaps? After upgrading to 1.20.0 we now also get this new error; I am actually hoping it's the root cause of |
2024-11-13 09:19:13.0746439 [E:onnxruntime:, inference_session.cc:2118 onnxruntime::InferenceSession::Initialize::<lambda_a18664140bfa1274480334618139aa6c>::operator ()] Exception during initialization: D:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(576)\onnxruntime.DLL!00007FFA288EE903: (caller: 00007FFA2886E449) Exception(1) tid(87c) 8000FFFF Catastrophic failure
2024-11-13 09:19:13.9065553 [E:onnxruntime:, inference_session.cc:2118 onnxruntime::InferenceSession::Initialize::<lambda_a18664140bfa1274480334618139aa6c>::operator ()] Exception during initialization: D:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(576)\onnxruntime.DLL!00007FFA288EE903: (caller: 00007FFA2886E449) Exception(2) tid(1a78) 80004005 Unspecified error |
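For anyone triaging a pile of these crash reports, a minimal sketch of pulling the HRESULT out of each log entry may help with grouping. It assumes the `Exception(N) tid(HEX) HRESULT` layout seen in the two entries above; the function name `extract_hresults` and the lookup table are illustrative only and are not part of ONNX Runtime. The two codes map to the standard Windows values `E_UNEXPECTED` (0x8000FFFF) and `E_FAIL` (0x80004005).

```python
import re

# Standard Windows HRESULT values that appear in the log entries above.
KNOWN_HRESULTS = {
    0x8000FFFF: "E_UNEXPECTED (Catastrophic failure)",
    0x80004005: "E_FAIL (Unspecified error)",
}

# Assumed layout, taken from the log above: "Exception(1) tid(87c) 8000FFFF ..."
_PATTERN = re.compile(
    r"Exception\((\d+)\)\s+tid\(([0-9A-Fa-f]+)\)\s+([0-9A-Fa-f]{8})\b"
)


def extract_hresults(log_text):
    """Return one dict per 'Exception(N) tid(X) HRESULT' match in log_text."""
    results = []
    for match in _PATTERN.finditer(log_text):
        code = int(match.group(3), 16)
        results.append({
            "exception_index": int(match.group(1)),
            "thread_id": int(match.group(2), 16),  # tid is printed in hex
            "hresult": code,
            "name": KNOWN_HRESULTS.get(code, "unknown"),
        })
    return results
```

Counting the results by `name` across many reports would show whether the two errors occur at similar rates or whether one dominates.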
AFAIK "com.microsoft.extensions" is used by onnxruntime-extensions. The extensions have to be manually registered by calling SessionOptions.RegisterOrtExtensions. If you call that multiple times you'll get an error about the DomainToVersion map. However that seems completely unrelated to any DML issues. |
OK, then that error is a new one and unrelated to the other two. I was hoping this new exception was the cause, but it just looks like a brand-new issue that bricks OnnxRuntime, sigh. We are unable to roll back to 1.18.1, as the Flux and SD3-Large models do not run on the lower opset. |
It seems to be system dependent: some systems do it, some don't. We have about 3000 concurrent active users, and maybe 4% face this issue. I only have one laptop PC that does it, sometimes, with no rhyme or reason: same OS, same everything. There is no state stored by the app that would affect DirectML initialization; it just seems to be a race condition inside the DML EP during initialization. |
Some debugging questions:
|
This issue seems to occur across various GPUs and driver types; I haven't identified a clear pattern in the crash reports. Both I have tested almost every combination of

The issue does not appear to be model-dependent. It occurs with the first model that is attempted to load, and the exception is thrown instantly. It doesn't seem to even reach IO/Disk, as even something as simple as a tokenizer, which has been used successfully thousands of times before, will throw the error.

I've been diagnosing this issue for several months and held off reporting it earlier to ensure it wasn't an issue in our application. After replicating the issue with a simple few lines of code outside of

This feels environmental to me due to the way the error presents itself in some cases. In about 90% of reports, the crash occurs the first time the app is started after a reboot. At one point, I was able to replicate this issue consistently on a test laptop, where it occurred after every reboot. Below are the tests I conducted:

Test 1 - Amuse

Test 2 - Amuse

In this test, the app was never started before the reboot and had no state, just fresh files copied to disk.

Test 3 - Debug App

The debug app was a simple .NET console application that opened a new model

Interestingly, the test laptop eventually stopped exhibiting this behavior and has not done so again, regardless of how many times I reboot. This intermittent behavior suggests a strange race condition. I've also occasionally encountered this issue during development in Visual Studio, about 2-3 times per week out of thousands of model loads.

Currently, we are in the process of rolling back to I will set up the debugging environment as you suggested and hope to capture more details if I'm lucky. |
Describe the issue
Version 1.19.0
Sometimes, when starting an InferenceSession, a Catastrophic Error or Unspecified Error exception is thrown.
Once this happens, no other sessions will work at all until the application is stopped and restarted.
New Unrelated Issue from Version 1.20.0
[ErrorCode:Fail] Trying to add a domain to DomainToVersion map, but the domain is already exist with version range (1, 1000). domain: "com.microsoft.extensions"
This is new in 1.20.0 and happens at random like the other two errors; however, it seems to be unrelated, per the comments. I upgraded to 1.20.0 to see if the first two errors were resolved, but they were not, and the upgrade introduced this new one.
To reproduce
new InferenceSession("Model.onnx") with a known working model
This is extremely hard to replicate, but we are getting plenty of error reports. In most cases it happens the first time after a system reboot; sometimes it just happens randomly.
Urgency
Urgent, live application that has started failing globally
Platform
Windows
OS Version
10 & 11
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.19.0
ONNX Runtime API
C#
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
1.19.0