internal assert failure from sync_information.cpp #4052
Comments
This seems to expose a non-trivial bug in how the normalization schedulers cache inputs. I need to think about it more deeply. In the meantime, this should work around the issue for this repro, if you really need this repro to work now.
Note that this is really just an ad-hoc WAR (workaround) for this particular repro.
The model is defined as follows:

```python
model = nn.Sequential(
    nn.Linear(in_features, out_features, bias=bias),
    nn.GELU(approximate="tanh"),
    nn.Linear(out_features, out_features, bias=bias),
).to(device=device, dtype=torch_dtype)
```

Would you expect not to see this error with a model of ...?
Not sure. It turned out the issue can be rather pervasive. I created a separate issue (#4074) to clarify what's happening.
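For illustration, here is a minimal, self-contained sketch of how the model above might be constructed and exercised. The concrete values for `in_features`, `out_features`, `bias`, `device`, and `torch_dtype`, as well as the use of `thunder.jit` to route the module through nvFuser, are assumptions and may differ from the actual repro script.

```python
# Minimal sketch (not the actual repro script): builds the two-layer
# Linear -> GELU(tanh) -> Linear module described above and runs one
# forward pass. Sizes, dtype, device, and the thunder.jit entry point
# are placeholder assumptions, not details taken from the issue.
import torch
import torch.nn as nn
import thunder

in_features, out_features = 64, 64   # assumed sizes
bias = True                          # assumed
device = "cuda"
torch_dtype = torch.bfloat16         # assumed, matching the bfloat16 test filter

model = nn.Sequential(
    nn.Linear(in_features, out_features, bias=bias),
    nn.GELU(approximate="tanh"),
    nn.Linear(out_features, out_features, bias=bias),
).to(device=device, dtype=torch_dtype)

jitted = thunder.jit(model)          # assumed way the repro reaches nvFuser
x = torch.randn(8, in_features, device=device, dtype=torch_dtype)
out = jitted(x)
print(out.shape)
```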
environment
pjnl-20250309
repro script
error
I got what follows with `NVFUSER_DISABLE=parallel_compile python repro.py`:
context
I faced this while running `test_torchao_float8_linear` of `test_tensor_subclass.py` with `-k "nvfuser and bfloat16 and true"`, where torchao is v0.7.0.