Training Issues (NaN) When Migrating from PyTorch #2739
Comments
Thanks for flagging this! I believe this is due to the current implementation of the BCE loss. Should have a PR to fix this soon.
@laggui The latest version of Burn is unable to compile; the compiler asks for a higher recursion limit. With the recursion limit increased, training still produces NaN loss.
Yeah, we realized this the other day with the upgrade to wgpu 0.24.0; see this Discord convo for reference. This seems to stem from new complex types in wgpu. As a temporary fix you can actually follow the compiler's help: increase the recursion limit (default is 128). You probably don't need to double it to 256; something around 140 should work.
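For anyone hitting the same error, the attribute goes at the very top of the crate root (`main.rs` or `lib.rs`); the value below is just the suggestion from the comment above:

```rust
// Crate-level attribute; must appear at the top of main.rs or lib.rs.
// 140 is slightly above the default of 128, per the suggestion above; raise it
// further only if the compiler still reports the recursion limit error.
#![recursion_limit = "140"]

fn main() {
    // ... training code as usual
}
```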
I haven't actually tested the whole thing, just isolated the BCE loss bug initially 😅 Seems weird that your loss still NaNs 🤔 I'll check it out. Edit: just took a quick glance; it looks like it's actually coming from the first linear layer's parameters becoming NaN at some point. I'm assuming you validated the input data?
@laggui The input data are identical between PyTorch and Burn.
Some slight variations, but as long as the inputs during training don't have weird values that deviate, I don't think that will be the issue. I'll reopen this issue, but it doesn't seem to be specific to the BCE loss anymore.
It might also be a difference in the training configuration. If you have a higher learning rate or are missing weight decay, training can become unstable, resulting in NaN values that render the model useless.
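As an illustration only (the reporter's actual configuration is not shown in this thread, and the hyperparameter values below are placeholders), matching the PyTorch optimizer settings in Burn would look roughly like this:

```rust
use burn::optim::{decay::WeightDecayConfig, AdamConfig};

/// Sketch of an optimizer config with explicit weight decay; the values are
/// placeholders and should mirror the original PyTorch run
/// (e.g. torch.optim.Adam(params, lr=1e-3, weight_decay=1e-4)).
fn build_optimizer_config() -> AdamConfig {
    AdamConfig::new().with_weight_decay(Some(WeightDecayConfig::new(1e-4)))
    // Note: in Burn the learning rate is passed at each optimizer step
    // (optimizer.step(lr, model, grads)) rather than stored in this config.
}
```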
Tried different learning rates with no luck. There is a strange correlation between the training accuracy (46%) and the proportion of zero labels in the training set.
The model has not converged as expected if you're getting NaNs during training, so the solution that gives 46% accuracy probably defaulted to a simple heuristic that always predicts the same label, and so it is incorrect for all the zero labels if it always predicts one 🙂 Regarding the cause of the NaNs, I'm not quite sure at first glance; I'd have to spend a bit of time to investigate.
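As a standalone illustration (plain Rust, not from the linked repository) of why a collapsed constant predictor's accuracy simply mirrors the label distribution:

```rust
fn main() {
    // Toy label set: 46 zeros and 54 ones, so zeros make up 46% of the labels.
    let labels: Vec<u8> = std::iter::repeat(0u8)
        .take(46)
        .chain(std::iter::repeat(1u8).take(54))
        .collect();

    // A model that always emits the same label is correct exactly on the examples
    // carrying that label, so its accuracy equals that label's frequency.
    for constant in [0u8, 1u8] {
        let acc = labels.iter().filter(|&&y| y == constant).count() as f64
            / labels.len() as f64;
        println!("always-predict-{constant} accuracy: {acc:.2}"); // 0.46 and 0.54
    }
}
```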
@laggui thank you!
First, I want to express appreciation for the Burn framework - it's a great step toward bringing ML capabilities to Rust. I'm working on migrating several small PyTorch models to Rust, but I've encountered some issues with BCE loss calculation and training behavior.
When migrating a simple binary classifier from PyTorch to Burn, I'm seeing significant differences in the BCE loss values and the resulting training behavior.
The model architecture is identical between frameworks (input->64->32->1 with ReLU/Sigmoid).
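For reference, a minimal sketch of the described architecture as a Burn module; the names and API details below are illustrative assumptions, not code from the linked repository:

```rust
use burn::nn::{Linear, LinearConfig, Relu};
use burn::prelude::*;
use burn::tensor::activation::sigmoid;

#[derive(Module, Debug)]
pub struct Classifier<B: Backend> {
    fc1: Linear<B>,
    fc2: Linear<B>,
    out: Linear<B>,
    relu: Relu,
}

impl<B: Backend> Classifier<B> {
    /// input -> 64 -> 32 -> 1, with ReLU on the hidden layers and Sigmoid on the output.
    pub fn new(num_features: usize, device: &B::Device) -> Self {
        Self {
            fc1: LinearConfig::new(num_features, 64).init(device),
            fc2: LinearConfig::new(64, 32).init(device),
            out: LinearConfig::new(32, 1).init(device),
            relu: Relu::new(),
        }
    }

    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        let x = self.relu.forward(self.fc1.forward(x));
        let x = self.relu.forward(self.fc2.forward(x));
        sigmoid(self.out.forward(x))
    }
}
```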
Reproducible Example
I've created a minimal example repository: [burn-problems]
Key test cases demonstrate the issues:
BCE Loss Test
Training Results
Noticed that the accuracy is the same as the proportion of zero labels in the training set.
The critical section appears to be the BCE loss calculation:
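The snippet itself is omitted from this thread, but as a hedged sketch (tensor names and the epsilon value are illustrative), a hand-rolled BCE has to clamp its predictions away from 0 and 1 before taking logs, otherwise ln(0) yields the NaNs discussed above:

```rust
use burn::prelude::*;

/// Numerically stable manual binary cross-entropy:
/// loss = -mean(y * ln(p) + (1 - y) * ln(1 - p)), with p clamped away from {0, 1}.
fn bce_loss<B: Backend>(preds: Tensor<B, 1>, targets: Tensor<B, 1>) -> Tensor<B, 1> {
    let eps = 1e-7;
    // Without this clamp, a prediction of exactly 0.0 or 1.0 makes ln() return -inf,
    // and the resulting 0 * -inf produces NaN that then propagates through backprop.
    let preds = preds.clamp(eps, 1.0 - eps);
    let loss = targets.clone() * preds.clone().log()
        + (targets.neg() + 1.0) * (preds.neg() + 1.0).log();
    loss.mean().neg()
}
```

Burn's built-in `BinaryCrossEntropyLoss` (the implementation referenced at the top of the thread) is the alternative to a manual formulation like this.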
Questions