
Training Issues (NaN) When Migrating from PyTorch #2739

Open
oiwn opened this issue Jan 24, 2025 · 9 comments · Fixed by #2741
Labels
bug Something isn't working

Comments

oiwn commented Jan 24, 2025

First, I want to express appreciation for the Burn framework - it's a great step toward bringing ML capabilities to Rust. I'm working on migrating several small PyTorch models to Rust, but I've encountered some issues with BCE loss calculation and training behavior.

When migrating a simple binary classifier from PyTorch to Burn, I'm seeing significant differences in:

  • BCE loss calculation
  • Training behavior
  • Final model accuracy (46% with Burn vs. 92% with PyTorch)

The model architecture is identical between frameworks (input->64->32->1 with ReLU/Sigmoid).
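
For context, here is a minimal sketch of how that architecture might be declared in Burn (assuming a recent burn::nn API; the struct and field names simply mirror the model printout further down and are illustrative, not necessarily the exact code in the repository):

use burn::nn::{Linear, LinearConfig, Relu, Sigmoid};
use burn::prelude::*;

// Sketch of the 20 -> 64 -> 32 -> 1 binary classifier described above.
#[derive(Module, Debug)]
pub struct DemoClassifierModel<B: Backend> {
    input_layer: Linear<B>,
    hidden_layer1: Linear<B>,
    output_layer: Linear<B>,
    activation: Relu,
    sigmoid: Sigmoid,
}

impl<B: Backend> DemoClassifierModel<B> {
    pub fn new(device: &B::Device) -> Self {
        Self {
            input_layer: LinearConfig::new(20, 64).init(device),
            hidden_layer1: LinearConfig::new(64, 32).init(device),
            output_layer: LinearConfig::new(32, 1).init(device),
            activation: Relu::new(),
            sigmoid: Sigmoid::new(),
        }
    }

    // Input: [batch_size, 20] features; output: [batch_size, 1] probabilities.
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        let x = self.activation.forward(self.input_layer.forward(x));
        let x = self.activation.forward(self.hidden_layer1.forward(x));
        self.sigmoid.forward(self.output_layer.forward(x))
    }
}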

Reproducible Example

I've created a minimal example repository: oiwn/burn-problems

Key test cases demonstrate the issues:

BCE Loss Test

# Python/PyTorch
Test Case 1 - Perfect predictions:
Predictions: tensor([1., 0., 1., 0.])
Targets:     tensor([1., 0., 1., 0.])
Loss:        0.00000000

Test Case 2 - Wrong predictions:
Predictions: tensor([0., 1., 0., 1.])
Targets:     tensor([1., 0., 1., 0.])
Loss:        100.00000000

Test Case 3 - Uncertain predictions:
Predictions: tensor([0.5000, 0.5000, 0.5000, 0.5000])
Targets:     tensor([1., 0., 1., 0.])
Loss:        0.69314718

// Rust/Burn
Test Case 1 - Perfect predictions:
Predictions: tensor([1.0000, 0.0000, 1.0000, 0.0000])
Targets:     tensor([1, 0, 1, 0])
Loss:        NaN

Test Case 2 - Wrong predictions:
Predictions: tensor([0.0000, 1.0000, 0.0000, 1.0000])
Targets:     tensor([1, 0, 1, 0])
Loss:        inf

Test Case 3 - Uncertain predictions:
Predictions: tensor([0.5000, 0.5000, 0.5000, 0.5000])
Targets:     tensor([1, 0, 1, 0])
Loss:        0.69314718

Training Results

  • PyTorch achieves 93.36% accuracy
  • Burn implementation stops at ~46% accuracy with NaN loss

Model:
DemoClassifierModel {
  input_layer: Linear {d_input: 20, d_output: 64, bias: true, params: 1344}
  hidden_layer1: Linear {d_input: 64, d_output: 32, bias: true, params: 2080}
  output_layer: Linear {d_input: 32, d_output: 1, bias: true, params: 33}
  activation: Relu
  sigmoid: Sigmoid
  params: 3457
}
Total Epochs: 20


| Split | Metric   | Min.     | Epoch    | Max.     | Epoch    |
|-------|----------|----------|----------|----------|----------|
| Train | Accuracy | 46.113   | 1        | 46.113   | 20       |
| Train | Loss     | 0.279    | 10       | NaN      | 20       |
| Valid | Accuracy | 45.766   | 1        | 45.766   | 20       |
| Valid | Loss     | 0.282    | 10       | NaN      | 20       |

I noticed that the accuracy matches the proportion of zero labels in the training set:

Train distribution: total=14690, zeroes=6742 (45.9%), ones=7948 (54.1%)

The critical section appears to be the BCE loss calculation:

// Current implementation
let loss = BinaryCrossEntropyLossConfig::new()
    .init(&output.device())
    .forward(output.clone().squeeze(1), targets.clone());

// Alternative attempt
// let loss = BinaryCrossEntropyLossConfig::new()
//     .init(&output.device())
//     .forward(output.clone(), targets.clone().reshape([batch_size, 1]));

Questions

  1. What is the recommended way to handle tensor shapes for BCE loss in Burn?
  2. Are there any known issues with batch dimension handling that could cause these discrepancies?
  3. Should the loss calculation approach differ from PyTorch's implementation?
laggui added the bug label Jan 24, 2025

laggui (Member) commented Jan 24, 2025

Thanks for flagging this!

I believe this is due to the current BCE loss implementation: tensor.log() results in -inf for values of 0.0. We need to clamp the values to make sure we don't run into this.
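
The idea is roughly the following (a sketch only, not the actual patch; it assumes float probabilities and float targets, and eps is an arbitrary small constant):

use burn::tensor::{backend::Backend, Tensor};

// Sketch: clamp predicted probabilities away from exact 0.0 / 1.0 so that
// log(0) = -inf can never enter the BCE computation.
fn bce_with_clamp<B: Backend>(
    predictions: Tensor<B, 1>, // probabilities in [0, 1]
    targets: Tensor<B, 1>,     // 0.0 / 1.0 labels, as floats for this sketch
) -> Tensor<B, 1> {
    let eps = 1e-7;
    let preds = predictions.clamp(eps, 1.0 - eps);
    let loss = targets.clone() * preds.clone().log()
        + (targets.neg() + 1.0) * (preds.neg() + 1.0).log();
    loss.mean().neg()
}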

Should have a PR to fix this soon.

laggui mentioned this issue Jan 24, 2025

oiwn (Author) commented Jan 24, 2025

@laggui The latest version of Burn fails to compile with the Wgpu backend:

error[E0275]: overflow evaluating the requirement `wgpu_core::validation::NumericType: Sync`                                                                                               
    |                                                                                                                                                                                      
    = help: consider increasing the recursion limit by adding a `#![recursion_limit = "256"]` attribute to your crate (`burn_problems`)                                                    
note: required because it appears within the type `wgpu_core::validation::InterfaceVar`                                                                                                    
   --> .cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-core-24.0.0/src/validation.rs:109:12
    |
109 | pub struct InterfaceVar {
    |            ^^^^^^^^^^^^
note: required because it appears within the type `wgpu_core::validation::Varying`
   --> .cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-core-24.0.0/src/validation.rs:136:6
    |
136 | enum Varying {
    |      ^^^^^^^
note: required because it appears within the type `PhantomData<wgpu_core::validation::Varying>`
   --> .rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/marker.rs:753:12
    |
753 | pub struct PhantomData<T: ?Sized>;

....

With backend::NdArray I got the same results (46%), so the issue is probably not in the BCE loss =(

oiwn/burn-problems@c36f56a

Model:                                                                                                                                                                                     
DemoClassifierModel {                                                                                                                                                                      
  input_layer: Linear {d_input: 20, d_output: 64, bias: true, params: 1344}                                                                                                                
  hidden_layer1: Linear {d_input: 64, d_output: 32, bias: true, params: 2080}                                                                                                              
  output_layer: Linear {d_input: 32, d_output: 1, bias: true, params: 33}                                                                                                                  
  activation: Relu                                                                                                                                                                         
  sigmoid: Sigmoid                                                                                                                                                                         
  params: 3457                                                                                                                                                                             
}                                                                                                                                                                                          
Total Epochs: 20                                                                                                                                                                           
                                                                                                                                                                                           
                                                                                                                                                                                           
| Split | Metric   | Min.     | Epoch    | Max.     | Epoch    |                                                                                                                           
|-------|----------|----------|----------|----------|----------|                                                                                                                           
| Train | Loss     | 0.265    | 13       | NaN      | 20       |                                                                                                                           
| Train | Accuracy | 46.052   | 1        | 46.052   | 20       |                                                                                                                           
| Valid | Loss     | 0.258    | 13       | NaN      | 20       |                                                                                                                           
| Valid | Accuracy | 46.011   | 1        | 46.011   | 20       |  

laggui (Member) commented Jan 24, 2025

Yeah, we realized this the other day with the upgrade to wgpu 24.0.0. See this Discord convo for reference.

This seems to stem from new complex types in wgpu. As a temporary fix you can actually follow the compiler's help: increase the recursion limit (the default is 128). You probably don't need to double it to 256; something around 140 should work.
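
Concretely, the workaround is a crate-level attribute at the very top of main.rs (or lib.rs); 256 below is just the value the compiler's note suggests:

// Temporary workaround for the E0275 recursion overflow triggered by the
// wgpu 24.0.0 types; the default limit is 128, so ~140 may already be enough.
#![recursion_limit = "256"]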

With backend::NdArray I got the same results (46%), so the issue is probably not in the BCE loss =(

I haven't actually tested the whole thing, just isolated the BCE loss bug initially 😅 Seems weird that your loss still NaNs 🤔 I'll check it out

/edit: just took a quick glance, looks like it's actually coming from the first linear layer parameters becoming NaN at some point. I'm assuming you validated the input data?

oiwn (Author) commented Jan 25, 2025

@laggui The input data are identical for PyTorch and Burn.

PyTorch:

=== Training Data Check (Python) ===

Dataset sizes:
Train: 14690 Test: 3673

Feature statistics (train) (first 3):

Feature 0 (feature1):
Mean:   0.0029
StdDev: 1.0035
Min:    -2.1987
Max:    5.7738

Feature 1 (feature2):
Mean:   -0.0012
StdDev: 0.9764
Min:    -1.3040
Max:    17.6505

Feature 2 (feature3):
Mean:   0.0044
StdDev: 1.0126
Min:    -0.7085
Max:    7.9158


Burn:

=== Training Data Check (Rust) ===

Dataset sizes:
Train: 14690 Test: 3673

Feature statistics (train) (first 3):

Feature 0:
Mean:   -0.0034
StdDev: 1.0025
Min:    -2.1987
Max:    5.7738

Feature 1:
Mean:   0.0006
StdDev: 0.9915
Min:    -1.3040
Max:    21.7539

Feature 2:
Mean:   0.0012
StdDev: 1.0048
Min:    -0.7085
Max:    7.9158

laggui (Member) commented Jan 27, 2025

Some slight variations, but as long as the inputs during training don't contain weird outlier values, I don't think that will be the issue.

I'll reopen this issue but it doesn't seem to be specific to the BCE loss anymore.

laggui reopened this Jan 27, 2025
laggui changed the title from "BCE Loss and Training Issues When Migrating from PyTorch" to "Training Issues (NaN) When Migrating from PyTorch" Jan 27, 2025

nathanielsimard (Member) commented

It might also be a difference in the training configuration. If you have a higher learning rate or are missing weight decay, it might lead to unstable training, resulting in NaN values, which render the model useless.
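
For example, a weight decay term can be added to the optimizer configuration along these lines (a sketch assuming Burn's AdamConfig and WeightDecayConfig; the values are illustrative, not a recommendation):

use burn::optim::{decay::WeightDecayConfig, AdamConfig};

// Illustrative only: a conservative learning rate plus explicit weight decay,
// roughly mirroring a PyTorch Adam(lr=1e-3, weight_decay=1e-4) setup.
let learning_rate = 1e-3;
let optimizer_config = AdamConfig::new()
    .with_weight_decay(Some(WeightDecayConfig::new(1e-4)));
// optimizer_config.init() is then handed to the Learner / training loop.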

oiwn (Author) commented Jan 29, 2025

It might also be a difference in the training configuration. If you have a higher learning rate or are missing weight decay, it might lead to unstable training, resulting in NaN values, which render the model useless.

I tried different learning rates with no luck. There is a strange correlation between the training accuracy (46%) and the proportion of zero labels in the training set.

laggui (Member) commented Jan 29, 2025

There is a strange correlation between the training accuracy (46%) and the proportion of zero labels in the training set.

If you're getting NaNs during training, the model has not converged as expected, so the run that gives 46% accuracy probably collapsed to a simple heuristic that always predicts the same label. In that case it is incorrect for all the zero labels if it always predicts one 🙂

Regarding the cause of the NaNs, I'm not quite sure at first glance. I would have to spend a bit of time investigating.
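
As a quick sanity check of the constant-prediction hypothesis against the label distribution posted above (plain arithmetic, nothing Burn-specific):

// A constant predictor's accuracy equals the frequency of the class it predicts.
let zeroes = 6742.0_f64;
let ones = 7948.0_f64;
let total = zeroes + ones;            // 14690
let acc_always_zero = zeroes / total; // ~0.459, i.e. the observed ~46%
let acc_always_one = ones / total;    // ~0.541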

oiwn (Author) commented Jan 31, 2025

@laggui thank you!
