
Training Issues (NaN) When Migrating from PyTorch #2739

Open
oiwn opened this issue Jan 24, 2025 · 9 comments · Fixed by #2741
Labels
bug Something isn't working

Comments

oiwn commented Jan 24, 2025

First, I want to express appreciation for the Burn framework - it's a great step toward bringing ML capabilities to Rust. I'm working on migrating several small PyTorch models to Rust, but I've encountered some issues with BCE loss calculation and training behavior.

When migrating a simple binary classifier from PyTorch to Burn, I'm seeing significant differences in:

  • BCE loss calculation
  • Training behavior
  • Final model accuracy (46% with Burn vs. 92% with PyTorch)

The model architecture is identical between frameworks (input->64->32->1 with ReLU/Sigmoid).
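
For context, here is a minimal sketch of how that architecture might be declared in Burn (assuming a recent burn::nn API; the struct and field names simply mirror the model printout further down and are illustrative, not necessarily the exact code in the repository):

use burn::nn::{Linear, LinearConfig, Relu, Sigmoid};
use burn::prelude::*;

// Sketch of the 20 -> 64 -> 32 -> 1 binary classifier described above.
#[derive(Module, Debug)]
pub struct DemoClassifierModel<B: Backend> {
    input_layer: Linear<B>,
    hidden_layer1: Linear<B>,
    output_layer: Linear<B>,
    activation: Relu,
    sigmoid: Sigmoid,
}

impl<B: Backend> DemoClassifierModel<B> {
    pub fn new(device: &B::Device) -> Self {
        Self {
            input_layer: LinearConfig::new(20, 64).init(device),
            hidden_layer1: LinearConfig::new(64, 32).init(device),
            output_layer: LinearConfig::new(32, 1).init(device),
            activation: Relu::new(),
            sigmoid: Sigmoid::new(),
        }
    }

    // Input: [batch_size, 20] features; output: [batch_size, 1] probabilities.
    pub fn forward(&self, x: Tensor<B, 2>) -> Tensor<B, 2> {
        let x = self.activation.forward(self.input_layer.forward(x));
        let x = self.activation.forward(self.hidden_layer1.forward(x));
        self.sigmoid.forward(self.output_layer.forward(x))
    }
}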

Reproducible Example

I've created a minimal example repository: oiwn/burn-problems

Key test cases demonstrate the issues:

BCE Loss Test

# Python/PyTorch
Test Case 1 - Perfect predictions:
Predictions: tensor([1., 0., 1., 0.])
Targets:     tensor([1., 0., 1., 0.])
Loss:        0.00000000

Test Case 2 - Wrong predictions:
Predictions: tensor([0., 1., 0., 1.])
Targets:     tensor([1., 0., 1., 0.])
Loss:        100.00000000

Test Case 3 - Uncertain predictions:
Predictions: tensor([0.5000, 0.5000, 0.5000, 0.5000])
Targets:     tensor([1., 0., 1., 0.])
Loss:        0.69314718

// Rust/Burn
Test Case 1 - Perfect predictions:
Predictions: tensor([1.0000, 0.0000, 1.0000, 0.0000])
Targets:     tensor([1, 0, 1, 0])
Loss:        NaN

Test Case 2 - Wrong predictions:
Predictions: tensor([0.0000, 1.0000, 0.0000, 1.0000])
Targets:     tensor([1, 0, 1, 0])
Loss:        inf

Test Case 3 - Uncertain predictions:
Predictions: tensor([0.5000, 0.5000, 0.5000, 0.5000])
Targets:     tensor([1, 0, 1, 0])
Loss:        0.69314718

Training Results

  • PyTorch achieves 93.36% accuracy
  • Burn implementation stops at ~46% accuracy with NaN loss

Model:
DemoClassifierModel {
  input_layer: Linear {d_input: 20, d_output: 64, bias: true, params: 1344}
  hidden_layer1: Linear {d_input: 64, d_output: 32, bias: true, params: 2080}
  output_layer: Linear {d_input: 32, d_output: 1, bias: true, params: 33}
  activation: Relu
  sigmoid: Sigmoid
  params: 3457
}
Total Epochs: 20


| Split | Metric   | Min.     | Epoch    | Max.     | Epoch    |
|-------|----------|----------|----------|----------|----------|
| Train | Accuracy | 46.113   | 1        | 46.113   | 20       |
| Train | Loss     | 0.279    | 10       | NaN      | 20       |
| Valid | Accuracy | 45.766   | 1        | 45.766   | 20       |
| Valid | Loss     | 0.282    | 10       | NaN      | 20       |

I noticed that the accuracy matches the proportion of zero labels in the training set:

Train distribution: total=14690, zeroes=6742 (45.9%), ones=7948 (54.1%)

The critical section appears to be the BCE loss calculation:

// Current implementation
let loss = BinaryCrossEntropyLossConfig::new()
    .init(&output.device())
    .forward(output.clone().squeeze(1), targets.clone());

// Alternative attempt
// let loss = BinaryCrossEntropyLossConfig::new()
//     .init(&output.device())
//     .forward(output.clone(), targets.clone().reshape([batch_size, 1]));

Questions

  1. What is the recommended way to handle tensor shapes for BCE loss in Burn?
  2. Are there any known issues with batch dimension handling that could cause these discrepancies?
  3. Should the loss calculation approach differ from PyTorch's implementation?
laggui added the bug label Jan 24, 2025

laggui (Member) commented Jan 24, 2025

Thanks for flagging this!

I believe this is due to the current BCE loss implementation: tensor.log() results in -inf for values of 0.0. We need to clamp the values to make sure we don't run into this.
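
The idea is roughly the following (a sketch only, not the actual patch; it assumes float probabilities and float targets, and eps is an arbitrary small constant):

use burn::tensor::{backend::Backend, Tensor};

// Sketch: clamp predicted probabilities away from exact 0.0 / 1.0 so that
// log(0) = -inf can never enter the BCE computation.
fn bce_with_clamp<B: Backend>(
    predictions: Tensor<B, 1>, // probabilities in [0, 1]
    targets: Tensor<B, 1>,     // 0.0 / 1.0 labels, as floats for this sketch
) -> Tensor<B, 1> {
    let eps = 1e-7;
    let preds = predictions.clamp(eps, 1.0 - eps);
    let loss = targets.clone() * preds.clone().log()
        + (targets.neg() + 1.0) * (preds.neg() + 1.0).log();
    loss.mean().neg()
}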

Should have a PR to fix this soon.

laggui mentioned this issue Jan 24, 2025

oiwn (Author) commented Jan 24, 2025

@laggui The latest version of Burn fails to compile with the Wgpu backend:

error[E0275]: overflow evaluating the requirement `wgpu_core::validation::NumericType: Sync`                                                                                               
    |                                                                                                                                                                                      
    = help: consider increasing the recursion limit by adding a `#![recursion_limit = "256"]` attribute to your crate (`burn_problems`)                                                    
note: required because it appears within the type `wgpu_core::validation::InterfaceVar`                                                                                                    
   --> .cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-core-24.0.0/src/validation.rs:109:12
    |
109 | pub struct InterfaceVar {
    |            ^^^^^^^^^^^^
note: required because it appears within the type `wgpu_core::validation::Varying`
   --> .cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-core-24.0.0/src/validation.rs:136:6
    |
136 | enum Varying {
    |      ^^^^^^^
note: required because it appears within the type `PhantomData<wgpu_core::validation::Varying>`
   --> .rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/marker.rs:753:12
    |
753 | pub struct PhantomData<T: ?Sized>;

....

With backend::NdArray I got the same results (46%), so the issue is probably not in the BCE loss =(

oiwn/burn-problems@c36f56a

Model:                                                                                                                                                                                     
DemoClassifierModel {                                                                                                                                                                      
  input_layer: Linear {d_input: 20, d_output: 64, bias: true, params: 1344}                                                                                                                
  hidden_layer1: Linear {d_input: 64, d_output: 32, bias: true, params: 2080}                                                                                                              
  output_layer: Linear {d_input: 32, d_output: 1, bias: true, params: 33}                                                                                                                  
  activation: Relu                                                                                                                                                                         
  sigmoid: Sigmoid                                                                                                                                                                         
  params: 3457                                                                                                                                                                             
}                                                                                                                                                                                          
Total Epochs: 20                                                                                                                                                                           
                                                                                                                                                                                           
                                                                                                                                                                                           
| Split | Metric   | Min.     | Epoch    | Max.     | Epoch    |                                                                                                                           
|-------|----------|----------|----------|----------|----------|                                                                                                                           
| Train | Loss     | 0.265    | 13       | NaN      | 20       |                                                                                                                           
| Train | Accuracy | 46.052   | 1        | 46.052   | 20       |                                                                                                                           
| Valid | Loss     | 0.258    | 13       | NaN      | 20       |                                                                                                                           
| Valid | Accuracy | 46.011   | 1        | 46.011   | 20       |  

laggui (Member) commented Jan 24, 2025

Yeah, we realized this the other day with the upgrade to wgpu 24.0.0. See this Discord convo for reference.

This seems to stem from new complex types in wgpu. As a temporary fix you can actually follow the compiler's help: increase the recursion limit (the default is 128). You probably don't need to double it to 256; something around 140 should work.
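
Concretely, the workaround is a crate-level attribute at the very top of main.rs (or lib.rs); 256 below is just the value the compiler's note suggests:

// Temporary workaround for the E0275 recursion overflow triggered by the
// wgpu 24.0.0 types; the default limit is 128, so ~140 may already be enough.
#![recursion_limit = "256"]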

With backend::NdArray I got the same results (46%), so the issue is probably not in the BCE loss =(

I haven't actually tested the whole thing, just isolated the BCE loss bug initially 😅 Seems weird that your loss still NaNs 🤔 I'll check it out

/edit: just took a quick glance, looks like it's actually coming from the first linear layer parameters becoming NaN at some point. I'm assuming you validated the input data?

oiwn (Author) commented Jan 25, 2025

@laggui The input data are identical for PyTorch and Burn.

PyTorch:

=== Training Data Check (Python) ===

Dataset sizes:
Train: 14690 Test: 3673

Feature statistics (train) (first 3):

Feature 0 (feature1):
Mean:   0.0029
StdDev: 1.0035
Min:    -2.1987
Max:    5.7738

Feature 1 (feature2):
Mean:   -0.0012
StdDev: 0.9764
Min:    -1.3040
Max:    17.6505

Feature 2 (feature3):
Mean:   0.0044
StdDev: 1.0126
Min:    -0.7085
Max:    7.9158


Burn:

=== Training Data Check (Rust) ===

Dataset sizes:
Train: 14690 Test: 3673

Feature statistics (train) (first 3):

Feature 0:
Mean:   -0.0034
StdDev: 1.0025
Min:    -2.1987
Max:    5.7738

Feature 1:
Mean:   0.0006
StdDev: 0.9915
Min:    -1.3040
Max:    21.7539

Feature 2:
Mean:   0.0012
StdDev: 1.0048
Min:    -0.7085
Max:    7.9158

laggui (Member) commented Jan 27, 2025

Some slight variations, but as long as the inputs during training don't contain weird outlier values, I don't think that will be the issue.

I'll reopen this issue but it doesn't seem to be specific to the BCE loss anymore.

laggui reopened this Jan 27, 2025
laggui changed the title from "BCE Loss and Training Issues When Migrating from PyTorch" to "Training Issues (NaN) When Migrating from PyTorch" Jan 27, 2025

nathanielsimard (Member) commented

It might also be a difference in the training configuration. If you have a higher learning rate or are missing weight decay, it might lead to unstable training, resulting in NaN values, which render the model useless.
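
For example, a weight decay term can be added to the optimizer configuration along these lines (a sketch assuming Burn's AdamConfig and WeightDecayConfig; the values are illustrative, not a recommendation):

use burn::optim::{decay::WeightDecayConfig, AdamConfig};

// Illustrative only: a conservative learning rate plus explicit weight decay,
// roughly mirroring a PyTorch Adam(lr=1e-3, weight_decay=1e-4) setup.
let learning_rate = 1e-3;
let optimizer_config = AdamConfig::new()
    .with_weight_decay(Some(WeightDecayConfig::new(1e-4)));
// optimizer_config.init() is then handed to the Learner / training loop.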

oiwn (Author) commented Jan 29, 2025

It might also be a difference in the training configuration. If you have a higher learning rate or are missing weight decay, it might lead to unstable training, resulting in NaN values, which render the model useless.

I tried different learning rates with no luck. There is a strange correlation between the training accuracy (46%) and the proportion of zero labels in the training set.

laggui (Member) commented Jan 29, 2025

There is a strange correlation between the training accuracy (46%) and the proportion of zero labels in the training set.

If you're getting NaNs during training, the model has not converged as expected, so the run that gives 46% accuracy probably collapsed to a simple heuristic that always predicts the same label. In that case it is incorrect for all the zero labels if it always predicts one 🙂

Regarding the cause of the NaNs, I'm not quite sure at first glance. I would have to spend a bit of time investigating.
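
As a quick sanity check of the constant-prediction hypothesis against the label distribution posted above (plain arithmetic, nothing Burn-specific):

// A constant predictor's accuracy equals the frequency of the class it predicts.
let zeroes = 6742.0_f64;
let ones = 7948.0_f64;
let total = zeroes + ones;            // 14690
let acc_always_zero = zeroes / total; // ~0.459, i.e. the observed ~46%
let acc_always_one = ones / total;    // ~0.541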

oiwn (Author) commented Jan 31, 2025

@laggui thank you!
