code repetition in train methods #922

Open · janfb opened this issue Jan 24, 2024 · 1 comment
Labels: architecture (Internal changes without API consequences), enhancement (New feature or request), hackathon

janfb (Contributor) commented Jan 24, 2024

Description:

The current implementation of the SBI library contains significant code duplication within the train(...) methods of SNPE, SNRE, and SNLE. These methods share many common functionalities, including:

  • Building the neural network
  • Resuming training
  • Managing the training and validation loops

This redundancy increases the complexity of the codebase, making it harder to maintain and more prone to inconsistencies and bugs, particularly during updates or enhancements.

To address this, we propose refactoring these methods by introducing a unified train function in the base class. This common train function would handle the shared parts of the training process and accept method-specific losses and other relevant keyword arguments to account for the differences between SNPE, SNRE, and SNLE.
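
As a rough illustration, a unified loop in the base class could look like the sketch below. This is not sbi's actual API: the class name NeuralInferenceBase and the hook name _get_losses are hypothetical placeholders, and the loop body is condensed from the snippets listed under "Example redundancies" below.

    import torch
    from torch.nn.utils import clip_grad_norm_


    class NeuralInferenceBase:  # hypothetical name for the shared base class
        def _train_loop(
            self,
            train_loader,
            max_num_epochs,
            stop_after_epochs,
            clip_max_norm=None,
            **loss_kwargs,
        ):
            """Shared training loop; only the loss computation is method-specific."""
            while self.epoch <= max_num_epochs and not self._converged(
                self.epoch, stop_after_epochs
            ):
                # Train for a single epoch.
                self._neural_net.train()
                train_log_probs_sum = 0
                for batch in train_loader:
                    self.optimizer.zero_grad()
                    # Delegate batch unpacking and loss evaluation to the subclass.
                    train_losses = self._get_losses(batch, **loss_kwargs)
                    train_loss = torch.mean(train_losses)
                    train_log_probs_sum -= train_losses.sum().item()
                    train_loss.backward()
                    if clip_max_norm is not None:
                        clip_grad_norm_(
                            self._neural_net.parameters(), max_norm=clip_max_norm
                        )
                    self.optimizer.step()
                self.epoch += 1
                train_log_prob_average = train_log_probs_sum / (
                    len(train_loader) * train_loader.batch_size
                )
                self._summary["training_log_probs"].append(train_log_prob_average)

        def _get_losses(self, batch, **loss_kwargs):
            """Hypothetical hook: subclasses unpack the batch and call their loss."""
            raise NotImplementedError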

Example redundancies

  • SNPE:
    while self.epoch <= max_num_epochs and not self._converged(
        self.epoch, stop_after_epochs
    ):
        # Train for a single epoch.
        self._neural_net.train()
        train_log_probs_sum = 0
        epoch_start_time = time.time()
        for batch in train_loader:
            self.optimizer.zero_grad()
            # Get batches on current device.
            theta_batch, x_batch, masks_batch = (
                batch[0].to(self._device),
                batch[1].to(self._device),
                batch[2].to(self._device),
            )
            train_losses = self._loss(
                theta_batch,
                x_batch,
                masks_batch,
                proposal,
                calibration_kernel,
                force_first_round_loss=force_first_round_loss,
            )
            train_loss = torch.mean(train_losses)
            train_log_probs_sum -= train_losses.sum().item()
            train_loss.backward()
            if clip_max_norm is not None:
                clip_grad_norm_(
                    self._neural_net.parameters(), max_norm=clip_max_norm
                )
            self.optimizer.step()
        self.epoch += 1
        train_log_prob_average = train_log_probs_sum / (
            len(train_loader) * train_loader.batch_size  # type: ignore
        )
        self._summary["training_log_probs"].append(train_log_prob_average)
  • SNLE:
    while self.epoch <= max_num_epochs and not self._converged(
        self.epoch, stop_after_epochs
    ):
        # Train for a single epoch.
        self._neural_net.train()
        train_log_probs_sum = 0
        for batch in train_loader:
            self.optimizer.zero_grad()
            theta_batch, x_batch = (
                batch[0].to(self._device),
                batch[1].to(self._device),
            )
            # Evaluate on x with theta as context.
            train_losses = self._loss(theta=theta_batch, x=x_batch)
            train_loss = torch.mean(train_losses)
            train_log_probs_sum -= train_losses.sum().item()
            train_loss.backward()
            if clip_max_norm is not None:
                clip_grad_norm_(
                    self._neural_net.parameters(),
                    max_norm=clip_max_norm,
                )
            self.optimizer.step()
        self.epoch += 1
        train_log_prob_average = train_log_probs_sum / (
            len(train_loader) * train_loader.batch_size  # type: ignore
        )
        self._summary["training_log_probs"].append(train_log_prob_average)
  • SNRE:
    while self.epoch <= max_num_epochs and not self._converged(
        self.epoch, stop_after_epochs
    ):
        # Train for a single epoch.
        self._neural_net.train()
        train_log_probs_sum = 0
        for batch in train_loader:
            self.optimizer.zero_grad()
            theta_batch, x_batch = (
                batch[0].to(self._device),
                batch[1].to(self._device),
            )
            train_losses = self._loss(
                theta_batch, x_batch, num_atoms, **loss_kwargs
            )
            train_loss = torch.mean(train_losses)
            train_log_probs_sum -= train_losses.sum().item()
            train_loss.backward()
            if clip_max_norm is not None:
                clip_grad_norm_(
                    self._neural_net.parameters(),
                    max_norm=clip_max_norm,
                )
            self.optimizer.step()
        self.epoch += 1
        train_log_prob_average = train_log_probs_sum / (
            len(train_loader) * train_loader.batch_size  # type: ignore
        )
        self._summary["training_log_probs"].append(train_log_prob_average)

Proposed Steps

  • Identify and abstract the common code segments across the train methods of SNPE, SNRE, and SNLE.
  • Design a generic train function in the base class that accepts the method-specific losses and other arguments unique to each method. Parts shared by some, but not all, methods should be offloaded into separate class methods that child classes can override if required (see the sketch after this list).
  • Refactor the existing train methods to utilize the new generic function, passing their specific requirements as arguments.
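
In the snippets above, only the batch unpacking and the call to self._loss differ between the three methods; the surrounding loop structure is identical. Continuing the hypothetical sketch from above (the hook and class names are placeholders, not sbi's actual API), the subclasses would then only override the loss hook, for example:

    class SNLEStyle(NeuralInferenceBase):  # hypothetical subclass name
        def _get_losses(self, batch, **loss_kwargs):
            theta_batch, x_batch = (
                batch[0].to(self._device),
                batch[1].to(self._device),
            )
            # Evaluate on x with theta as context.
            return self._loss(theta=theta_batch, x=x_batch)


    class SNPEStyle(NeuralInferenceBase):  # hypothetical subclass name
        def _get_losses(
            self,
            batch,
            proposal=None,
            calibration_kernel=None,
            force_first_round_loss=False,
            **loss_kwargs,
        ):
            theta_batch, x_batch, masks_batch = (
                batch[0].to(self._device),
                batch[1].to(self._device),
                batch[2].to(self._device),
            )
            return self._loss(
                theta_batch,
                x_batch,
                masks_batch,
                proposal,
                calibration_kernel,
                force_first_round_loss=force_first_round_loss,
            )

SNRE would analogously pass num_atoms and its loss_kwargs through the same hook, so none of the subclasses would need to reimplement the epoch and batch loops.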

We encourage contributors to discuss strategies for this refactoring and help with the implementation. This effort will improve the library’s maintainability and ensure consistency across its components.

If you identify other areas where significant code duplication can be reduced, please create a new issue (e.g., #921).

janfb added the enhancement, architecture, and hackathon labels on Jan 24, 2024
janfb added this to the Pre Hackathon 2024 milestone on Feb 6, 2024
janfb self-assigned this on Feb 16, 2024
janfb removed the hackathon label on Jul 22, 2024
janfb (Contributor, Author) commented Jul 22, 2024

This will become even more relevant when we have a common dataloader interface and agnostic loss functions for all SBI methods. But I am removing the hackathon label for now as it will not be done before the release.

janfb removed this from the Hackathon 2024 milestone on Jul 22, 2024