This is the implementation of AdamMCMC, a (stochastic) Metropolis-Hastings algorithm with proposals centered on Adam-update steps. This construction allows efficient sampling at low temperatures, and increasing the width of the proposal distribution in the direction of the update steps keeps the acceptance probability from vanishing.
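To illustrate the idea, here is a simplified, self-contained sketch of a single step. This is not the code in src/AdamMCMC.py (which, among other things, widens the proposal covariance along the update direction via the sigma_adam_dir argument used further below); update_fn, loss_fn and the function name are placeholders.

```python
import torch

def adammcmc_step_sketch(theta, loss_fn, update_fn, temp=1.0, sigma=0.02):
    """One Metropolis-Hastings step with an isotropic Gaussian proposal
    centered on an Adam-like update, targeting exp(-temp * loss)."""
    mean_fwd = theta - update_fn(theta)                      # proposal mean at the current point
    theta_prop = mean_fwd + sigma * torch.randn_like(theta)  # propose new parameters

    # the reverse proposal is centered on the update at the proposed point;
    # this second evaluation is why the MH correction roughly doubles the runtime
    mean_rev = theta_prop - update_fn(theta_prop)

    def log_q(x, mean):  # isotropic Gaussian log-density, up to a constant
        return -0.5 * torch.sum((x - mean) ** 2) / sigma**2

    log_alpha = (-temp * loss_fn(theta_prop) + log_q(theta, mean_rev)) \
                - (-temp * loss_fn(theta) + log_q(theta_prop, mean_fwd))
    if torch.log(torch.rand(())) < log_alpha:
        return theta_prop, True   # accept the proposal
    return theta, False           # reject, keep the current parameters
```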
The implementation of ParticleNet is adapted from the ParT implementation using the weaver package.
- src/AdamMCMC.py defines our AdamMCMC implementation, which can be used in place of your usual PyTorch optimizer
- src/MCMC_weaver_util.py wraps the weaver training code for use with the MCMC methods
- train_METHOD.py can be used for network training or sampling
- eval.py calculates the network output for multiple weight samples
- test.ipynb and src/compare_adammccm_sgHMC.ipynb are used for plotting
A full instructive example of converting a PyTorch training to AdamMCMC sampling is provided separately at https://github.com/sbieringer/how_to_bayesianise_your_NN.
import torch
import torch.nn as nn
import numpy as np
from tqdm import tqdm
# For the model
import normflows as nf
# For MCMC Bayesian
from src.AdamMCMC import MCMC_by_bp
# For the data
from sklearn.datasets import make_moons
# Define device and data
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data, _ = make_moons(4096, noise=0.05)
data = torch.from_numpy(data).float()
data = data.to(device)
# Define model
base = nf.distributions.base.DiagGaussian(2)
num_layers = 5
flows = []
for i in range(num_layers):
    param_map = nf.nets.MLP([1, 32, 32, 2], init_zeros=False)
    flows.append(nf.flows.AffineCouplingBlock(param_map))
    flows.append(nf.flows.Permute(2, mode='swap'))
MCMC_model = nf.NormalizingFlow(base, flows).to(device)
MCMC_model.device = device
# Initialize AdamMCMC
epochs = 10001
batchsize = len(data)
lr = 1e-3
temp = 1     # temperature parameter lambda
sigma = .02  # standard deviation of the proposal noise
loop_kwargs = {
    'MH': True,                  # a little more than 2x runtime, but necessary for the Metropolis-Hastings correction
    'verbose': epochs < 10,
    'fixed_batches': True,       # set to True so the loss is calculated twice per step; set to False only for batchsize = len(data)
    'sigma_adam_dir': 800,       # choose on the order of the number of parameters of the network
    'extended_doc_dict': False,
    'full_loss': None,           # a second loss function can be passed for exact MH corrections over the full data
}
optimizer = torch.optim.Adam(MCMC_model.parameters(), lr=lr, betas=(0.999, 0.999))
adamMCMC = MCMC_by_bp(MCMC_model, optimizer, temp, sigma)
flow_loss_epoch, acc_prob_epoch, accepted_epoch = np.zeros(epochs), np.zeros(epochs), np.zeros(epochs)
eps = tqdm(range(epochs))
for epoch in eps:
    optimizer.zero_grad()
    perm = torch.randperm(len(data)).to(device)
    for i_step in range((len(data)-1)//batchsize + 1):
        x = data[perm[i_step*batchsize:(i_step+1)*batchsize]]
        # the loss function needs to be passed as a callable
        flow_loss = lambda: -torch.sum(MCMC_model.log_prob(x))
        flow_loss_old, accept_prob, accepted, _, _ = adamMCMC.step(flow_loss, **loop_kwargs)
        flow_loss_epoch[epoch] += flow_loss_old.numpy(force=True)/len(x)
        acc_prob_epoch[epoch] = accept_prob
        accepted_epoch[epoch] = accepted

    # save the ensemble after some burn-in time (to converge) in sufficiently large intervals;
    # if you loaded a pretrained model, you can also reduce or skip the burn-in
    if epoch > 4999 and epoch % 1000 == 0:
        torch.save(MCMC_model.state_dict(), f"./models/MCMC_model_{epoch}.pth")

    eps.set_postfix({'flow_loss': flow_loss_old.item()/len(x), 'accept_prob': accept_prob})
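The weight samples saved above form an ensemble of networks. As a minimal sketch of how they could be evaluated (this is not the repository's eval.py; the test data and glob pattern are placeholders matching the example above):

```python
import glob

# placeholder test data on the same device as the model
x_test, _ = make_moons(1024, noise=0.05)
x_test = torch.from_numpy(x_test).float().to(device)

outputs = []
with torch.no_grad():
    for path in sorted(glob.glob("./models/MCMC_model_*.pth")):
        MCMC_model.load_state_dict(torch.load(path, map_location=device))
        outputs.append(MCMC_model.log_prob(x_test))      # one output per weight sample

outputs = torch.stack(outputs)        # shape: (n_weight_samples, n_test_points)
log_prob_mean = outputs.mean(dim=0)   # ensemble mean prediction
log_prob_std = outputs.std(dim=0)     # spread over weight samples, i.e. an uncertainty estimate
```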
- train_adam.py:
  - beta1_adam: $\beta_1 = \beta_2$ running average parameter of the first- and second-order momentum of Adam (default=0.99)
  - batchsize is fixed at $512$ and lr at $10^{-3}$
- train_sgHMC.py:
  - lr: learning rate (default=$10^{-2}$)
  - C: friction term of sgHMC
  - resample_mom: enables momentum resampling
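As a rough illustration of these arguments, here is a stochastic-gradient HMC update in one common parameterization. This is not the code used by train_sgHMC.py; grad_fn is a placeholder for a stochastic gradient of the negative log-posterior, and the exact scaling of friction, noise and momentum resampling depends on the chosen convention.

```python
import torch

def sghmc_step(theta, mom, grad_fn, lr=1e-2, C=0.05, resample_mom=False):
    """One stochastic-gradient HMC update with friction C (one common
    parameterization; conventions differ between implementations)."""
    if mom is None or resample_mom:
        # refresh the momentum; the exact resampling scale depends on the convention
        mom = lr**0.5 * torch.randn_like(theta)
    noise = (2.0 * C * lr) ** 0.5 * torch.randn_like(theta)
    mom = (1.0 - C) * mom - lr * grad_fn(theta) + noise   # friction damps the momentum
    return theta + mom, mom
```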
- train_MCMC.py:
  - temp: temperature parameter as described in the paper
  - sigma: standard deviation of the proposal distribution, $\sigma =$ sigma$/\sqrt{\sharp \vartheta}$
  - sigma_adam_dir_denom: covariance factor in update direction, $\sigma_\Delta =$ sigma_adam_dir_denom$/\sqrt{\sharp \vartheta}$
  - optim_str: "Adam" or "SGD", sets the PyTorch optimizer used for calculating the update steps
  - beta1_adam: $\beta_1 = \beta_2$ running average parameter of the first- and second-order momentum of Adam (default=0.99)
  - bs: batchsize (default=512)
  - lr: learning rate (default=$10^{-3}$)
  - full_loss: enables using the loss calculated on the full set of data for the Metropolis-Hastings correction
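For illustration, the scaling of the two proposal widths described above can be written out as follows; the variable names and values are hypothetical stand-ins for the script arguments, not part of the repository.

```python
import torch.nn as nn

# hypothetical values standing in for the train_MCMC.py arguments
sigma = 0.02
sigma_adam_dir_denom = 800.0
model = nn.Linear(10, 10)                                # placeholder network

n_params = sum(p.numel() for p in model.parameters())    # number of parameters, sharp theta
sigma_proposal = sigma / n_params**0.5                   # sigma / sqrt(#theta)
sigma_update_dir = sigma_adam_dir_denom / n_params**0.5  # sigma_adam_dir_denom / sqrt(#theta)
```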
For more details, see our publication "AdamMCMC: Combining Metropolis Adjusted Langevin with Momentum-based Optimization":
@unpublished{Bieringer_2023_adammcmc,
    author = "Bieringer, Sebastian and Kasieczka, Gregor and Steffen, Maximilian F. and Trabs, Mathias",
    title = "{AdamMCMC: Combining Metropolis Adjusted Langevin with Momentum-based Optimization}",
    eprint = "2312.14027",
    archivePrefix = "arXiv",
    primaryClass = "stat.ML",
    month = "12",
    year = "2023"
}