This project implements a framework for training Latent Diffusion Models (LDMs) with disentangled representations. Specifically, we separate the input data into shared (`c`) and modality-specific (`uv`) components, and model the generation process in the latent space using VAEs and diffusion models.
The repository is organized as follows:

```
.
├── vae_uv/                  # VAE for the modality-specific (uv) representation
│   └── vae_train.py         # Training script for the VAE
│
├── diffusion_uv/            # Diffusion model operating on the uv and c parts
│   └── diffusion_train.py   # Training script for latent diffusion
│
├── utils/                   # Utility functions (optional)
├── data/                    # Dataset files (if any)
└── README.md                # This file
```
Two types of representation are used:

- `uv`: the modality-specific representation extracted from each modality.
- `c`: the shared (common) representation across modalities.

These representations are first learned using a VAE and then used for conditional or unconditional diffusion model training, as sketched below.
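As a rough illustration of this split, the sketch below encodes an input into its `c` and `uv` parts; `DisentangledEncoder` and all dimensions here are hypothetical placeholders, not classes from this repository.

```python
import torch
import torch.nn as nn

# Hypothetical encoder that splits an input into a shared part c and a
# modality-specific part uv (dimensions chosen arbitrarily for illustration).
class DisentangledEncoder(nn.Module):
    def __init__(self, in_dim=256, c_dim=64, uv_dim=64):
        super().__init__()
        self.to_c = nn.Linear(in_dim, c_dim)    # shared (common) component
        self.to_uv = nn.Linear(in_dim, uv_dim)  # modality-specific component

    def forward(self, x):
        return self.to_c(x), self.to_uv(x)

x = torch.randn(8, 256)            # a batch of flattened inputs
c, uv = DisentangledEncoder()(x)   # c: shared, uv: modality-specific
print(c.shape, uv.shape)           # torch.Size([8, 64]) torch.Size([8, 64])
```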
The VAE is trained to encode and reconstruct the `uv` (modality-specific) latent variables.

```bash
cd vae_uv
python vae_train.py
```

Make sure to modify `vae_train.py` to point to your dataset and to adjust hyperparameters if needed.
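For orientation, a minimal, self-contained sketch of what a VAE training step on the `uv` data might look like is shown below; `UVVAE`, the dimensions, and the loss weighting are illustrative assumptions, and the actual architecture and objective live in `vae_train.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal VAE sketch for the uv part: encode, reparameterize, reconstruct.
class UVVAE(nn.Module):
    def __init__(self, in_dim=256, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * z_dim)   # outputs mean and log-variance
        self.dec = nn.Linear(z_dim, in_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

model = UVVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

uv_batch = torch.randn(8, 256)                     # placeholder for a real uv batch
recon, mu, logvar = model(uv_batch)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = F.mse_loss(recon, uv_batch) + 1e-4 * kl     # reconstruction + KL terms
loss.backward()
opt.step()
```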
The diffusion model operates on the latent space defined by `uv`, optionally conditioned on `c`.

```bash
cd diffusion_uv
python diffusion_train.py
```

Ensure that:

- The pretrained VAE is properly loaded so that the input can be encoded into the latent space.
- The `c` (shared) part is extracted and passed as a condition (if applicable), as shown in the sketch below.
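As a rough illustration of conditional latent diffusion training, the sketch below predicts the noise added to a `uv` latent given `c` and a timestep; `DenoiserNet`, the toy noise schedule, and all dimensions are assumptions, not the actual components of `diffusion_train.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative denoiser: predicts the noise added to the uv latent,
# conditioned on the shared representation c and the timestep t.
class DenoiserNet(nn.Module):
    def __init__(self, z_dim=32, c_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + c_dim + 1, 128),
                                 nn.SiLU(),
                                 nn.Linear(128, z_dim))

    def forward(self, z_t, c, t):
        return self.net(torch.cat([z_t, c, t[:, None]], dim=-1))

denoiser = DenoiserNet()
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

z0 = torch.randn(8, 32)       # uv latents from the (frozen) pretrained VAE encoder
c = torch.randn(8, 64)        # shared representation used as the condition
t = torch.rand(8)             # continuous timestep in [0, 1)
alpha = (1.0 - t)[:, None]    # toy linear noise schedule, for illustration only

noise = torch.randn_like(z0)
z_t = alpha.sqrt() * z0 + (1.0 - alpha).sqrt() * noise   # noised latent
loss = F.mse_loss(denoiser(z_t, c, t), noise)            # noise-prediction objective
loss.backward()
opt.step()
```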
Requirements:

- Python 3.10.16
- Other common libraries: `numpy`, `tqdm`, `torchvision`, etc.

You can install them via:

```bash
pip install -r requirements.txt
```
Notes:

- Make sure the output of `vae_train.py` (typically the latent representations) is saved and then used in `diffusion_train.py`; a minimal saving/loading sketch follows below.
- You can customize the encoder-decoder architecture in `vae_uv/`, or the diffusion configuration in `diffusion_uv/`.
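For example, one simple way to hand the latents from the VAE stage to the diffusion stage is to serialize them with `torch.save`; the file name, the placeholder encoder, and the tensor shapes below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Placeholder encoder standing in for the trained VAE from vae_train.py.
vae_encoder = nn.Linear(256, 32)
uv_data = torch.randn(100, 256)             # placeholder for the real uv dataset

# After VAE training: encode the dataset and save the latents to disk.
with torch.no_grad():
    latents = vae_encoder(uv_data)
torch.save(latents, "uv_latents.pt")

# In diffusion_train.py: load the saved latents and train the diffusion model on them.
latents = torch.load("uv_latents.pt")
print(latents.shape)                        # torch.Size([100, 32])
```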
The code is based on https://github.com/mueller-franzes/medfusion