- 2025.03: We released a survey paper "Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices". Feel free to cite or open pull requests.
Welcome to the repository for our survey paper, "Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices". This repository provides resources and updates related to our research. For a detailed introduction, please refer to our survey paper.
The recent timeline of efficient DMs, covering core methods and the release of open-source and closed-source reproduction projects.
This figure outlines the conceptual framework employed in our presentation of efficient diffusion models.
This figure compares the core features of mainstream diffusion-based generative models.
This figure outlines various adapters and their applications.
- Improving image generation with better captions [Paper]
- Plug-and-play diffusion features for text-driven image-to-image translation [Paper]
- Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis [Paper]
- SINE: Single image editing with text-to-image diffusion models [Paper]
- InstructPix2Pix: Learning to follow image editing instructions [Paper]
- Latent video diffusion models for high-fidelity video generation with arbitrary lengths [Paper]
- MagicVideo: Efficient Video Generation With Latent Diffusion Models [Paper]
- ModelScope Text-to-Video Technical Report [Paper]
- Stable Video Diffusion: Scaling latent video diffusion models to large datasets [Paper]
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation [Paper]
- VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models [Paper]
- StableVideo: Text-driven Consistency-aware Diffusion Video Editing [Paper]
- MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation [Paper]
- Lumiere: A Space-Time Diffusion Model for Video Generation [Paper]
- Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators [Paper]
- FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing [Paper]
- Dreamix: Video Diffusion Models are General Video Editors [Paper]
- ControlVideo: Training-free Controllable Text-to-Video Generation [Paper]
- Rerender a video: Zero-shot text-guided video-to-video translation [Paper]
- Dreamfusion: Text-to-3d using 2d diffusion [Paper]
- Mvdream: Multi-view diffusion for 3d generation [Paper]
- Magic3D: High-Resolution Text-to-3D Content Creation [Paper]
- HiFA: High-fidelity text-to-3D with advanced diffusion guidance [Paper]
- SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion [Paper]
- DDM2: Self-Supervised Diffusion MRI Denoising with Generative Diffusion Models [Paper]
- Solving Inverse Problems in Medical Imaging with Score-Based Generative Models [Paper]
- DiffWave: A versatile diffusion model for audio synthesis [[Paper]](https://arxiv.org/abs/2009.09761)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [Paper]
- Diffsound: Discrete Diffusion Model for Text-to-Sound Generation [Paper]
- Highly accurate protein structure prediction with AlphaFold [Paper]
- De novo design of protein structure and function with RFdiffusion [Paper]
- Antigen-specific antibody design and optimization with diffusion-based generative models for protein structures [Paper]
- A dual diffusion model enables 3D molecule generation and lead optimization based on target pockets [Paper]
- Diffdock: Diffusion steps, twists, and turns for molecular docking [Paper]
- Fast sampling of diffusion models via operator learning [Paper]
- Progressive distillation for fast sampling of diffusion models [Paper]
- Consistency Models [Paper]
- Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference [Paper]
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models [Paper]
- High-Resolution Image Synthesis With Latent Diffusion Models [Paper]
- Structure and content-guided video synthesis with diffusion models [Paper]
- Maximum likelihood training of implicit nonlinear diffusion model [Paper]
- Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data [Paper]
- Maximum likelihood training for score-based diffusion ODEs by high-order denoising score matching [Paper]
- Generalized deep 3d shape prior via part-discretized diffusion process [Paper]
- Vector quantized diffusion model for text-to-image synthesis [Paper]
- Understanding Diffusion Models: A Unified Perspective [Paper]
- Diffusion models in vision: A survey [Paper]
- Diffusion models: A comprehensive survey of methods and applications [Paper]
- A Survey on Generative Diffusion Model [Paper]
- Emergent abilities of large language models [Paper]
- GPT-4 Technical Report [Paper]
- Video generation models as world simulators [Online]
- Improved Denoising Diffusion Probabilistic Models [Paper]
- Score-Based Generative Modeling through Stochastic Differential Equations [Paper]
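The two entries above fix the notation used throughout the survey: DDPM-style models define a forward noising process whose marginal q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I) can be sampled in closed form, which is what makes training tractable. A minimal NumPy sketch (the linear beta schedule and its endpoints are common defaults, not values taken from these papers):

```python
import numpy as np

# Illustrative linear beta schedule in the DDPM style (endpoints are common defaults).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)  # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Draw x_t ~ q(x_t | x_0) in one shot via the closed-form Gaussian marginal."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
```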
- Taming Transformers for High-Resolution Image Synthesis [Paper]
- Adding Conditional Control to Text-to-Image Diffusion Models [Paper]
- Prompt-to-Prompt Image Editing with Cross Attention Control [Paper]
- Null-text inversion for editing real images using guided diffusion models [Paper]
- DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation [Paper]
- Imagic: Text-Based Real Image Editing with Diffusion Models [Paper]
- AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing [Paper]
- Cascaded diffusion models for high fidelity image generation [Paper]
- Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation [Paper]
- All are worth words: A ViT backbone for diffusion models [Paper]
- Denoising diffusion implicit models [Paper]
- Diffusion models beat GANs on image synthesis [Paper]
- Photorealistic text-to-image diffusion models with deep language understanding [Paper]
- Hierarchical text-conditional image generation with CLIP latents [Paper]
- CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers [Paper]
- SDXL: Improving latent diffusion models for high-resolution image synthesis [Paper]
- Scaling rectified flow transformers for high-resolution image synthesis [Paper]
- Scalable diffusion models with transformers [Paper]
- Neural Residual Diffusion Models for Deep Scalable Vision Generation [Paper]
- PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis [Paper]
- FiT: Flexible Vision Transformer for Diffusion Model [Paper]
- SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers [Paper]
- Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding [Paper]
- Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis [Online]
- Flux [Online]
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models [Paper]
- Open-Sora: Democratizing Efficient Video Production for All [Paper]
- Open-Sora: Democratizing Efficient Video Production for All [Online]
- EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture [Paper]
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [Paper]
- Movie gen: A cast of media foundation models [Online]
- Auto-Encoding Variational Bayes [Paper]
- Taming transformers for high-resolution image synthesis [Paper]
- Neural Discrete Representation Learning [Paper]
- Autoregressive Image Generation Using Residual Quantization [Paper]
- Imagen Video: High Definition Video Generation with Diffusion Models [Paper]
- One transformer fits all distributions in multi-modal diffusion at scale [Paper]
- LMD: Faster image reconstruction with latent masking diffusion [Paper]
- Make-A-Video: Text-to-Video Generation without Text-Video Data [Paper]
- Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation [Paper]
- MAGVIT: Masked generative video transformer [Paper]
- CV-VAE: A Compatible Video VAE for Latent Generative Video Models [Paper]
- Phenaki: Variable length video generation from open domain textual descriptions [Paper]
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation [Paper]
- U-Net: Convolutional networks for biomedical image segmentation [Paper]
- Score-Based Generative Modeling through Stochastic Differential Equations [Paper]
- Video diffusion models [Paper]
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [Paper]
- Improving language understanding by generative pre-training [Paper]
- GLAF: Global-to-Local Aggregation and Fission Network for Semantic Level Fact Verification [Paper]
- Intention reasoning network for multi-domain end-to-end task-oriented dialogue [Paper]
- An image is worth 16x16 words: Transformers for image recognition at scale [Paper]
- CMAL: A novel cross-modal associative learning framework for vision-language pre-training [Paper]
- UniTranSeR: A unified transformer semantic representation framework for multimodal task-oriented dialog system [Paper]
- HybridPrompt: Bridging Language Models and Human Priors in Prompt Tuning for Visual Question Answering [Paper]
- Generative pretraining from pixels [Paper]
- Zero-shot text-to-image generation [Paper]
- Exploring the limits of transfer learning with a unified text-to-text transformer [Paper]
- Modeling Sequences with Structured State Spaces [Paper]
- Exploring Adversarial Robustness of Deep State Space Models [Paper]
- HiPPO: Recurrent memory with optimal polynomial projections [Paper]
- Efficiently modeling long sequences with structured state spaces [Paper]
- Diagonal State Spaces are as Effective as Structured State Spaces [Paper]
- Mamba: Linear-time sequence modeling with selective state spaces [Paper]
- DiM: Diffusion Mamba for efficient high-resolution image synthesis [Paper]
- ZigMa: A DiT-style Zigzag Mamba Diffusion Model [Paper]
- Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models [Paper]
- RWKV: Reinventing RNNs for the Transformer Era [Paper]
- DiG: Scalable and efficient diffusion models with gated linear attention [Paper]
- Gated Linear Attention Transformers with Hardware-Efficient Training [Paper]
- An empirical study and analysis of text-to-image generation using large language model-powered textual representation [[Paper]](https://arxiv.org/abs/2405.12914)
- Learning transferable visual models from natural language supervision [Paper]
- BERT: Pre-training of deep bidirectional transformers for language understanding [Paper]
- AltDiffusion: A Multilingual Text-to-Image Diffusion Model [Paper]
- PAI-Diffusion: Constructing and Serving a Family of Open Chinese Diffusion Models for Text-to-image Synthesis on the Cloud [Paper]
- Learning transferable visual models from natural language supervision [Paper]
- Exploring the limits of transfer learning with a unified text-to-text transformer [Paper]
- Baichuan 2: Open large-scale language models [Paper]
- LLaMA: Open and efficient foundation language models [Paper]
- Llama 2: Open Foundation and Fine-Tuned Chat Models [Paper]
- GLM: General Language Model Pretraining with Autoregressive Blank Infilling [Paper]
- Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models [Paper]
- LoRA: Low-rank adaptation of large language models [Paper] (see the sketch after this list)
- SimDA: Simple Diffusion Adapter for Efficient Video Generation [Paper]
- I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models [Paper]
- T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models [Paper]
- AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning [Paper]
- eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [Paper]
- Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [Paper]
- Image-To-Image Translation With Conditional Adversarial Networks [Paper]
- ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems [Paper]
- ControlNeXt: Powerful and efficient control for image and video generation [Paper]
- ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback [Paper]
- IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models [Paper]
- Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model [Paper]
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [Paper]
- Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [Paper]
- It's All About Your Sketch: Democratising Sketch Control in Diffusion Models [Paper]
- Efficient parametrization of multi-domain deep neural networks [Paper]
- FaceChain-ImagineID: Freely crafting high-fidelity diverse talking faces from disentangled audio [Paper]
- X-Adapter: Adding universal compatibility of plugins for upgraded diffusion model [Paper]
- Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning [Paper]
- Measuring the intrinsic dimension of objective landscapes [Paper]
- LCM-LoRA: A Universal Stable-Diffusion Acceleration Module [Paper]
- LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models [Paper]
- DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing [Paper]
- Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models [Paper]
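Several of the entries above (LoRA, LCM-LoRA, LoRA-Composer, Concept Sliders) share one core mechanism: freeze the pretrained weight W and learn a low-rank residual ΔW = BA with rank r far below the layer's dimensions. A minimal PyTorch sketch, assuming a standard `nn.Linear` base layer; the class name and hyperparameters are illustrative, not taken from any of the papers:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank residual: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```

Because only A and B are trained, a fine-tune touches a small fraction of the parameters, and the learned update can be merged into W (or swapped out) after training.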
- Human Preference Score: Better Aligning Text-to-Image Models with Human Preference [Paper]
- Aligning text-to-image models using human feedback [Paper]
- ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation [Paper]
- Pick-a-pic: An open dataset of user preferences for text-to-image generation [Paper]
- RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment [Paper]
- Training diffusion models with reinforcement learning [Paper]
- Reinforcement learning for fine-tuning text-to-image diffusion models [Paper]
- Using human feedback to fine-tune diffusion models without any reward model [Paper]
- Diffusion model alignment using direct preference optimization [Paper]
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [Paper]
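The two entries directly above adapt Direct Preference Optimization (DPO) to diffusion models. For reference, the core DPO objective from the language-model paper trains the policy π_θ directly on preference pairs (x^w preferred over x^l given context c), against a frozen reference model π_ref and with no explicit reward model:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(c,\,x^{w},\,x^{l})}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_{\theta}(x^{w}\mid c)}{\pi_{\mathrm{ref}}(x^{w}\mid c)}
      - \beta \log \frac{\pi_{\theta}(x^{l}\mid c)}{\pi_{\mathrm{ref}}(x^{l}\mid c)}
    \right)
  \right]
```

Here σ is the logistic function and β controls how far the policy may drift from the reference; the diffusion variants replace the exact likelihoods with ELBO-based surrogates.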
- An image is worth one word: Personalizing text-to-image generation using textual inversion [Paper]
- ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation [Paper]
- PhotoMaker: Customizing realistic human photos via stacked ID embedding [Paper]
- HyperDreamBooth: HyperNetworks for fast personalization of text-to-image models [Paper]
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [Paper]
- InstantID: Zero-shot Identity-Preserving Generation in Seconds [Paper]
- OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models [Paper]
- Multi-concept customization of text-to-image diffusion [Paper]
- Mix-of-Show: Decentralized low-rank adaptation for multi-concept customization of diffusion models [Paper]
- MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation [Paper]
- Designing an encoder for fast personalization of text-to-image models [Paper]
- DreamTuner: Single image is enough for subject-driven generation [Paper]
- InstantBooth: Personalized text-to-image generation without test-time finetuning [Paper]
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation [Paper]
- Diffusion models: A comprehensive survey of methods and applications [Paper]
- Progressive distillation for fast sampling of diffusion models [Paper]
- On distillation of guided diffusion models [Paper]
- Adversarial Diffusion Distillation [Paper]
- Flow straight and fast: Learning to generate and transfer data with rectified flow [Paper]
- InstaFlow: One step is enough for high-quality diffusion-based text-to-image generation [Paper]
- UFOGen: You forward once large scale text-to-image generation via diffusion GANs [Paper]
- DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps [Paper]
- Elucidating the design space of diffusion-based generative models [Paper]
- Denoising diffusion implicit models [Paper] (see the sampling sketch below)
- Generative Modeling by Estimating Gradients of the Data Distribution [Paper]
- Adversarial score matching and improved sampling for image generation [Paper]
- Score-Based Generative Modeling with Critically-Damped Langevin Diffusion [Paper]
- Gotta Go Fast When Generating Data with Score-Based Models [Paper]
- Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction [Paper]
- Pseudo Numerical Methods for Diffusion Models on Manifolds [Paper]
- DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models [Paper]
- gDDIM: Generalized denoising diffusion implicit models [Paper]
- Fast sampling of diffusion models with exponential integrator [Paper]
- ReDi: Efficient Learning-Free Diffusion Inference via Trajectory Retrieval [Paper]
- GENIE: Higher-order denoising diffusion solvers [Paper]
- Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed [Paper]
- UniPC: A unified predictor-corrector framework for fast sampling of diffusion models [Paper]
- Accelerating diffusion sampling with optimized time steps [Paper]
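A common thread in the solver entries above (DDIM, DPM-Solver, PNDM, UniPC, and others) is replacing stochastic ancestral sampling with a deterministic ODE step, so far fewer denoising steps suffice. The simplest instance is the deterministic DDIM update (η = 0); a minimal sketch, with the cumulative ᾱ values and the model's noise prediction `eps` assumed given:

```python
import numpy as np

def ddim_step(x_t, eps, a_bar_t, a_bar_prev):
    """One deterministic DDIM update (eta = 0).
    eps is the model's noise prediction at x_t; a_bar_* are cumulative alphas."""
    x0_pred = (x_t - np.sqrt(1.0 - a_bar_t) * eps) / np.sqrt(a_bar_t)  # predicted clean sample
    return np.sqrt(a_bar_prev) * x0_pred + np.sqrt(1.0 - a_bar_prev) * eps
```

Higher-order solvers such as DPM-Solver and UniPC refine the same idea by integrating the probability-flow ODE with multistep or predictor-corrector schemes.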
- SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations [Paper]
- Uncovering the disentanglement capability in text-to-image diffusion models [Paper]
- Distilling the Knowledge in a Neural Network [Paper]
- One-step diffusion with distribution matching distillation [Paper]
- NVAE: A deep hierarchical variational autoencoder [Paper]
- Large Scale GAN Training for High Fidelity Natural Image Synthesis [Paper]
- Classifier-free diffusion guidance [Paper] (see the guidance sketch below)
- Variational Diffusion Models [Paper]
- Generative Adversarial Nets [Paper]
- Optimizing DDPM Sampling with Shortcut Fine-Tuning [Paper]
- PeRFlow: Piecewise rectified flow as universal plug-and-play accelerator [Paper]
- Flow matching for generative modeling [Paper]
- Stochastic Interpolants: A Unifying Framework for Flows and Diffusions [Paper]
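Classifier-free guidance, listed above, underpins conditional sampling in most of the text-to-image systems in this survey: the network is evaluated twice per step and the conditional noise prediction is extrapolated away from the unconditional one. A minimal sketch in the guidance-scale parametrization common in latent diffusion implementations (the default weight is illustrative):

```python
def guided_eps(eps_uncond, eps_cond, w=7.5):
    """Classifier-free guidance: extrapolate the conditional prediction away from
    the unconditional one. w = 0 is unconditional, w = 1 is purely conditional."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```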
- Fourier Neural Operator for Parametric Partial Differential Equations [Paper]
- Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation [Paper]
- AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning [Paper]
- Learning Universal Policies via Text-Guided Video Generation [Paper]
- DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models [Paper]
- GANs trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium [Paper]
- StyleGAN-NADA: CLIP-guided domain adaptation of image generators [Paper]
- DINOv2: Learning Robust Visual Features without Supervision [Paper]
- Which Training Methods for GANs do actually Converge? [Paper]
- Tackling the Generative Learning Trilemma with Denoising Diffusion GANs [Paper]
- Semi-Implicit Denoising Diffusion Models (SIDDMs) [Paper]
- Learning to Efficiently Sample from Diffusion Probabilistic Models [Paper]
- Learning fast samplers for diffusion models by differentiating through sample quality [Paper]
- Post-training quantization on diffusion models [Paper]
- PTQD: Accurate post-training quantization for diffusion models [Paper]
- Accelerating Diffusion Models via Early Stop of the Diffusion Process [Paper]
- Truncated Diffusion Probabilistic Models and Diffusion-based Adversarial Auto-Encoders [Paper]
- A style-based generator architecture for generative adversarial networks [Paper]
- Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations [Paper]
- SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds [Paper]
- MobileDiffusion: Instant text-to-image generation on mobile devices [Paper]
- DistriFusion: Distributed parallel inference for high-resolution diffusion models [Paper]
- PipeFusion: Displaced patch pipeline parallelism for inference of diffusion transformer models [Paper]
- DragonDiffusion: Enabling drag-style manipulation on diffusion models [Paper]
- AsyncDiff: Parallelizing diffusion models by asynchronous denoising [Paper]
- A Survey on Mixture of Experts [Paper]
- The evolution of mixture of experts: A survey from basics to breakthroughs
- Dynamic Diffusion Transformer [Paper]
- Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection [Paper]
- Fast inference from transformers via speculative decoding [Paper]
- Accelerating Large Language Model Decoding with Speculative Sampling [Paper] (see the sketch at the end of this list)
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [Paper]
- T-stitch: Accelerating sampling in pretrained diffusion models with trajectory stitching [Paper]
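The speculative-decoding entries above all rest on one accept/reject rule: a cheap draft model proposes tokens, the target model keeps each proposal with probability min(1, p/q), and on rejection it resamples from the normalized residual max(0, p − q), which provably preserves the target distribution. A minimal sketch over explicit probability vectors (names and shapes are illustrative):

```python
import numpy as np

def speculative_step(p, q, draft_token, rng=np.random.default_rng(0)):
    """One accept/reject step of speculative sampling.
    p: target-model distribution over the vocabulary, q: draft-model distribution,
    draft_token: the token index proposed by the draft model."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token                    # accept the cheap draft token
    residual = np.maximum(p - q, 0.0)         # on rejection, resample from the
    residual /= residual.sum()                # normalized residual distribution
    return int(rng.choice(len(p), p=residual))
```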
If you find this work useful, please consider citing us:
@article{ma2024efficient,
  title={Efficient diffusion models: A comprehensive survey from principles to practices},
  author={Ma, Zhiyuan and Zhang, Yuzhu and Jia, Guoli and Zhao, Liangliang and Ma, Yichao and Ma, Mingjie and Liu, Gaofeng and Zhang, Kaiyan and Li, Jianjun and Zhou, Bowen},
  journal={arXiv preprint arXiv:2410.11795},
  year={2024}
}
We would like to thank Qi'ang Hu for his contribution to this website, as well as all team members.