MiniSora Community

👋 join us on WeChat

MiniSora is an open-source, community-driven initiative organized spontaneously by its members. It aims to explore how Sora could be implemented and where the technology is heading.

We will hold regular round-table discussions with the Sora team and the community to explore possibilities.
We will delve into existing technological pathways for video generation.
We will lead the replication of papers and research results related to Sora, such as DiT (MiniSora-DiT).
We will conduct a comprehensive review of Sora-related technologies and their implementations, namely "From DDPM to Sora: A Review of Video Generation Models Based on Diffusion Models".

Hot News

Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
MiniSora-DiT: Reproducing the DiT Paper with XTuner
Introduction of MiniSora and Latest Progress in Replicating Sora

Reproduction Group of MiniSora Community

Sora Reproduction Goals of MiniSora

GPU-Friendly: Ideally, training and inference should have modest requirements for GPU memory and GPU count, for example running on 8 A100 80G cards, 8 A6000 48G cards, or RTX4090 24G.
Training-Efficiency: It should achieve good results without requiring extensive training time.
Inference-Efficiency: Generated videos do not need to be long or high-resolution; 3-10 seconds at 480p is acceptable.
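To make the inference target concrete, here is a rough back-of-the-envelope sketch of how many transformer tokens a clip in that range could produce under a DiT-style latent patchify scheme. Every numeric setting below (8 sampled frames per second, 854x480 frames, an 8x spatial VAE downsample, 2x2 latent patches, no temporal compression) is an illustrative assumption, not MiniSora's actual configuration, and `video_token_count` is a hypothetical helper.

```python
import math


def video_token_count(seconds, fps=8, width=854, height=480,
                      vae_downsample=8, patch_size=2, temporal_stride=1):
    """Estimate the token sequence length for a latent-patchified video.

    All defaults are assumptions chosen for illustration only.
    """
    # Number of frames that remain after optional temporal striding.
    frames = math.ceil(seconds * fps / temporal_stride)
    # Each token covers (vae_downsample * patch_size) pixels per side.
    stride = vae_downsample * patch_size
    tokens_w = math.ceil(width / stride)   # pad partial patches up
    tokens_h = math.ceil(height / stride)
    return frames * tokens_h * tokens_w


if __name__ == "__main__":
    for seconds in (3, 10):
        print(f"{seconds:>2} s at 480p -> {video_token_count(seconds):,} tokens")
```

Under these assumptions the sequence length grows linearly with clip duration, which is one reason short, low-resolution clips are a practical target for limited GPU memory.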

MiniSora-DiT: Reproducing the DiT Paper with XTuner

https://github.com/mini-sora/minisora-DiT

Requirements

We are recruiting MiniSora Community contributors to reproduce DiT using XTuner.

Contributors should ideally have:

Familiarity with the OpenMMLab MMEngine mechanism.
Familiarity with DiT.

Background

The first author of DiT, William Peebles, is also one of the leads of the Sora project.
XTuner has the core technology to efficiently train sequences of length 1000K.

Support

Computational resources: 2*A100.
Strong support from XTuner core developer @pppppM.

Recent round-table Discussions

Interpretation of the Stable Diffusion 3 Paper: MM-DiT

Speaker: MMagic Core Contributors

Live Streaming Time: 03/12 20:00

Highlights: MMagic core contributors will lead us in interpreting the Stable Diffusion 3 paper, discussing the architecture details and design principles of Stable Diffusion 3.

PPT: FeiShu Link

Highlights from Previous Discussions

Night Talk with Sora: Video Diffusion Overview

ZhiHu Notes: A Survey on Generative Diffusion Model: An Overview of Generative Diffusion Models

Paper Reading Program

Sora: Creating video from text
Technical Report: Video generation models as world simulators
Latte: Latte: Latent Diffusion Transformer for Video Generation
- Latte Paper Interpretation (zh-CN), ZhiHu(zh-CN)
DiT: Scalable Diffusion Models with Transformers
Stable Cascade (ICLR 24 Paper): Würstchen: An efficient architecture for large-scale text-to-image diffusion models
Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- SD3 Paper Interpretation (zh-CN), ZhiHu(zh-CN)
Updating...

Recruitment of Presenters

Related Work

01 Diffusion Model
02 Diffusion Transformer
03 Baseline Video Generation Models
04 Video Generation
05 Dataset
06 Patchifying Methods
07 Long-context
08 Audio Related Resource
09 Consistency
10 Prompt Engineering
11 Security
12 World Model
13 Video Compression
14 Mamba
- 14.1 Theoretical Foundations and Model Architecture
- 14.2 Image Generation and Visual Applications
- 14.3 Video Processing and Understanding
- 14.4 Medical Image Processing
15 Existing high-quality resources

Diffusion Models
Paper	Link
1) Guided-Diffusion: Diffusion Models Beat GANs on Image Synthesis	NeurIPS 21 Paper, GitHub
2) Latent Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models	CVPR 22 Paper, GitHub
3) EDM: Elucidating the Design Space of Diffusion-Based Generative Models	NeurIPS 22 Paper, GitHub
4) DDPM: Denoising Diffusion Probabilistic Models	NeurIPS 20 Paper, GitHub
5) DDIM: Denoising Diffusion Implicit Models	ICLR 21 Paper, GitHub
6) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations	ICLR 21 Paper, GitHub, Blog
7) Stable Cascade: Würstchen: An efficient architecture for large-scale text-to-image diffusion models	ICLR 24 Paper, GitHub, Blog
8) Diffusion Models in Vision: A Survey	TPAMI 23 Paper, GitHub
9) Improved DDPM: Improved Denoising Diffusion Probabilistic Models	ICML 21 Paper, Github
Diffusion Transformer
Paper	Link
1) UViT: All are Worth Words: A ViT Backbone for Diffusion Models	CVPR 23 Paper, GitHub, ModelScope
2) DiT: Scalable Diffusion Models with Transformers	ICCV 23 Paper, GitHub, Project, ModelScope
3) SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers	Paper, GitHub, ModelScope
4) FiT: Flexible Vision Transformer for Diffusion Model	Paper, GitHub
5) k-diffusion: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers	Paper, GitHub
6) Large-DiT: Large Diffusion Transformer	GitHub
7) VisionLLaMA: A Unified LLaMA Interface for Vision Tasks	Paper, GitHub
8) Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis	Paper, Blog
9) PIXART-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation	Paper, Project
10) PIXART-α: Fast Training of Diffusion Transformer for Photorealistic Text-To-Image Synthesis	Paper, GitHub
11) PIXART-δ: Fast and Controllable Image Generation With Latent Consistency Model	Paper
Baseline Video Generation Models
Paper	Link
1) ViViT: A Video Vision Transformer	ICCV 21 Paper, GitHub
2) VideoLDM: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models	CVPR 23 Paper
3) DiT: Scalable Diffusion Models with Transformers	ICCV 23 Paper, Github, Project, ModelScope
4) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators	Paper, GitHub
5) Latte: Latent Diffusion Transformer for Video Generation	Paper, GitHub, Project
Video Generation
Paper	Link
1) Animatediff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning	ICLR 24 Paper, GitHub, ModelScope
2) I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models	Paper, GitHub, ModelScope
3) Imagen Video: High Definition Video Generation with Diffusion Models	Paper
4) MoCoGAN: Decomposing Motion and Content for Video Generation	CVPR 18 Paper
5) Adversarial Video Generation on Complex Datasets	Paper
6) W.A.L.T: Photorealistic Video Generation with Diffusion Models	Paper, Project
7) VideoGPT: Video Generation using VQ-VAE and Transformers	Paper, GitHub
8) Video Diffusion Models	Paper, GitHub, Project
9) MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation	NeurIPS 22 Paper, GitHub, Project, Blog
10) VideoPoet: A Large Language Model for Zero-Shot Video Generation	Paper, Project, Blog
11) MAGVIT: Masked Generative Video Transformer	CVPR 23 Paper, GitHub, Project, Colab
12) EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions	Paper, GitHub, Project
13) SimDA: Simple Diffusion Adapter for Efficient Video Generation	Paper, GitHub, Project
14) StableVideo: Text-driven Consistency-aware Diffusion Video Editing	ICCV 23 Paper, GitHub, Project
15) SVD: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets	Paper, GitHub
16) ADD: Adversarial Diffusion Distillation	Paper, GitHub
17) GenTron: Diffusion Transformers for Image and Video Generation	CVPR 24 Paper, Project
18) LFDM: Conditional Image-to-Video Generation with Latent Flow Diffusion Models	CVPR 23 Paper, GitHub
19) MotionDirector: Motion Customization of Text-to-Video Diffusion Models	Paper, GitHub
20) TGAN-ODE: Latent Neural Differential Equations for Video Generation	Paper, GitHub
21) VideoCrafter1: Open Diffusion Models for High-Quality Video Generation	Paper, GitHub
22) VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models	Paper, GitHub
23) LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation	Paper, GitHub
24) LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models	Paper, GitHub, Project
25) PYoCo: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models	ICCV 23 Paper, Project
26) VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation	CVPR 23 Paper
Dataset
Dataset Name - Paper	Link
1) Panda-70M - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers `70M Clips, 720P, Downloadable`	CVPR 24 Paper, Github, Project
2) InternVid-10M - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation `10M Clips, 720P, Downloadable`	ArXiv 24 Paper, Github
3) CelebV-Text - CelebV-Text: A Large-Scale Facial Text-Video Dataset `70K Clips, 720P, Downloadable`	CVPR 23 Paper, Github, Project
4) HD-VG-130M - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation `130M Clips, 720P, Downloadable`	ArXiv 23 Paper, Github, Tool
5) HD-VILA-100M - Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions `100M Clips, 720P, Downloadable`	CVPR 22 Paper, Github
6) VideoCC - Learning Audio-Video Modalities from Image Captions `10.3M Clips, 720P, Downloadable`	ECCV 22 Paper, Github
7) YT-Temporal-180M - MERLOT: Multimodal Neural Script Knowledge Models `180M Clips, 480P, Downloadable`	NeurIPS 21 Paper, Github, Project
8) HowTo100M - HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips `136M Clips, 240P, Downloadable`	ICCV 19 Paper, Github, Project
9) UCF101 - UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild `13K Clips, 240P, Downloadable`	CVPR 12 Paper, Project
10) MSVD - Collecting Highly Parallel Data for Paraphrase Evaluation `122K Clips, 240P, Downloadable`	ACL 11 Paper, Project
11) Fashion-Text2Video - A human video dataset with rich label and text annotations `600 Videos, 480P, Downloadable`	ArXiv 23 Paper, Project
12) LAION-5B - A dataset of 5.85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M `5.85B image-text pairs, Downloadable`	NeurIPS 22 Paper, Project
13) ActivityNet Captions - ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time `20k videos, Downloadable`	Arxiv 17 Paper, Project
14) MSR-VTT - A large-scale video benchmark for video understanding `10k Clips, Downloadable`	CVPR 16 Paper, Project
15) The Cityscapes Dataset - Benchmark suite and evaluation server for pixel-level, instance-level, and panoptic semantic labeling `Downloadable`	Arxiv 16 Paper, Project
16) Youku-mPLUG - First open-source large-scale Chinese video text dataset `Downloadable`	Arxiv 23 Paper, Project
17) VidProM - VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models `6.69M, Downloadable`	Arxiv 24 Paper, Github
18) Pixabay100 - A video dataset collected from Pixabay `Downloadable`	Github
NMNP: Nice method, not public
1) WebVid - Large-scale text-video dataset, containing 10 million video-text pairs scraped from the stock footage sites `10M video-text pairs`	Arxiv 21 Paper, Project
Video Augmentation Methods
Basic Transformations
Temporal Transformations
Three-stream CNNs for action recognition	PRL 17 Paper
Dynamic Hand Gesture Recognition Using Multi-direction 3D Convolutional Neural Networks	EL 19 Paper
Image-level Augmentation
Intra-clip Aggregation for Video Person Re-identification	ICIP 20 Paper
Frame Mixing
VideoMix: Rethinking Data Augmentation for Video Classification	CVPR 20 Paper
mixup: Beyond Empirical Risk Minimization	ICLR 17 Paper
CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features	ICCV 19 Paper
Optical flow warping
Video Salient Object Detection via Fully Convolutional Networks	ICIP 18 Paper
Illumination Changing
Illumination-Based Data Augmentation for Robust Background Subtraction	SKIMA 19 Paper
Image editing-based data augmentation for illumination-insensitive background subtraction	EIM 20 Paper
Simulated fisheye model
Universal Semantic Segmentation for Fisheye Urban Driving Images	SMC 20 Paper
Feature Space
Feature Re-Learning with Data Augmentation for Content-based Video Recommendation	ACM 18 Paper
GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer	Trans 21 Paper
GAN-based Augmentation
Deep Video-Based Performance Cloning	CVPR 18 Paper
Adversarial Action Data Augmentation for Similar Gesture Action Recognition	IJCNN 19 Paper
Self-Paced Video Data Augmentation by Generative Adversarial Networks with Insufficient Samples	MM 20 Paper
GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer	Trans 20 Paper
Dynamic Facial Expression Generation on Hilbert Hypersphere With Conditional Wasserstein Generative Adversarial Nets	TPAMI 20 Paper
CrowdGAN: Identity-Free Interactive Crowd Video Generation and Beyond	TPAMI 22 Paper
Encoder/Decoder Based
Rotationally-Temporally Consistent Novel View Synthesis of Human Performance Video	ECCV 20 Paper
Autoencoder-based Data Augmentation for Deepfake Detection	ACM 23 Paper
Simulation
Unreal Engine (UE) based
A data augmentation methodology for training machine/deep learning gait recognition algorithms	CVPR 16 Paper
ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare Applications	IEEE 21 Paper
Mid-Air: A Multi-Modal Dataset for Extremely Low Altitude Drone Flights	CVPR 19 Paper
Unity based
Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models	IJCV 19 Paper
Using synthetic data for person tracking under adverse weather conditions	IVC 21 Paper
GTA Game based
Unlimited Road-scene Synthetic Annotation (URSA) Dataset	ITSC 18 Paper
SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data	CVPR 21 Paper
Patchifying Methods
Paper	Link
1) ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale	ICLR 21 Paper, Github
2) MAE: Masked Autoencoders Are Scalable Vision Learners	CVPR 22 Paper, Github
3) ViViT: A Video Vision Transformer (-)	ICCV 21 Paper, GitHub
4) DiT: Scalable Diffusion Models with Transformers (-)	ICCV 23 Paper, GitHub, Project, ModelScope
5) U-ViT: All are Worth Words: A ViT Backbone for Diffusion Models (-)	CVPR 23 Paper, GitHub, ModelScope
6) FlexiViT: One Model for All Patch Sizes	Paper, Github
7) Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution	Paper, Github
8) VQ-VAE: Neural Discrete Representation Learning	Paper, Github
9) VQ-GAN: Taming Transformers for High-Resolution Image Synthesis	CVPR 21 Paper, Github
10) LVT: Latent Video Transformer	Paper, Github
11) VideoGPT: Video Generation using VQ-VAE and Transformers (-)	Paper, GitHub
12) Predicting Video with VQVAE	Paper
13) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers	ICLR 23 Paper, Github
14) TATS: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer	ECCV 22 Paper, Github
15) MAGVIT: Masked Generative Video Transformer (-)	CVPR 23 Paper, GitHub, Project, Colab
16) MagViT2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation	ICLR 24 Paper, Github
17) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-)	Paper, Project, Blog
18) CLIP: Learning Transferable Visual Models From Natural Language Supervision	ICML 21 Paper, Github
19) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation	Paper, Github
20) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	Paper, Github
Long-context
Paper	Link
1) World Model on Million-Length Video And Language With RingAttention	Paper, GitHub
2) Ring Attention with Blockwise Transformers for Near-Infinite Context	Paper, GitHub
3) Extending LLMs' Context Window with 100 Samples	Paper, GitHub
4) Efficient Streaming Language Models with Attention Sinks	ICLR 24 Paper, GitHub
5) The What, Why, and How of Context Length Extension Techniques in Large Language Models – A Detailed Survey	Paper
6) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	CVPR 24 Paper, GitHub, Project
7) MemoryBank: Enhancing Large Language Models with Long-Term Memory	Paper, GitHub
Audio Related Resource
Paper	Link
1) Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion	Paper, Github, Blog
2) MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation	CVPR 23 Paper, GitHub
3) Pengi: An Audio Language Model for Audio Tasks	NeurIPS 23 Paper, GitHub
4) Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset	NeurIPS 23 Paper, GitHub
5) Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration	Paper, GitHub
6) NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality	TPAMI 24 Paper, GitHub
7) NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers	ICLR 24 Paper, GitHub
8) UniAudio: An Audio Foundation Model Toward Universal Audio Generation	Paper, GitHub
9) Diffsound: Discrete Diffusion Model for Text-to-sound Generation	TASLP 22 Paper
10) AudioGen: Textually Guided Audio Generation	ICLR 23 Paper, Project
11) AudioLDM: Text-to-audio generation with latent diffusion models	ICML 23 Paper, GitHub, Project, Huggingface
12) AudioLDM2: Learning Holistic Audio Generation with Self-supervised Pretraining	Paper, GitHub, Project, Huggingface
13) Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models	ICML 23 Paper, GitHub
14) Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation	Paper
15) TANGO: Text-to-audio generation using instruction-tuned LLM and latent diffusion model	Paper, GitHub, Project, Huggingface
16) AudioLM: a Language Modeling Approach to Audio Generation	Paper
17) AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head	Paper, GitHub
18) MusicGen: Simple and Controllable Music Generation	NeurIPS 23 Paper, GitHub
19) LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT	Paper
20) Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners	CVPR 24 Paper
21) Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	EMNLP 23 Paper
22) Audio-Visual LLM for Video Understanding	Paper
23) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-)	Paper, Project, Blog
Consistency
Paper	Link
1) Consistency Models	Paper, GitHub
2) Improved Techniques for Training Consistency Models	Paper
3) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations (-)	ICLR 21 Paper, GitHub, Blog
4) Improved Techniques for Training Score-Based Generative Models	NIPS 20 Paper, GitHub
5) Generative Modeling by Estimating Gradients of the Data Distribution	NIPS 19 Paper, GitHub
6) Maximum Likelihood Training of Score-Based Diffusion Models	NIPS 21 Paper, GitHub
7) Layered Neural Atlases for Consistent Video Editing	TOG 21 Paper, GitHub, Project
8) StableVideo: Text-driven Consistency-aware Diffusion Video Editing	ICCV 23 Paper, GitHub, Project
9) CoDeF: Content Deformation Fields for Temporally Consistent Video Processing	Paper, GitHub, Project
10) Sora Generates Videos with Stunning Geometrical Consistency	Paper, GitHub, Project
11) Efficient One-stage Video Object Detection by Exploiting Temporal Consistency	ECCV 22 Paper, GitHub
12) Bootstrap Motion Forecasting With Self-Consistent Constraints	ICCV 23 Paper
13) Enforcing Realism and Temporal Consistency for Large-Scale Video Inpainting	Paper
14) Enhancing Multi-Camera People Tracking with Anchor-Guided Clustering and Spatio-Temporal Consistency ID Re-Assignment	CVPRW 23 Paper, GitHub
15) Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing	Paper
16) Semi-Supervised Crowd Counting With Spatial Temporal Consistency and Pseudo-Label Filter	TCSVT 23 Paper
17) Spatio-temporal Consistency and Hierarchical Matching for Multi-Target Multi-Camera Vehicle Tracking	CVPRW 19 Paper
18) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning (-)	Paper
19) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM (-)	Paper
20) MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask	Paper
Prompt Engineering
Paper	Link
1) RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models	Paper, GitHub, Project
2) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs	Paper, GitHub
3) LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models	TMLR 23 Paper, GitHub
4) LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts	ICLR 24 Paper, GitHub
5) Progressive Text-to-Image Diffusion with Soft Latent Direction	Paper
6) Self-correcting LLM-controlled Diffusion Models	CVPR 24 Paper, GitHub
7) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation	MM 23 Paper
8) LayoutGPT: Compositional Visual Planning and Generation with Large Language Models	NeurIPS 23 Paper, GitHub
9) Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition	Paper, GitHub
10) InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions	Paper, GitHub
11) Controllable Text-to-Image Generation with GPT-4	Paper
12) LLM-grounded Video Diffusion Models	ICLR 24 Paper
13) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning	Paper
14) FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax	Paper, Github, Project
15) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM	Paper
16) Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator	NeurIPS 23 Paper
17) Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models	Paper
18) MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation	Paper
19) GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning	Paper
20) Multimodal Procedural Planning via Dual Text-Image Prompting	Paper, Github
21) InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists	ICLR 24 Paper, Github
22) DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback	Paper
23) TaleCrafter: Interactive Story Visualization with Multiple Characters	SIGGRAPH Asia 23 Paper
24) Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis	Paper, Github
25) COLE: A Hierarchical Generation Framework for Graphic Design	Paper
26) Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision	Paper
27) Vlogger: Make Your Dream A Vlog	CVPR 24 Paper, Github
28) GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting	Paper
29) MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion	Paper
30) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment	Paper, Github, Project
Recaption
Paper	Link
1) LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models	Paper, GitHub
2) Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation	Paper, GitHub
3) CoCa: Contrastive Captioners are Image-Text Foundation Models	Paper, Github
4) CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion	Paper
5) VideoChat: Chat-Centric Video Understanding	CVPR 24 Paper, Github
6) De-Diffusion Makes Text a Strong Cross-Modal Interface	Paper
7) HowToCaption: Prompting LLMs to Transform Video Annotations at Scale	Paper
8) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (-)	Paper, Github
9) SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data	Paper
10) LLMGA: Multimodal Large Language Model based Generation Assistant	Paper, Github
Security
Paper	Link
1) BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset	NeurIPS 23 Paper, Github, Project
2) LIMA: Less Is More for Alignment	NeurIPS 23 Paper
3) Jailbroken: How Does LLM Safety Training Fail?	NeurIPS 23 Paper
4) Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models	CVPR 23 Paper
5) Stable Bias: Evaluating Societal Representations in Diffusion Models	NeurIPS 23 Paper
6) Ablating Concepts in Text-to-Image Diffusion Models	ICCV 23 Paper, Project
7) Diffusion art or digital forgery? investigating data replication in diffusion models	ICCV 23 Paper, Project
8) Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks	ICCV 20 Paper
9) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks	ICML 20 Paper
10) A pilot study of query-free adversarial attack against stable diffusion	ICCV 23 Paper
11) Interpretable-Through-Prototypes Deepfake Detection for Diffusion Models	ICCV 23 Paper
12) Erasing Concepts from Diffusion Models	ICCV 23 Paper, Project
13) Threat Model-Agnostic Adversarial Defense using Diffusion Models	Paper
14) How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?	Paper, Github
15) Differentially Private Diffusion Models Generate Useful Synthetic Images	Paper
16) Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models	SIGSAC 23 Paper, Github
17) Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models	Paper, Github
18) Unified Concept Editing in Diffusion Models	WACV 24 Paper, Project
19) Diffusion Model Alignment Using Direct Preference Optimization	Paper
20) RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment	TMLR 23 Paper, Github
21) Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation	Paper, Github, Project
World Model
Paper	Link
1) NExT-GPT: Any-to-Any Multimodal LLM	Paper, GitHub
Video Compression
Paper	Link
1) H.261: Video codec for audiovisual services at p x 64 kbit/s	Paper
2) H.262: Information technology - Generic coding of moving pictures and associated audio information: Video	Paper
3) H.263: Video coding for low bit rate communication	Paper
4) H.264: Overview of the H.264/AVC video coding standard	Paper
5) H.265: Overview of the High Efficiency Video Coding (HEVC) Standard	Paper
6) H.266: Overview of the Versatile Video Coding (VVC) Standard and its Applications	Paper
7) DVC: An End-to-end Deep Video Compression Framework	CVPR 19 Paper, GitHub
8) OpenDVC: An Open Source Implementation of the DVC Video Compression Method	Paper, GitHub
9) HLVC: Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement	CVPR 20 Paper, Github
10) RLVC: Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model	J-STSP 21 Paper, Github
11) PLVC: Perceptual Learned Video Compression with Recurrent Conditional GAN	IJCAI 22 Paper, Github
12) ALVC: Advancing Learned Video Compression with In-loop Frame Prediction	T-CSVT 22 Paper, Github
13) DCVC: Deep Contextual Video Compression	NeurIPS 21 Paper, Github
14) DCVC-TCM: Temporal Context Mining for Learned Video Compression	TM 22 Paper, Github
15) DCVC-HEM: Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression	MM 22 Paper, Github
16) DCVC-DC: Neural Video Compression with Diverse Contexts	CVPR 23 Paper, Github
17) DCVC-FM: Neural Video Compression with Feature Modulation	CVPR 24 Paper, Github
18) SSF: Scale-Space Flow for End-to-End Optimized Video Compression	CVPR 20 Paper, Github
Mamba
Theoretical Foundations and Model Architecture
Paper	Link
1) Mamba: Linear-Time Sequence Modeling with Selective State Spaces	Paper, Github
2) Efficiently Modeling Long Sequences with Structured State Spaces	ICLR 22 Paper, Github
3) Modeling Sequences with Structured State Spaces	Paper
4) Long Range Language Modeling via Gated State Spaces	Paper, GitHub
Image Generation and Visual Applications
Paper	Link
1) Diffusion Models Without Attention	Paper
2) Pan-Mamba: Effective Pan-Sharpening with State Space Model	Paper, Github
3) Pretraining Without Attention	Paper, Github
4) Block-State Transformers	NIPS 23 Paper
5) Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model	Paper, Github
6) VMamba: Visual State Space Model	Paper, Github
Video Processing and Understanding
Paper	Link
1) Long Movie Clip Classification with State-Space Video Models	ECCV 22 Paper, Github
2) Selective Structured State-Spaces for Long-Form Video Understanding	CVPR 23 Paper
3) Efficient Movie Scene Detection Using State-Space Transformers	CVPR 23 Paper, Github
4) VideoMamba: State Space Model for Efficient Video Understanding	Paper, Github
Medical Image Processing
Paper	Link
1) Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining	Paper, Github
2) MambaIR: A Simple Baseline for Image Restoration with State-Space Model	Paper, Github
3) VM-UNet: Vision Mamba UNet for Medical Image Segmentation	Paper, Github

Existing high-quality resources
Resources	Link
1) Datawhale - AI Video Generation Learning	Feishu doc
2) A Survey on Generative Diffusion Model	TKDE 24 Paper, GitHub
3) Awesome-Video-Diffusion-Models: A Survey on Video Diffusion Models	Paper, GitHub
4) Awesome-Text-To-Video: A Survey on Text-to-Video Generation/Synthesis	GitHub
5) video-generation-survey: A reading list of video generation	GitHub
6) Awesome-Video-Diffusion	GitHub
7) Video Generation Task in Papers With Code	Task
8) Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models	Paper, GitHub
9) Open-Sora-Plan (PKU-YuanGroup)	GitHub
10) State of the Art on Diffusion Models for Visual Computing	Paper
11) Diffusion Models: A Comprehensive Survey of Methods and Applications	CSUR 24 Paper, GitHub
12) Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable	Paper
13) On the Design Fundamentals of Diffusion Models: A Survey	Paper
14) Efficient Diffusion Models for Vision: A Survey	Paper
15) Text-to-Image Diffusion Models in Generative AI: A Survey	Paper
16) Awesome-Diffusion-Transformers	GitHub, Project
17) Open-Sora (HPC-AI Tech)	GitHub, Blog
18) LAVIS - A Library for Language-Vision Intelligence	ACL 23 Paper, GitHub, Project
19) OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference	GitHub
20) Awesome-Long-Context	GitHub1, GitHub2

Citation

If this project is helpful to your work, please cite it using the following format:

@misc{minisora,
    title={MiniSora},
    author={MiniSora Community},
    url={https://github.com/mini-sora/minisora},
    year={2024}
}

@misc{minisora_survey,
    title={Diffusion Model-based Video Generation Models From DDPM to Sora: A Survey},
    author={Survey Paper Group of MiniSora Community},
    url={https://github.com/mini-sora/minisora},
    year={2024}
}