English | 简体中文
👋 join us on WeChat
The MiniSora open-source community is positioned as a community-driven initiative organized spontaneously by community members. The MiniSora community aims to explore the implementation path and future development direction of Sora.
- Regular round-table discussions will be held with the Sora team and the community to explore possibilities.
- We will delve into existing technological pathways for video generation.
- We will lead the replication of papers and research results related to Sora, such as DiT (MiniSora-DiT).
- We will conduct a comprehensive review of Sora-related technologies and their implementations, titled "From DDPM to Sora: A Review of Video Generation Models Based on Diffusion Models".
- Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- MiniSora-DiT: Reproducing the DiT Paper with XTuner
- Introduction of MiniSora and Latest Progress in Replicating Sora
- GPU-Friendly: Ideally, training and inference should have low requirements for GPU memory size and number of GPUs, for example 8 A100 80G cards, 8 A6000 48G cards, or RTX4090 24G.
- Training-Efficiency: It should achieve good results without requiring extensive training time.
- Inference-Efficiency: Videos generated at inference time do not need to be long or high-resolution; 3-10 seconds in length at 480p resolution is acceptable.
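To give a feel for what the 3-10 second, 480p target means for a diffusion transformer, the sketch below estimates how many tokens one clip would produce after latent compression and patchifying. It is only a back-of-the-envelope illustration: the 854-pixel width, 24 fps frame rate, 8x spatial VAE downsampling, absence of temporal compression, and 2x2 patch size are all assumptions chosen for the example, not settings taken from MiniSora or Sora.

```python
from dataclasses import dataclass


@dataclass
class VideoSpec:
    seconds: float
    height: int = 480
    width: int = 854   # assumed 16:9 width for 480p
    fps: int = 24      # assumed frame rate


def latent_token_count(spec, spatial_down=8, temporal_down=1, patch=2):
    """Rough number of transformer tokens for one clip.

    spatial_down and temporal_down are assumed VAE compression factors;
    patch is an assumed spatial patch size applied to the latent grid.
    """
    frames = int(spec.seconds * spec.fps) // temporal_down
    latent_h = spec.height // spatial_down
    latent_w = spec.width // spatial_down
    tokens_per_frame = (latent_h // patch) * (latent_w // patch)
    return frames * tokens_per_frame


if __name__ == "__main__":
    for seconds in (3, 10):
        tokens = latent_token_count(VideoSpec(seconds=seconds))
        print(f"{seconds:>2}s at 480p -> about {tokens:,} tokens")
```

Under these assumptions even a short clip yields a sequence in the hundreds of thousands of tokens, which is why long-context training and stronger temporal compression matter for staying within the GPU budgets listed above.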
MiniSora-DiT: Reproducing the DiT Paper with XTuner
https://github.com/mini-sora/minisora-DiT
We are recruiting MiniSora Community contributors to reproduce DiT using XTuner.
We hope community members have the following characteristics:
- Familiarity with the OpenMMLab MMEngine mechanism.
- Familiarity with DiT.
- The author of DiT is the same as the author of Sora.
- XTuner has the core technology to efficiently train sequences of length 1000K.
Speaker: MMagic Core Contributors
Live Streaming Time: 03/12 20:00
Highlights: MMagic core contributors will lead us in interpreting the Stable Diffusion 3 paper, discussing the architecture details and design principles of Stable Diffusion 3.
PPT: FeiShu Link
ZhiHu Notes: A Survey on Generative Diffusion Model: An Overview of Generative Diffusion Models
- Technical Report: Video generation models as world simulators
- Latte: Latte: Latent Diffusion Transformer for Video Generation
- Stable Cascade (ICLR 24 Paper): Würstchen: An efficient architecture for large-scale text-to-image diffusion models
- Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- Updating...
- 01 Diffusion Model
- 02 Diffusion Transformer
- 03 Baseline Video Generation Models
- 04 Video Generation
- 05 Dataset
- 06 Patchifying Methods
- 07 Long-context
- 08 Audio Related Resource
- 09 Consistency
- 10 Prompt Engineering
- 11 Security
- 12 World Model
- 13 Video Compression
- 14 Mamba
- 15 Existing high-quality resources
Paper | Link |
1) Guided-Diffusion: Diffusion Models Beat GANs on Image Synthesis | NeurIPS 21 Paper, GitHub |
2) Latent Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models | CVPR 22 Paper, GitHub |
3) EDM: Elucidating the Design Space of Diffusion-Based Generative Models | NeurIPS 22 Paper, GitHub |
4) DDPM: Denoising Diffusion Probabilistic Models | NeurIPS 20 Paper, GitHub |
5) DDIM: Denoising Diffusion Implicit Models | ICLR 21 Paper, GitHub |
6) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations | ICLR 21 Paper, GitHub, Blog |
7) Stable Cascade: Würstchen: An efficient architecture for large-scale text-to-image diffusion models | ICLR 24 Paper, GitHub, Blog |
8) Diffusion Models in Vision: A Survey | TPAMI 23 Paper, GitHub |
9) Improved DDPM: Improved Denoising Diffusion Probabilistic Models | ICML 21 Paper, Github |
Paper | Link |
1) UViT: All are Worth Words: A ViT Backbone for Diffusion Models | CVPR 23 Paper, GitHub, ModelScope |
2) DiT: Scalable Diffusion Models with Transformers | ICCV 23 Paper, GitHub, Project, ModelScope |
3) SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers | Paper, GitHub, ModelScope |
4) FiT: Flexible Vision Transformer for Diffusion Model | Paper, GitHub |
5) k-diffusion: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers | Paper, GitHub |
6) Large-DiT: Large Diffusion Transformer | GitHub |
7) VisionLLaMA: A Unified LLaMA Interface for Vision Tasks | Paper, GitHub |
8) Stable Diffusion 3: MM-DiT: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis | Paper, Blog |
9) PIXART-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | Paper, Project |
10) PIXART-α: Fast Training of Diffusion Transformer for Photorealistic Text-To-Image Synthesis | Paper, GitHub |
11) PIXART-δ: Fast and Controllable Image Generation With Latent Consistency Model | Paper |
Paper | Link |
1) ViViT: A Video Vision Transformer | ICCV 21 Paper, GitHub |
2) VideoLDM: Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | CVPR 23 Paper |
3) DiT: Scalable Diffusion Models with Transformers | ICCV 23 Paper, Github, Project, ModelScope |
4) Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators | Paper, GitHub |
5) Latte: Latent Diffusion Transformer for Video Generation | Paper, GitHub, Project |
Paper | Link |
1) Animatediff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning | ICLR 24 Paper, GitHub, ModelScope |
2) I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models | Paper, GitHub, ModelScope |
3) Imagen Video: High Definition Video Generation with Diffusion Models | Paper |
4) MoCoGAN: Decomposing Motion and Content for Video Generation | CVPR 18 Paper |
5) Adversarial Video Generation on Complex Datasets | Paper |
6) W.A.L.T: Photorealistic Video Generation with Diffusion Models | Paper, Project |
7) VideoGPT: Video Generation using VQ-VAE and Transformers | Paper, GitHub |
8) Video Diffusion Models | Paper, GitHub, Project |
9) MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation | NeurIPS 22 Paper, GitHub, Project, Blog |
10) VideoPoet: A Large Language Model for Zero-Shot Video Generation | Paper, Project, Blog |
11) MAGVIT: Masked Generative Video Transformer | CVPR 23 Paper, GitHub, Project, Colab |
12) EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions | Paper, GitHub, Project |
13) SimDA: Simple Diffusion Adapter for Efficient Video Generation | Paper, GitHub, Project |
14) StableVideo: Text-driven Consistency-aware Diffusion Video Editing | ICCV 23 Paper, GitHub, Project |
15) SVD: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | Paper, GitHub |
16) ADD: Adversarial Diffusion Distillation | Paper, GitHub |
17) GenTron: Diffusion Transformers for Image and Video Generation | CVPR 24 Paper, Project |
18) LFDM: Conditional Image-to-Video Generation with Latent Flow Diffusion Models | CVPR 23 Paper, GitHub |
19) MotionDirector: Motion Customization of Text-to-Video Diffusion Models | Paper, GitHub |
20) TGAN-ODE: Latent Neural Differential Equations for Video Generation | Paper, GitHub |
21) VideoCrafter1: Open Diffusion Models for High-Quality Video Generation | Paper, GitHub |
22) VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models | Paper, GitHub |
23) LVDM: Latent Video Diffusion Models for High-Fidelity Long Video Generation | Paper, GitHub |
24) LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models | Paper, GitHub, Project |
25) PYoCo: Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | ICCV 23 Paper, Project |
26) VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | CVPR 23 Paper |
Dataset Name - Paper | Link |
1) Panda-70M - Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (70M Clips, 720P, Downloadable) | CVPR 24 Paper, Github, Project |
2) InternVid-10M - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation (10M Clips, 720P, Downloadable) | ArXiv 24 Paper, Github |
3) CelebV-Text - CelebV-Text: A Large-Scale Facial Text-Video Dataset (70K Clips, 720P, Downloadable) | CVPR 23 Paper, Github, Project |
4) HD-VG-130M - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation (130M Clips, 720P, Downloadable) | ArXiv 23 Paper, Github, Tool |
5) HD-VILA-100M - Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions (100M Clips, 720P, Downloadable) | CVPR 22 Paper, Github |
6) VideoCC - Learning Audio-Video Modalities from Image Captions (10.3M Clips, 720P, Downloadable) | ECCV 22 Paper, Github |
7) YT-Temporal-180M - MERLOT: Multimodal Neural Script Knowledge Models (180M Clips, 480P, Downloadable) | NeurIPS 21 Paper, Github, Project |
8) HowTo100M - HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (136M Clips, 240P, Downloadable) | ICCV 19 Paper, Github, Project |
9) UCF101 - UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild (13K Clips, 240P, Downloadable) | CVPR 12 Paper, Project |
10) MSVD - Collecting Highly Parallel Data for Paraphrase Evaluation (122K Clips, 240P, Downloadable) | ACL 11 Paper, Project |
11) Fashion-Text2Video - A human video dataset with rich label and text annotations (600 Videos, 480P, Downloadable) | ArXiv 23 Paper, Project |
12) LAION-5B - A dataset of 5.85 billion CLIP-filtered image-text pairs, 14x bigger than LAION-400M (5B Clips, Downloadable) | NeurIPS 22 Paper, Project |
13) ActivityNet Captions - ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time (20k videos, Downloadable) | ArXiv 17 Paper, Project |
14) MSR-VTT - A large-scale video benchmark for video understanding (10k Clips, Downloadable) | CVPR 16 Paper, Project |
15) The Cityscapes Dataset - Benchmark suite and evaluation server for pixel-level, instance-level, and panoptic semantic labeling (Downloadable) | ArXiv 16 Paper, Project |
16) Youku-mPLUG - First open-source large-scale Chinese video text dataset (Downloadable) | ArXiv 23 Paper, Project |
17) VidProM - VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models (6.69M, Downloadable) | ArXiv 24 Paper, Github |
18) Pixabay100 - A video dataset collected from Pixabay (Downloadable) | Github |
1) WebVid - Large-scale text-video dataset, containing 10 million video-text pairs scraped from the stock footage sites (10M video-text pairs) | ArXiv 21 Paper, Project |
Three-stream CNNs for action recognition | PRL 17 Paper |
Dynamic Hand Gesture Recognition Using Multi-direction 3D Convolutional Neural Networks | EL 19 Paper |
Intra-clip Aggregation for Video Person Re-identification | ICIP 20 Paper |
VideoMix: Rethinking Data Augmentation for Video Classification | CVPR 20 Paper |
mixup: Beyond Empirical Risk Minimization | ICLR 17 Paper |
CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features | ICCV 19 Paper |
Video Salient Object Detection via Fully Convolutional Networks | ICIP 18 Paper |
Illumination-Based Data Augmentation for Robust Background Subtraction | SKIMA 19 Paper |
Image editing-based data augmentation for illumination-insensitive background subtraction | EIM 20 Paper |
Universal Semantic Segmentation for Fisheye Urban Driving Images | SMC 20 Paper |
Feature Re-Learning with Data Augmentation for Content-based Video Recommendation | ACM 18 Paper |
GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer | Trans 21 Paper |
Deep Video-Based Performance Cloning | CVPR 18 Paper |
Adversarial Action Data Augmentation for Similar Gesture Action Recognition | IJCNN 19 Paper |
Self-Paced Video Data Augmentation by Generative Adversarial Networks with Insufficient Samples | MM 20 Paper |
GAC-GAN: A General Method for Appearance-Controllable Human Video Motion Transfer | Trans 20 Paper |
Dynamic Facial Expression Generation on Hilbert Hypersphere With Conditional Wasserstein Generative Adversarial Nets | TPAMI 20 Paper |
CrowdGAN: Identity-Free Interactive Crowd Video Generation and Beyond | TPAMI 22 Paper |
Rotationally-Temporally Consistent Novel View Synthesis of Human Performance Video | ECCV 20 Paper |
Autoencoder-based Data Augmentation for Deepfake Detection | ACM 23 Paper |
A data augmentation methodology for training machine/deep learning gait recognition algorithms | CVPR 16 Paper |
ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare Applications | IEEE 21 Paper |
Mid-Air: A Multi-Modal Dataset for Extremely Low Altitude Drone Flights | CVPR 19 Paper |
Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models | IJCV 19 Paper |
Using synthetic data for person tracking under adverse weather conditions | IVC 21 Paper |
Unlimited Road-scene Synthetic Annotation (URSA) Dataset | ITSC 18 Paper |
SAIL-VOS 3D: A Synthetic Dataset and Baselines for Object Detection and 3D Mesh Reconstruction From Video Data | CVPR 21 Paper |
Paper | Link |
1) ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | ICLR 21 Paper, Github |
2) MAE: Masked Autoencoders Are Scalable Vision Learners | CVPR 22 Paper, Github |
3) ViViT: A Video Vision Transformer (-) | ICCV 21 Paper, GitHub |
4) DiT: Scalable Diffusion Models with Transformers (-) | ICCV 23 Paper, GitHub, Project, ModelScope |
5) U-ViT: All are Worth Words: A ViT Backbone for Diffusion Models (-) | CVPR 23 Paper, GitHub, ModelScope |
6) FlexiViT: One Model for All Patch Sizes | Paper, Github |
7) Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | Paper, Github |
8) VQ-VAE: Neural Discrete Representation Learning | Paper, Github |
9) VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | CVPR 21 Paper, Github |
10) LVT: Latent Video Transformer | Paper, Github |
11) VideoGPT: Video Generation using VQ-VAE and Transformers (-) | Paper, GitHub |
12) Predicting Video with VQVAE | Paper |
13) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers | ICLR 23 Paper, Github |
14) TATS: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer | ECCV 22 Paper, Github |
15) MAGVIT: Masked Generative Video Transformer (-) | CVPR 23 Paper, GitHub, Project, Colab |
16) MagViT2: Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | ICLR 24 Paper, Github |
17) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-) | Paper, Project, Blog |
18) CLIP: Learning Transferable Visual Models From Natural Language Supervision | ICML 21 Paper, Github |
19) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Paper, Github |
20) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | Paper, Github |
Paper | Link |
1) World Model on Million-Length Video And Language With RingAttention | Paper, GitHub |
2) Ring Attention with Blockwise Transformers for Near-Infinite Context | Paper, GitHub |
3) Extending LLMs' Context Window with 100 Samples | Paper, GitHub |
4) Efficient Streaming Language Models with Attention Sinks | ICLR 24 Paper, GitHub |
5) The What, Why, and How of Context Length Extension Techniques in Large Language Models – A Detailed Survey | Paper |
6) MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | CVPR 24 Paper, GitHub, Project |
7) MemoryBank: Enhancing Large Language Models with Long-Term Memory | Paper, GitHub |
Paper | Link |
1) Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion | Paper, Github, Blog |
2) MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | CVPR 23 Paper, GitHub |
3) Pengi: An Audio Language Model for Audio Tasks | NeurIPS 23 Paper, GitHub |
4) Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset | NeurIPS 23 Paper, GitHub |
5) Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Paper, GitHub |
6) NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality | TPAMI 24 Paper, GitHub |
7) NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | ICLR 24 Paper, GitHub |
8) UniAudio: An Audio Foundation Model Toward Universal Audio Generation | Paper, GitHub |
9) Diffsound: Discrete Diffusion Model for Text-to-sound Generation | TASLP 22 Paper |
10) AudioGen: Textually Guided Audio Generation | ICLR 23 Paper, Project |
11) AudioLDM: Text-to-audio generation with latent diffusion models | ICML 23 Paper, GitHub, Project, Huggingface |
12) AudioLDM2: Learning Holistic Audio Generation with Self-supervised Pretraining | Paper, GitHub, Project, Huggingface |
13) Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | ICML 23 Paper, GitHub |
14) Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | Paper |
15) TANGO: Text-to-audio generation using instruction-tuned LLM and latent diffusion model | Paper, GitHub, Project, Huggingface |
16) AudioLM: a Language Modeling Approach to Audio Generation | Paper |
17) AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | Paper, GitHub |
18) MusicGen: Simple and Controllable Music Generation | NeurIPS 23 Paper, GitHub |
19) LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | Paper |
20) Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners | CVPR 24 Paper |
21) Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | EMNLP 23 Paper |
22) Audio-Visual LLM for Video Understanding | Paper |
23) VideoPoet: A Large Language Model for Zero-Shot Video Generation (-) | Paper, Project, Blog |
Paper | Link |
1) Consistency Models | Paper, GitHub |
2) Improved Techniques for Training Consistency Models | Paper |
3) Score-Based Diffusion: Score-Based Generative Modeling through Stochastic Differential Equations (-) | ICLR 21 Paper, GitHub, Blog |
4) Improved Techniques for Training Score-Based Generative Models | NeurIPS 20 Paper, GitHub |
5) Generative Modeling by Estimating Gradients of the Data Distribution | NeurIPS 19 Paper, GitHub |
6) Maximum Likelihood Training of Score-Based Diffusion Models | NeurIPS 21 Paper, GitHub |
7) Layered Neural Atlases for Consistent Video Editing | TOG 21 Paper, GitHub, Project |
8) StableVideo: Text-driven Consistency-aware Diffusion Video Editing | ICCV 23 Paper, GitHub, Project |
9) CoDeF: Content Deformation Fields for Temporally Consistent Video Processing | Paper, GitHub, Project |
10) Sora Generates Videos with Stunning Geometrical Consistency | Paper, GitHub, Project |
11) Efficient One-stage Video Object Detection by Exploiting Temporal Consistency | ECCV 22 Paper, GitHub |
12) Bootstrap Motion Forecasting With Self-Consistent Constraints | ICCV 23 Paper |
13) Enforcing Realism and Temporal Consistency for Large-Scale Video Inpainting | Paper |
14) Enhancing Multi-Camera People Tracking with Anchor-Guided Clustering and Spatio-Temporal Consistency ID Re-Assignment | CVPRW 23 Paper, GitHub |
15) Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing | Paper |
16) Semi-Supervised Crowd Counting With Spatial Temporal Consistency and Pseudo-Label Filter | TCSVT 23 Paper |
17) Spatio-temporal Consistency and Hierarchical Matching for Multi-Target Multi-Camera Vehicle Tracking | CVPRW 19 Paper |
18) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning (-) | Paper |
19) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM (-) | Paper |
20) MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask | Paper |
Paper | Link |
1) RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models | Paper, GitHub, Project |
2) Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs | Paper, GitHub |
3) LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models | TMLR 23 Paper, GitHub |
4) LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts | ICLR 24 Paper, GitHub |
5) Progressive Text-to-Image Diffusion with Soft Latent Direction | Paper |
6) Self-correcting LLM-controlled Diffusion Models | CVPR 24 Paper, GitHub |
7) LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation | MM 23 Paper |
8) LayoutGPT: Compositional Visual Planning and Generation with Large Language Models | NeurIPS 23 Paper, GitHub |
9) Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition | Paper, GitHub |
10) InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions | Paper, GitHub |
11) Controllable Text-to-Image Generation with GPT-4 | Paper |
12) LLM-grounded Video Diffusion Models | ICLR 24 Paper |
13) VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning | Paper |
14) FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax | Paper, Github, Project |
15) VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM | Paper |
16) Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator | NeurIPS 23 Paper |
17) Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models | Paper |
18) MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation | Paper |
19) GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | Paper |
20) Multimodal Procedural Planning via Dual Text-Image Prompting | Paper, Github |
21) InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists | ICLR 24 Paper, Github |
22) DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback | Paper |
23) TaleCrafter: Interactive Story Visualization with Multiple Characters | SIGGRAPH Asia 23 Paper |
24) Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis | Paper, Github |
25) COLE: A Hierarchical Generation Framework for Graphic Design | Paper |
26) Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision | Paper |
27) Vlogger: Make Your Dream A Vlog | CVPR 24 Paper, Github |
28) GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting | Paper |
29) MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion | Paper |
30) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | Paper, Github, Project |
Paper | Link |
1) LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models | Paper, GitHub |
2) Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation | Paper, GitHub |
3) CoCa: Contrastive Captioners are Image-Text Foundation Models | Paper, Github |
4) CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion | Paper |
5) VideoChat: Chat-Centric Video Understanding | CVPR 24 Paper, Github |
6) De-Diffusion Makes Text a Strong Cross-Modal Interface | Paper |
7) HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | Paper |
8) ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (-) | Paper, Github |
9) SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data | Paper |
10) LLMGA: Multimodal Large Language Model based Generation Assistant | Paper, Github |
Paper | Link |
1) BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | NeurIPS 23 Paper, Github, Project |
2) LIMA: Less Is More for Alignment | NeurIPS 23 Paper |
3) Jailbroken: How Does LLM Safety Training Fail? | NeurIPS 23 Paper |
4) Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | CVPR 23 Paper |
5) Stable Bias: Evaluating Societal Representations in Diffusion Models | NeurIPS 23 Paper |
6) Ablating Concepts in Text-to-Image Diffusion Models | ICCV 23 Paper, Project |
7) Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models | ICCV 23 Paper, Project |
8) Eternal Sunshine of the Spotless Net: Selective Forgetting in Deep Networks | CVPR 20 Paper |
9) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks | ICML 20 Paper |
10) A pilot study of query-free adversarial attack against stable diffusion | ICCV 23 Paper |
11) Interpretable-Through-Prototypes Deepfake Detection for Diffusion Models | ICCV 23 Paper |
12) Erasing Concepts from Diffusion Models | ICCV 23 Paper, Project |
13) Threat Model-Agnostic Adversarial Defense using Diffusion Models | Paper |
14) How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions? | Paper, Github |
15) Differentially Private Diffusion Models Generate Useful Synthetic Images | Paper |
16) Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models | SIGSAC 23 Paper, Github |
17) Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models | Paper, Github |
18) Unified Concept Editing in Diffusion Models | WACV 24 Paper, Project |
19) Diffusion Model Alignment Using Direct Preference Optimization | Paper |
20) RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment | TMLR 23 Paper, Github |
21) Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation | Paper, Github, Project |
Paper | Link |
1) NExT-GPT: Any-to-Any Multimodal LLM | Paper, GitHub |
Paper | Link |
1) H.261: Video codec for audiovisual services at p x 64 kbit/s | Paper |
2) H.262: Information technology - Generic coding of moving pictures and associated audio information: Video | Paper |
3) H.263: Video coding for low bit rate communication | Paper |
4) H.264: Overview of the H.264/AVC video coding standard | Paper |
5) H.265: Overview of the High Efficiency Video Coding (HEVC) Standard | Paper |
6) H.266: Overview of the Versatile Video Coding (VVC) Standard and its Applications | Paper |
7) DVC: An End-to-end Deep Video Compression Framework | CVPR 19 Paper, GitHub |
8) OpenDVC: An Open Source Implementation of the DVC Video Compression Method | Paper, GitHub |
9) HLVC: Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement | CVPR 20 Paper, Github |
10) RLVC: Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model | J-STSP 21 Paper, Github |
11) PLVC: Perceptual Learned Video Compression with Recurrent Conditional GAN | IJCAI 22 Paper, Github |
12) ALVC: Advancing Learned Video Compression with In-loop Frame Prediction | T-CSVT 22 Paper, Github |
13) DCVC: Deep Contextual Video Compression | NeurIPS 21 Paper, Github |
14) DCVC-TCM: Temporal Context Mining for Learned Video Compression | TM 22 Paper, Github |
15) DCVC-HEM: Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression | MM 22 Paper, Github |
16) DCVC-DC: Neural Video Compression with Diverse Contexts | CVPR 23 Paper, Github |
17) DCVC-FM: Neural Video Compression with Feature Modulation | CVPR 24 Paper, Github |
18) SSF: Scale-Space Flow for End-to-End Optimized Video Compression | CVPR 20 Paper, Github |
Paper | Link |
1) Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Paper, Github |
2) Efficiently Modeling Long Sequences with Structured State Spaces | ICLR 22 Paper, Github |
3) Modeling Sequences with Structured State Spaces | Paper |
4) Long Range Language Modeling via Gated State Spaces | Paper, GitHub |
Paper | Link |
1) Diffusion Models Without Attention | Paper |
2) Pan-Mamba: Effective Pan-Sharpening with State Space Model | Paper, Github |
3) Pretraining Without Attention | Paper, Github |
4) Block-State Transformers | NeurIPS 23 Paper |
5) Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model | Paper, Github |
6) VMamba: Visual State Space Model | Paper, Github |
Paper | Link |
1) Long Movie Clip Classification with State-Space Video Models | ECCV 22 Paper, Github |
2) Selective Structured State-Spaces for Long-Form Video Understanding | CVPR 23 Paper |
3) Efficient Movie Scene Detection Using State-Space Transformers | CVPR 23 Paper, Github |
4) VideoMamba: State Space Model for Efficient Video Understanding | Paper, Github |
Paper | Link |
1) Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining | Paper, Github |
2) MambaIR: A Simple Baseline for Image Restoration with State-Space Model | Paper, Github |
3) VM-UNet: Vision Mamba UNet for Medical Image Segmentation | Paper, Github |
Resources | Link |
1) Datawhale - AI视频生成学习 | Feishu doc |
2) A Survey on Generative Diffusion Model | TKDE 24 Paper, GitHub |
3) Awesome-Video-Diffusion-Models: A Survey on Video Diffusion Models | Paper, GitHub |
4) Awesome-Text-To-Video: A Survey on Text-to-Video Generation/Synthesis | GitHub |
5) video-generation-survey: A reading list of video generation | GitHub |
6) Awesome-Video-Diffusion | GitHub |
7) Video Generation Task in Papers With Code | Task |
8) Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | Paper, GitHub |
9) Open-Sora-Plan (PKU-YuanGroup) | GitHub |
10) State of the Art on Diffusion Models for Visual Computing | Paper |
11) Diffusion Models: A Comprehensive Survey of Methods and Applications | CSUR 24 Paper, GitHub |
12) Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable | Paper |
13) On the Design Fundamentals of Diffusion Models: A Survey | Paper |
14) Efficient Diffusion Models for Vision: A Survey | Paper |
15) Text-to-Image Diffusion Models in Generative AI: A Survey | Paper |
16) Awesome-Diffusion-Transformers | GitHub, Project |
17) Open-Sora (HPC-AI Tech) | GitHub, Blog |
18) LAVIS - A Library for Language-Vision Intelligence | ACL 23 Paper, GitHub, Project |
19) OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference | GitHub |
20) Awesome-Long-Context | GitHub1, GitHub2 |
If this project is helpful to your work, please cite it using the following format:
@misc{minisora,
  title={MiniSora},
  author={MiniSora Community},
  url={https://github.com/mini-sora/minisora},
  year={2024}
}

@misc{minisora,
  title={Diffusion Model-based Video Generation Models From DDPM to Sora: A Survey},
  author={Survey Paper Group of MiniSora Community},
  url={https://github.com/mini-sora/minisora},
  year={2024}
}
We greatly appreciate your contributions to the MiniSora open-source community, which help us make it even better than it is now!
For more details, please refer to the Contribution Guidelines.