Lumina-T2X is a unified framework for Text to Any Modality Generation

Alpha-VLLM/Lumina-T2X

$\textbf{Lumina-T2X}$: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

📰 News

  • [2024-06-08] 🚀🚀🚀 We have released the Lumina-Next-SFT model. MODEL
  • [2024-06-07] 🔥🔥🔥 We have released the Lumina-T2Music (Text-to-Music) code and model for music generation. MODEL DEMO
  • [2024-06-03] 🔥🔥🔥 We have released the Compositional Generation version of Lumina-Next-T2I, which enables compositional generation with multiple captions for different regions. MODEL DEMO
  • [2024-05-29] 🥰🥰🥰 We updated the Lumina-Next-T2I Code and HF Model, which now support 2K-resolution image generation and Time-aware Scaled RoPE.
  • [2024-05-25] We released training scripts for Flag-DiT and Next-DiT, and we have reported the comparison results between Next-DiT and Flag-DiT. Comparison Results
  • [2024-05-21] Lumina-Next-T2I supports a higher-order solver. It can generate images in just 10 steps without any distillation. Try our demos DEMO.
  • [2024-05-18] We released training scripts for Lumina-T2I 5B. README
  • [2024-05-16] ❗❗❗ We have converted the .pth weights to .safetensors weights. Please pull the latest code and use demo.py for inference.
  • [2024-05-14] Lumina-Next now supports simple text-to-music generation (examples), high-resolution (1024*4096) Panorama generation conditioned on text (examples), and 3D point cloud generation conditioned on labels (examples).
  • [2024-05-13] We added examples demonstrating that Lumina-T2X supports multilingual prompts, including prompts containing emojis.
  • [2024-05-12] We released our Lumina-Next-T2I model (checkpoint), which uses a 2B Next-DiT model as the backbone and Gemma-2B as the text encoder. Try it out at demo1 & demo2 & demo3.
  • [2024-05-10] We released the technical report on arXiv.
  • [2024-05-09] We released Lumina-T2A (Text-to-Audio) Demos. Examples
  • [2024-04-29] We released the 5B model checkpoint and demo built upon it for text-to-image generation.
  • [2024-04-25] Support 720P video generation with arbitrary aspect ratio. Examples
  • [2024-04-19] Demo examples released.
  • [2024-04-05] Code released for Lumina-T2I.
  • [2024-04-01] We released the initial version of Lumina-T2I for text-to-image generation.

🚀 Quick Start

Warning

Since we are updating the code frequently, please pull the latest code:

git pull origin main

To help you try the model quickly, we host several versions of the GUI demo site.

Lumina-Next-T2I model demo:

Image Generation: [node1] [node2] [node3]

Image Compositional Generation: [node1]

Music Generation: [node1]

For more details about training and inference, please refer to Lumina-T2I and Lumina-Next-T2I.

Warning

Lumina-T2X employs FSDP for training large diffusion models. FSDP shards parameters, optimizer states, and gradients across GPUs. Thus, at least 8 GPUs are required for full fine-tuning of the Lumina-T2X 5B model. Parameter-efficient fine-tuning of Lumina-T2X will be released soon.
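To get a feel for why sharding matters at this scale, here is a rough back-of-the-envelope sketch in Python. It assumes mixed-precision Adam training with about 16 bytes of state per parameter (2 for half-precision weights, 2 for gradients, 12 for fp32 master weights plus the two Adam moments). That figure is a common rule of thumb, not something this repository states, and it ignores activations, buffers, and the text encoder. The function name `sharded_state_gib` is hypothetical.

```python
def sharded_state_gib(num_params: float, num_gpus: int, bytes_per_param: int = 16) -> float:
    """Approximate per-GPU memory (GiB) for parameters, gradients, and
    optimizer states when all three are sharded evenly across GPUs.

    bytes_per_param=16 is an assumed rule of thumb for mixed-precision Adam.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_gpus / 2**30


if __name__ == "__main__":
    # Compare an unsharded 5B-parameter model with the same model on 8 GPUs.
    for gpus in (1, 8):
        estimate = sharded_state_gib(5e9, gpus)
        print(f"{gpus} GPU(s): ~{estimate:.1f} GiB of training state per GPU")
```

Under these assumptions a 5B model carries roughly 75 GiB of training state unsharded and about 9 GiB per GPU when split across 8 GPUs, before any activation memory is counted.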

To use Lumina-T2I as a library, install it in your environment with:

pip install git+https://github.com/Alpha-VLLM/Lumina-T2X

If you want to contribute to the code, run the commands below to install the development dependencies and the pre-commit hooks:

git clone https://github.com/Alpha-VLLM/Lumina-T2X

cd Lumina-T2X
pip install -e ".[dev]"
pre-commit install
pre-commit

📑 Open-source Plan

  • Lumina-Text2Image (Demos✅, Training✅, Inference✅, Checkpoints✅)
  • Lumina-Text2Video (Demos✅)
  • Lumina-Text2Music (Demos✅, Inference✅, Checkpoints✅)
  • Web Demo
  • CLI Demo

Introduction

We introduce the $\textbf{Lumina-T2X}$ family, a series of text-conditioned Diffusion Transformers (DiT) capable of transforming textual descriptions into vivid images, dynamic videos, detailed multi-view 3D images, and synthesized speech. At the core of Lumina-T2X lies the Flow-based Large Diffusion Transformer (Flag-DiT)—a robust engine that supports up to 7 billion parameters and extends sequence lengths to 128,000 tokens. Drawing inspiration from Sora, Lumina-T2X integrates images, videos, multi-views of 3D objects, and speech spectrograms within a spatial-temporal latent token space, and can generate outputs at any resolution, aspect ratio, and duration.
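"Flow-based" here refers to the flow matching formulation that Lumina-T2X adopts. As a minimal sketch, the pure-Python example below uses one common variant: a straight path from noise (t=0) to data (t=1), a velocity regression target, and fixed-step Euler integration for sampling. The time convention, the function names, and the toy oracle velocity are all assumptions for illustration; the actual model predicts velocity with a Diffusion Transformer and may use a different convention or solver.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [t * d + (1.0 - t) * n for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Constant velocity of the straight path; this is the regression target."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predict_velocity, noise, data, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(noise, data, t)
    pred = predict_velocity(x_t, t)
    target = target_velocity(noise, data)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def euler_sample(predict_velocity, noise, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    data = [1.0, -2.0, 0.5]
    noise = [rng.gauss(0.0, 1.0) for _ in data]

    def oracle(x, t):
        # Exact velocity when the dataset is a single point: (data - x) / (1 - t).
        return [(d - xi) / (1.0 - t) for d, xi in zip(data, x)]

    print("loss with oracle velocity:", flow_matching_loss(oracle, noise, data, t=0.3))
    print("sample after 10 Euler steps:", euler_sample(oracle, noise))
```

With the oracle standing in for a trained network, the loss is numerically zero and the Euler sampler lands on the data point, which is a quick sanity check that the path and the target agree.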

🌟 Features:

  • Flow-based Large Diffusion Transformer (Flag-DiT): Lumina-T2X adopts the flow matching formulation and is equipped with many advanced techniques, such as RoPE, RMSNorm, and KQ-norm, demonstrating faster training convergence, stable training dynamics, and a simplified pipeline.
  • Any Modalities, Resolution, and Duration within One Framework:
    1. $\textbf{Lumina-T2X}$ can encode any modality, including images, videos, multi-views of 3D objects, and spectrograms, into a unified 1-D token sequence at any resolution, aspect ratio, and temporal duration.
    2. By introducing the [nextline] and [nextframe] tokens, our model can support resolution extrapolation, i.e., generating images/videos with out-of-domain resolutions not encountered during training, such as images from 768x768 to 1792x1792 pixels.
  • Low Training Resources: Our empirical observations indicate that employing larger models, high-resolution images, and longer-duration video clips can significantly accelerate the convergence speed of diffusion transformers. Moreover, by employing meticulously curated text-image and text-video pairs featuring high aesthetic quality frames and detailed captions, our $\textbf{Lumina-T2X}$ model learns to generate high-resolution images and coherent videos with minimal computational demands. Remarkably, the default Lumina-T2I configuration, equipped with a 5B Flag-DiT and a 7B LLaMA as the text encoder, requires only 35% of the computational resources needed by PixArt-$\alpha$.
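The unified 1-D token sequence with [nextline] and [nextframe] markers described in the features above can be illustrated with a small sketch. This is a toy version with string tokens: in the real model the markers are learned embeddings, and the placement shown here (one [nextline] after every row, one [nextframe] after every frame) is an assumption. The helper names `flatten_latents` and `make_frame` are hypothetical.

```python
NEXTLINE = "[nextline]"
NEXTFRAME = "[nextframe]"


def flatten_latents(frames):
    """Flatten frames (each a list of rows of tokens) into one 1-D sequence."""
    sequence = []
    for frame in frames:
        for row in frame:
            sequence.extend(row)
            sequence.append(NEXTLINE)  # marks the end of a row
        sequence.append(NEXTFRAME)  # marks the end of a frame
    return sequence


def make_frame(height, width, frame_index=0):
    """Build placeholder tokens named by (frame, row, col) for illustration."""
    return [
        [f"f{frame_index}r{r}c{c}" for c in range(width)]
        for r in range(height)
    ]


if __name__ == "__main__":
    # The same six patch tokens laid out as 2x3 and as 3x2: the markers keep
    # the row boundaries recoverable at either aspect ratio.
    print(flatten_latents([make_frame(2, 3)]))
    print(flatten_latents([make_frame(3, 2)]))
    # A two-frame clip adds one [nextframe] marker per frame.
    clip = [make_frame(2, 2, frame_index=i) for i in range(2)]
    print(len(flatten_latents(clip)))
```

Because the layout is carried by the markers rather than by a fixed grid size, a sequence of F frames of H x W tokens has length F * (H * W + H + 1) for any H, W, and F, which is what lets one model handle varying resolutions and durations.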

📽️ Demo Examples

Demos of Lumina-Next-SFT

Demos of Lumina-T2I


Panorama Generation


Text-to-Video Generation

720P Videos:

Prompt: The majestic beauty of a waterfall cascading down a cliff into a serene lake.

video_720p_1.mp4
video_720p_2.mp4

Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

video_tokyo_woman.mp4

360P Videos:

video_360p.mp4

Text-to-3D Generation

multi_view.mp4

Point Cloud Generation


Text-to-Audio Generation

Note

Attention: Mouse over the playbar and click the audio button on the playbar to unmute it.

Prompt: Semiautomatic gunfire occurs with slight echo

Generated Audio:

semiautomatic_gunfire_occurs_with_slight_echo.mp4

Groundtruth:

semiautomatic_gunfire_occurs_with_slight_echo_gt.mp4

Prompt: A telephone bell rings

Generated Audio:

a_telephone_bell_rings.mp4

Groundtruth:

a_telephone_bell_rings_gt.mp4

Prompt: An engine running followed by the engine revving and tires screeching

Generated Audio:

an_engine_running_followed_by_the_engine_revving_and_tires_screeching.mp4

Groundtruth:

an_engine_running_followed_by_the_engine_revving_and_tires_screeching_gt.mp4

Prompt: Birds chirping with insects buzzing and outdoor ambiance

Generated Audio:

birds_chirping_repeatedly.mp4

Groundtruth:

birds_chirping_repeatedly_gt.mp4

Text-to-music Generation

Note

Attention: Mouse over the playbar and click the audio button on the playbar to unmute it. For more details check out this

Prompt: An electrifying ska tune with prominent saxophone riffs, energetic e-guitar and acoustic drums, lively percussion, soulful keys, groovy e-bass, and a fast tempo that exudes uplifting energy.

Generated Music:

electrifying.ska.mp4

Prompt: A high-energy synth rock/pop song with fast-paced acoustic drums, a triumphant brass/string section, and a thrilling synth lead sound that creates an adventurous atmosphere.

Generated Music:

high_energy.song.mp4

Prompt: An uptempo electronic pop song that incorporates digital drums, digital bass and synthpad sounds.

Generated Music:

uptempo-electronic.mp4

Prompt: A medium-tempo digital keyboard song with a jazzy backing track featuring digital drums, piano, e-bass, trumpet, and acoustic guitar.

Generated Music:

medium-tempo.mp4

Prompt: This low-quality folk song features groovy wooden percussion, bass, piano, and flute melodies, as well as sustained strings and shimmering shakers that create a passionate, happy, and joyful atmosphere.

Generated Music:

low-quality-folk.mp4

Multilingual Generation

We present three multilingual capabilities of Lumina-Next-2B.

Generating Images conditioned on Chinese poems:


Generating Images with multilingual prompts:



Generating Images with emojis:


⚙️ Diverse Configurations

We support diverse configurations, including text encoders, DiTs of different parameter sizes, inference methods, and VAE encoders. Additionally, we offer features such as 1D-RoPE, image enhancement, and more.


Contributors

📄 Citation

@article{gao2024lumina,
      title={Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers},
      author={Gao, Peng and Zhuo, Le and Lin, Ziyi and Liu, Dongyang and Du, Ruoyi and Luo, Xu and Qiu, Longtian and Zhang, Yuhang and others},
      journal={arXiv preprint arXiv:2405.05945},
      year={2024}
}