diff --git a/README_zh.md b/README_zh.md
new file mode 100644
index 0000000..d75c2f0
--- /dev/null
+++ b/README_zh.md
@@ -0,0 +1,808 @@
+
+
+
+ +
+
+# Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
+
+
+
+-----
+
+This repo contains PyTorch model definitions, pre-trained weights and inference/sampling code for our paper exploring Hunyuan-DiT. You can find more visualizations on our [project page](https://dit.hunyuan.tencent.com/).
+
+> [**Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding**](https://arxiv.org/abs/2405.08748)
+
+
+### Multi-turn Text2Image Generation
+Understanding natural language instructions and performing multi-turn interaction with users are important for a
+text-to-image system. Doing so helps build a dynamic and iterative creation process that brings the user’s idea into reality
+step by step. In this section, we detail how we empower Hunyuan-DiT with the ability to perform multi-round
+conversations and image generation. We train an MLLM to understand the multi-round user dialogue
+and output the new text prompt for image generation.
+
+
+
+## 📈 Comparisons
+To comprehensively compare the generation capabilities of Hunyuan-DiT and other models, we constructed a 4-dimensional test set covering Text-Image Consistency, Excluding AI Artifacts, Subject Clarity, and Aesthetics. More than 50 professional evaluators performed the evaluation.
+
+
+| Model | Open Source | Text-Image Consistency (%) | Excluding AI Artifacts (%) | Subject Clarity (%) | Aesthetics (%) | Overall (%) |
+|---|---|---|---|---|---|---|
+| SDXL | ✔ | 64.3 | 60.6 | 91.1 | 76.3 | 42.7 |
+| PixArt-α | ✔ | 68.3 | 60.9 | 93.2 | 77.5 | 45.5 |
+| Playground 2.5 | ✔ | 71.9 | 70.8 | 94.9 | 83.3 | 54.3 |
+| SD 3 | ✘ | 77.1 | 69.3 | 94.6 | 82.5 | 56.7 |
+| MidJourney v6 | ✘ | 73.5 | 80.2 | 93.5 | 87.2 | 63.3 |
+| DALL-E 3 | ✘ | 83.9 | 80.3 | 96.5 | 89.4 | 71.0 |
+| Hunyuan-DiT | ✔ | 74.2 | 74.3 | 95.4 | 86.6 | 59.0 |
+ +
+ +* **Long Text Input** + + ++ +
+
+* **Multi-turn Text2Image Generation**
+
+https://github.com/Tencent/tencent.github.io/assets/27557933/94b4dcc3-104d-44e1-8bb2-dc55108763d1
+
+
+
+---
+
+## 📜 Requirements
+
+This release includes DialogGen (a prompt enhancement model) and Hunyuan-DiT (a text-to-image model).
+
+The following table lists the requirements for running the models (batch size = 1):
+
+|          Model          | --load-4bit (DialogGen) | Minimum GPU Memory |    GPU Model    |
+|:-----------------------:|:-----------------------:|:------------------:|:---------------:|
+| DialogGen + Hunyuan-DiT |            ✘            |        32G         |      A100       |
+| DialogGen + Hunyuan-DiT |            ✔            |        22G         |      A100       |
+|       Hunyuan-DiT       |            -            |        11G         |      A100       |
+|       Hunyuan-DiT       |            -            |        14G         | RTX3090/RTX4090 |
+
+* An NVIDIA GPU with CUDA support is required.
+  * We have tested on V100 and A100 GPUs.
+  * **Minimum**: at least 11GB of GPU memory.
+  * **Recommended**: a GPU with 32GB of memory for better generation quality.
+* Tested operating system: Linux
+
+## 🛠️ Dependencies and Installation
+
+Begin by cloning the repository:
+```shell
+git clone https://github.com/tencent/HunyuanDiT
+cd HunyuanDiT
+```
+
+### Installation Guide for Linux
+
+We provide an `environment.yml` file for setting up a Conda environment.
+Conda's installation instructions are available [here](https://docs.anaconda.com/free/miniconda/index.html).
+
+We recommend CUDA version 11.7 or 12.0+.
+
+```shell
+# 1. Create the conda environment
+conda env create -f environment.yml
+
+# 2. Activate the environment
+conda activate HunyuanDiT
+
+# 3. Install pip dependencies
+python -m pip install -r requirements.txt
+
+# 4. (Optional) Install flash attention v2 for acceleration (requires CUDA 11.6 or above)
+python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.1.2.post3
+```
+
+## 🧱 Download Pretrained Models
+To download the model, first install huggingface-cli. (Detailed instructions are available [here](https://huggingface.co/docs/huggingface_hub/guides/cli).)
+
+```shell
+python -m pip install "huggingface_hub[cli]"
+```
+
+Then download the model using the following commands:
+
+```shell
+# Create a directory named 'ckpts' where the model will be saved, fulfilling the prerequisites for running the demo.
+mkdir ckpts
+# Use the huggingface-cli tool to download the model.
+# The download time may vary from 10 minutes to 1 hour depending on network conditions.
+huggingface-cli download Tencent-Hunyuan/HunyuanDiT --local-dir ./ckpts
+```
+
+Examples of training data | +|||
+ | + | + | + |
青花瓷风格,一只蓝色的鸟儿站在蓝色的花瓶上,周围点缀着白色花朵,背景是白色 (Porcelain style, a blue bird stands on a blue vase, surrounded by white flowers, with a white background.) | +青花瓷风格,这是一幅蓝白相间的陶瓷盘子,上面描绘着一只狐狸和它的幼崽在森林中漫步,背景是白色 (Porcelain style, this is a blue and white ceramic plate depicting a fox and its cubs strolling in the forest, with a white background.) | +青花瓷风格,在黑色背景上,一只蓝色的狼站在蓝白相间的盘子上,周围是树木和月亮 (Porcelain style, on a black background, a blue wolf stands on a blue and white plate, surrounded by trees and the moon.) | +青花瓷风格,在蓝色背景上,一只蓝色蝴蝶和白色花朵被放置在中央 (Porcelain style, on a blue background, a blue butterfly and white flowers are placed in the center.) | +
Examples of inference results | +|||
+ | + | + | + |
青花瓷风格,苏州园林 (Porcelain style, Suzhou Gardens.) | +青花瓷风格,一朵荷花 (Porcelain style, a lotus flower.) | +青花瓷风格,一只羊(Porcelain style, a sheep.) | +青花瓷风格,一个女孩在雨中跳舞(Porcelain style, a girl dancing in the rain.) | +
Condition Input | +||
Canny ControlNet | +Depth ControlNet | +Pose ControlNet | +
在夜晚的酒店门前,一座古老的中国风格的狮子雕像矗立着,它的眼睛闪烁着光芒,仿佛在守护着这座建筑。背景是夜晚的酒店前,构图方式是特写,平视,居中构图。这张照片呈现了真实摄影风格,蕴含了中国雕塑文化,同时展现了神秘氛围 (At night, an ancient Chinese-style lion statue stands in front of the hotel, its eyes gleaming as if guarding the building. The background is the hotel entrance at night, with a close-up, eye-level, and centered composition. This photo presents a realistic photographic style, embodies Chinese sculpture culture, and reveals a mysterious atmosphere.) |
+ 在茂密的森林中,一只黑白相间的熊猫静静地坐在绿树红花中,周围是山川和海洋。背景是白天的森林,光线充足 (In the dense forest, a black and white panda sits quietly in green trees and red flowers, surrounded by mountains, rivers, and the ocean. The background is the forest in a bright environment.) |
+ 一位亚洲女性,身穿绿色上衣,戴着紫色头巾和紫色围巾,站在黑板前。背景是黑板。照片采用近景、平视和居中构图的方式呈现真实摄影风格 (An Asian woman, dressed in a green top, wearing a purple headscarf and a purple scarf, stands in front of a blackboard. The background is the blackboard. The photo is presented in a close-up, eye-level, and centered composition, adopting a realistic photographic style) |
+
+ | + | + + |
ControlNet Output | +||
+ | + | + |
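
Referring back to the requirements table in the 📜 section (batch size = 1), the memory figures can be read as a simple lookup from available GPU memory to the setups that fit. The following is a minimal sketch, not part of the Hunyuan-DiT codebase: the names `Setup`, `REQUIREMENTS`, and `feasible_setups` are hypothetical, and the numbers are copied from that table rather than measured.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Setup:
    """One row of the requirements table (hypothetical helper type)."""
    model: str
    load_4bit: Optional[bool]  # None where the table shows "-"
    min_vram_gb: int
    gpu: str


# Values copied from the requirements table; batch size = 1.
REQUIREMENTS: List[Setup] = [
    Setup("DialogGen + Hunyuan-DiT", False, 32, "A100"),
    Setup("DialogGen + Hunyuan-DiT", True, 22, "A100"),
    Setup("Hunyuan-DiT", None, 11, "A100"),
    Setup("Hunyuan-DiT", None, 14, "RTX3090/RTX4090"),
]


def feasible_setups(vram_gb: float) -> List[Setup]:
    """Return the table rows whose minimum GPU memory fits in vram_gb."""
    return [s for s in REQUIREMENTS if s.min_vram_gb <= vram_gb]


if __name__ == "__main__":
    for setup in feasible_setups(24):
        print(setup)
```

With 24 GB available, for instance, this returns the 4-bit DialogGen + Hunyuan-DiT row and both Hunyuan-DiT rows. The lookup ignores the GPU model column, so treat the result as a rough upper bound on what may run, not a guarantee.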