From 84dbc783871711d7aa989b45c547151aaaf5e17c Mon Sep 17 00:00:00 2001
From: Daoyuan Chen <67475544+yxdyc@users.noreply.github.com>
Date: Wed, 22 Jan 2025 20:13:51 +0800
Subject: [PATCH] Refactor and improve doc for RecipeGallery, DeveloperGuide,
 DistributedProcess and DJ-related Competitions (#561)

* 1. refactor doc for RecipeGallery;
2. improve the doc for developer guide
3. some typo fix, and suitable overview fig size;
4. add link to the added data resplit tool

* add use cases for DJ related competitions

* in use case, add agentscope

* remove []

* unify commas

* fix TOC rendering error

* fix spaces and en version

* fix bad link

* suitable overview fig size in homepage
---
 README.md                                |  62 ++++---
 README_ZH.md                             |  59 ++++---
 configs/README.md                        |  32 ----
 configs/README_ZH.md                     |  33 ----
 configs/data_juicer_recipes/README.md    |  74 --------
 configs/data_juicer_recipes/README_ZH.md |  76 --------
 docs/DeveloperGuide.md                   | 215 ++++++++++++-----------
 docs/DeveloperGuide_ZH.md                | 213 +++++++++++-----------
 docs/Distributed.md                      |   5 +-
 docs/Distributed_ZH.md                   |   5 +-
 docs/RecipeGallery.md                    | 128 ++++++++++++++
 docs/RecipeGallery_ZH.md                 | 127 +++++++++++++
 12 files changed, 564 insertions(+), 465 deletions(-)
 delete mode 100644 configs/README.md
 delete mode 100644 configs/README_ZH.md
 delete mode 100644 configs/data_juicer_recipes/README.md
 delete mode 100644 configs/data_juicer_recipes/README_ZH.md
 create mode 100644 docs/RecipeGallery.md
 create mode 100644 docs/RecipeGallery_ZH.md

diff --git a/README.md b/README.md
index 2dcf55fed..da8fff0b0 100644
--- a/README.md
+++ b/README.md
@@ -28,7 +28,7 @@
 
 
 Data-Juicer is a one-stop system to process text and multimodal data for and with foundation models (typically LLMs).
-We provide a [playground](http://8.138.149.181/) with a managed JupyterLab. [Try Data-Juicer](http://8.138.149.181/) straight away in your browser! If you find Data-Juicer useful for your research or development, please kindly support us by starting it (then be instantly notified of our new releases) and citing our [work](#references).
+We provide a [playground](http://8.138.149.181/) with a managed JupyterLab. [Try Data-Juicer](http://8.138.149.181/) straight away in your browser! If you find Data-Juicer useful for your research or development, please kindly support us by starring it (then be instantly notified of our new releases) and citing our [works](#references).
 
 [Platform for AI of Alibaba Cloud (PAI)](https://www.aliyun.com/product/bigdata/learn) has cited our work and integrated Data-Juicer into its data processing products. PAI is an AI Native large model and AIGC engineering platform that provides dataset management, computing power management, model tool chain, model development, model training, model deployment, and AI asset management. For documentation on data processing, please refer to: [PAI-Data Processing for Large Models](https://help.aliyun.com/zh/pai/user-guide/components-related-to-data-processing-for-foundation-models/?spm=a2c4g.11186623.0.0.3e9821a69kWdvX).
 
@@ -101,7 +101,7 @@ Table of Contents
 
 ## Why Data-Juicer?
 
-![Overview](https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps-4034-4146.png)
+<img src="https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps-4034-4146.png" align="center" width="600" />
 
 - **Systematic & Reusable**:
   Empowering users with a systematic library of 100+ core [OPs](docs/Operators.md), and 50+ reusable config recipes and 
@@ -121,35 +121,51 @@ Table of Contents
 
 
 ## DJ-Cookbook
+
 ### Curated Resources
 - [KDD-Tutorial](https://modelscope.github.io/data-juicer/_static/tutorial_kdd24.html)
 - [Awesome LLM-Data](docs/awesome_llm_data.md)
 - ["Bad" Data Exhibition](docs/BadDataExhibition.md)
 
+
 ### Coding with Data-Juicer (DJ)
-- [Overview of DJ](README.md)
-- [Operator Zoo](docs/Operators.md)
-- [Quick Start](#quick-start)
-- [Configuration](configs/README.md)
-- [Developer Guide](docs/DeveloperGuide.md)
-- [API references](https://modelscope.github.io/data-juicer/)
-- [Preprocess Tools](tools/preprocess/README.md)
-- [Postprocess Tools](tools/postprocess/README.md)
-- [Format Conversion](tools/fmt_conversion/README.md)
-- [Sandbox](docs/Sandbox.md)
-- [Quality Classifier](tools/quality_classifier/README.md)
-- [Auto Evaluation](tools/evaluator/README.md)
-- [Third-parties Integration](thirdparty/LLM_ecosystems/README.md)
+- Basics
+  - [Overview of DJ](README.md)
+  - [Quick Start](#quick-start)
+  - [Configuration](docs/RecipeGallery.md)
+  - [Data Format Conversion](tools/fmt_conversion/README.md)
+- Lookup Materials
+  - [DJ OperatorZoo](docs/Operators.md)
+  - [API references](https://modelscope.github.io/data-juicer/)
+- Advanced
+  - [Developer Guide](docs/DeveloperGuide.md)
+  - [Preprocess Tools](tools/preprocess/README.md)
+  - [Postprocess Tools](tools/postprocess/README.md)
+  - [Sandbox](docs/Sandbox.md)
+  - [Quality Classifier](tools/quality_classifier/README.md)
+  - [Auto Evaluation](tools/evaluator/README.md)
+  - [Third-party Integration](thirdparty/LLM_ecosystems/README.md)
+
 
 ### Use Cases & Data Recipes
-- [Recipes for data process in BLOOM](configs/reproduced_bloom/README.md)
-- [Recipes for data process in RedPajama](configs/reproduced_redpajama/README.md)
-- [Refined recipes for pre-training text data](configs/data_juicer_recipes/README.md)
-- [Refined recipes for fine-tuning text data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-alpaca-cot-dataset)
-- [Refined recipes for pre-training multi-modal data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-multimodal-dataset)
+- [Data Recipe Gallery](docs/RecipeGallery.md)
+  - Data-Juicer Minimal Example Recipe
+  - Reproducing Open Source Text Datasets
+  - Improving Open Source Text Pre-training Datasets
+  - Improving Open Source Text Post-processing Datasets
+  - Synthetic Contrastive Learning Image-text Datasets
+  - Improving Open Source Image-text Datasets
+  - Basic Example Recipes for Video Data
+  - Synthesizing Human-centered Video Evaluation Sets
+  - Improving Existing Open Source Video Datasets
+- Data-Juicer related Competitions
+  - [Better Synth](https://tianchi.aliyun.com/competition/entrance/532251), explore the impact of large-model synthetic data on image-understanding ability, using the DJ-Sandbox Lab and multimodal large models
+  - [Modelscope-Sora Challenge](https://tianchi.aliyun.com/competition/entrance/532219), based on the Data-Juicer and [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) frameworks, optimize text-video datasets and train SORA-like small models to generate better videos
+  - [Better Mixture](https://tianchi.aliyun.com/competition/entrance/532174), adjust only the data mixing and sampling strategies for a given set of candidate datasets
+  - FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for a specified candidate dataset, adjust only the data filtering and enhancement strategies
+  - [Kolors-LoRA Stylized Story Challenge](https://tianchi.aliyun.com/competition/entrance/532254), based on the Data-Juicer and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) frameworks, explore Diffusion model fine-tuning
 - [DJ-SORA](docs/DJ_SORA.md)
-- [Agentic Filters of DJ](./demos/api_service/react_data_filter_process.ipynb)
-- [Agentic Mappers of DJ](./demos/api_service/react_data_mapper_process.ipynb)
+- Based on the Data-Juicer and [AgentScope](https://github.com/modelscope/agentscope) frameworks, leverage agents to [call DJ Filters](./demos/api_service/react_data_filter_process.ipynb) and [call DJ Mappers](./demos/api_service/react_data_mapper_process.ipynb)
 
 
 ### Interactive Examples
@@ -466,7 +482,7 @@ features, bug fixes, and better documentation. Please refer to
 Data-Juicer is used across various foundation model applications and research initiatives, such as industrial scenarios in Alibaba Tongyi and Alibaba Cloud's platform for AI (PAI).
 We look forward to more of your experience, suggestions, and discussions for collaboration!
 
-Data-Juicer thanks many community [contributers](https://github.com/modelscope/data-juicer/graphs/contributors) and open-source projects, such as
+Data-Juicer thanks many community [contributors](https://github.com/modelscope/data-juicer/graphs/contributors) and open-source projects, such as
 [Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), ....
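
Both cookbooks now route "Configuration" to the Recipe Gallery instead of the removed configs/README.md. For orientation, a minimal recipe looks roughly like the sketch below; the field names follow Data-Juicer's demo configs, but the paths and the chosen op here are illustrative assumptions rather than a specific shipped recipe:

```yaml
# Minimal Data-Juicer recipe sketch (illustrative; paths and op are assumptions).
project_name: 'demo-process'
dataset_path: './demos/data/demo-dataset.jsonl'  # input dataset
export_path: './outputs/demo-processed.jsonl'    # where results are written
np: 4                                            # number of worker processes

process:
  - language_id_score_filter:  # keep samples confidently identified as English
      lang: 'en'
      min_score: 0.8
```

Such a recipe is run with `python tools/process_data.py --config xxx.yaml` and inspected with `python tools/analyze_data.py --config xxx.yaml`, as the removed configs/README.md documented.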
 
 
diff --git a/README_ZH.md b/README_ZH.md
index 427007dbd..8634a9b5d 100644
--- a/README_ZH.md
+++ b/README_ZH.md
@@ -94,7 +94,7 @@ Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多
 
 ## 为什么选择 Data-Juicer?
 
-![概述](https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps-4034-4146.png)
+<img src="https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps-4034-4146.png" align="center" width="600" />
 
 - **系统化和可重用**:
 系统化地为用户提供 100 多个核心 [算子](docs/Operators.md) 和 50 多个可重用的数据菜谱和
@@ -116,30 +116,43 @@ Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多
 - [“坏”数据展览](docs/BadDataExhibition_ZH.md)
 
 ### 编写Data-Juicer (DJ) 代码
-- [DJ概览](README_ZH.md)
-- [算子库](docs/Operators.md)
-- [快速上手](#快速上手)
-- [配置](configs/README_ZH.md)
-- [开发者指南](docs/DeveloperGuide_ZH.md)
-- [API参考](https://modelscope.github.io/data-juicer/)
-- [预处理工具](tools/preprocess/README_ZH.md)
-- [后处理工具](tools/postprocess/README_ZH.md)
-- [格式转换](tools/fmt_conversion/README_ZH.md)
-- [沙盒](docs/Sandbox-ZH.md)
-- [质量分类器](tools/quality_classifier/README_ZH.md)
-- [自动评估](tools/evaluator/README_ZH.md)
-- [第三方集成](thirdparty/LLM_ecosystems/README_ZH.md)
+- 基础
+  - [DJ概览](README_ZH.md)
+  - [快速上手](#快速上手)
+  - [配置](docs/RecipeGallery_ZH.md)
+  - [数据格式转换](tools/fmt_conversion/README_ZH.md)
+- 信息速查
+  - [算子库](docs/Operators.md)
+  - [API参考](https://modelscope.github.io/data-juicer/)
+- 进阶
+  - [开发者指南](docs/DeveloperGuide_ZH.md)
+  - [预处理工具](tools/preprocess/README_ZH.md)
+  - [后处理工具](tools/postprocess/README_ZH.md)
+  - [沙盒](docs/Sandbox-ZH.md)
+  - [质量分类器](tools/quality_classifier/README_ZH.md)
+  - [自动评估](tools/evaluator/README_ZH.md)
+  - [第三方集成](thirdparty/LLM_ecosystems/README_ZH.md)
 
 ### 用例与数据菜谱
-
-* [BLOOM 数据处理菜谱](configs/reproduced_bloom/README_ZH.md)
-* [RedPajama 数据处理菜谱](configs/reproduced_redpajama/README_ZH.md)
-* [预训练文本数据增强菜谱](configs/data_juicer_recipes/README_ZH.md)
-* [Fine-tuning文本数据增强菜谱](configs/data_juicer_recipes/README_ZH.md#完善前后的alpaca-cot数据集)
-* [预训练多模态数据增强菜谱](configs/data_juicer_recipes/README_ZH.md#before-and-after-refining-for-multimodal-dataset)
+* [数据菜谱Gallery](docs/RecipeGallery_ZH.md)
+  - Data-Juicer 最小示例配方
+  - 复现开源文本数据集
+  - 改进开源文本预训练数据集
+  - 改进开源文本后处理数据集
+  - 合成对比学习图像文本数据集
+  - 改进开源图像文本数据集
+  - 视频数据的基本示例菜谱
+  - 合成以人为中心的视频评测集
+  - 改进现有的开源视频数据集
+* Data-Juicer相关竞赛
+  - [Better Synth](https://tianchi.aliyun.com/competition/entrance/532251),在DJ-沙盒实验室和多模态大模型上,探索大模型合成数据对图像理解能力的影响
+  - [Modelscope-Sora挑战赛](https://tianchi.aliyun.com/competition/entrance/532219),基于Data-Juicer和[EasyAnimate](https://github.com/aigc-apps/EasyAnimate)框架,调优文本-视频数据集,在类SORA小模型上训练以生成更好的视频
+  - [Better Mixture](https://tianchi.aliyun.com/competition/entrance/532174),针对指定多个候选数据集,仅调整数据混合和采样策略
+  - FT-Data Ranker([1B Track](https://tianchi.aliyun.com/competition/entrance/532157)、[7B Track](https://tianchi.aliyun.com/competition/entrance/532158)),针对指定候选数据集,仅调整数据过滤和增强策略
+  - [可图Kolors-LoRA风格故事挑战赛](https://tianchi.aliyun.com/competition/entrance/532254),基于Data-Juicer和[DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio)框架,探索Diffusion模型微调
 * [DJ-SORA](docs/DJ_SORA_ZH.md)
-* [智能体调用DJ Filters](./demos/api_service/react_data_filter_process.ipynb)
-* [智能体调用DJ Mappers](./demos/api_service/react_data_mapper_process.ipynb)
+* 基于Data-Juicer和[AgentScope](https://github.com/modelscope/agentscope)框架,通过[智能体调用DJ Filters](./demos/api_service/react_data_filter_process.ipynb)和[调用DJ Mappers](./demos/api_service/react_data_mapper_process.ipynb)
 
 
 ### 交互类示例
@@ -449,7 +462,7 @@ Data-Juicer 在 Apache License 2.0 协议下发布。
 
 ## 致谢
 
-Data-Juicer被许多大模型相关产品和研究工作所使用,例子阿里巴巴通义和阿里云人工智能平台 (PAI) 之上的工业界场景。 我们期待更多您的体验反馈、建议和合作共建!
+Data-Juicer被许多大模型相关产品和研究工作所使用,例如阿里巴巴通义和阿里云人工智能平台 (PAI) 之上的工业界场景。 我们期待更多您的体验反馈、建议和合作共建!
 
 
 Data-Juicer 感谢社区[贡献者](https://github.com/modelscope/data-juicer/graphs/contributors) 和相关的先驱开源项目,譬如[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), ....
diff --git a/configs/README.md b/configs/README.md
deleted file mode 100644
index dd4dba3cb..000000000
--- a/configs/README.md
+++ /dev/null
@@ -1,32 +0,0 @@
-# Config Files
-
-This folder contains some configuration files to allow users to easily understand the configuration methods of various functions and quickly reproduce the processing flow of different datasets.
-
-## Usage
-
-```shell
-# To process your dataset.
-python tools/process_data.py --config xxx.yaml
-# To analyze your dataset.
-python tools/analyze_data.py --config xxx.yaml
-```
-
-## Categories
-
-The current configuration files are classified into the subsequent categories.
-
-### Demo
-
-Demo configuration files are used to help users quickly familiarize the basic functions of Data-Juicer. Please refer to the [demo](demo) folder for details.
-
-
-### Reproduced Redpajama
-
-We have reproduced the processing flow of some RedPajama datasets. Please refer to the [reproduced_redpajama](reproduced_redpajama) folder for details.
-
-### Reproduced BLOOM
-
-We have reproduced the processing flow of some BLOOM datasets. please refer to the [reproduced_bloom](reproduced_bloom) folder for details.
-
-### Data-Juicer Recipes
-We have refined some open source datasets (including CFT datasets) by using Data-Juicer and have provided configuration files for the refined flow. please refer to the [data_juicer_recipes](data_juicer_recipes) folder for details.
diff --git a/configs/README_ZH.md b/configs/README_ZH.md
deleted file mode 100644
index 041f565a2..000000000
--- a/configs/README_ZH.md
+++ /dev/null
@@ -1,33 +0,0 @@
-# 配置文件
-
-此文件夹包含一些配置文件,帮助用户轻松理解各种功能的配置方法,并快速复现开源数据集的处理流程。
-
-## 用法
-
-```shell
-#处理数据集
-python tools/process_data.py --config xxx.yaml
-
-#分析数据集
-python tools/analyze_data.py --config xxx.yaml
-```
-
-## 分类
-
-配置文件分为以下几类。
-
-### Demo
-
-Demo 配置文件用于帮助用户快速熟悉 Data-Juicer 的基本功能,请参阅 [demo](demo) 文件夹以获取详细说明。
-
-
-### 复现的Redpajama
-
-我们已经复现了部分 Redpajama 数据集的处理流程,请参阅 [reproduced_redpajama](reproduced_redpajama) 文件夹以获取详细说明。
-
-### 复现的BLOOM
-
-我们已经重现了部分 BLOOM 数据集的处理流程,请参阅 [reproduced_bloom](reproduced_bloom) 文件夹以获取详细说明。
-
-### Data-Juicer 菜谱
-我们使用 Data-Juicer 更细致地处理了一些开源数据集(包含 CFT 数据集),并提供了处理流程的配置文件。请参阅 [data_juicer_recipes](data_juicer_recipes) 文件夹以获取详细说明。
diff --git a/configs/data_juicer_recipes/README.md b/configs/data_juicer_recipes/README.md
deleted file mode 100644
index 14febd24f..000000000
--- a/configs/data_juicer_recipes/README.md
+++ /dev/null
@@ -1,74 +0,0 @@
-# Refined open source dataset by Data-Juicer
-
-We found that there are still some "bad" samples in existing processed datasets (e.g. RedPajama, The Pile.). So we use our Data-Juicer to refine them and try to feed them to LLMs for better performance.
-
-We use simple 3-σ rule to set the hyperparameters for ops in each recipe.
-
-## Before and after refining for Pretraining Text Dataset
-
-| subset               |       #samples before       | #samples after | keep ratio | config link                                                                                                                                                                                                                        | data link                                                                                                                                                                                                                                                                                  | source                  |
-|----------------------|:---------------------------:|:--------------:|:----------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
-| arXiv                |          1,724,497          |   1,655,259    |   95.99%   | [redpajama-arxiv-refine.yaml](redpajama-arxiv-refine.yaml)                                                                                                                                                                         | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-arxiv-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-arxiv-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer)                                        | Redpajama               |
-| Books                |           205,182           |    195,983     |   95.51%   | [redpajama-book-refine.yaml](redpajama-book-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-book-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-book-refined-by-data-juicer/summary)   <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-book-refined-by-data-juicer)                                        | Redpajama               |
-| Wikipedia            |         29,834,171          |   26,990,659   |   90.47%   | [redpajama-wiki-refine.yaml](redpajama-wiki-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-wiki-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-wiki-refined-by-data-juicer/summary)   <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-wiki-refined-by-data-juicer)                                        | Redpajama               |
-| C4                   |         364,868,892         |  344,491,171   |   94.42%   | [redpajama-c4-refine.yaml](redpajama-c4-refine.yaml)                                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-c4-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-c4-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer)                                             | Redpajama               |
-| Common Crawl 2019-30 |         81,085,420          |   36,557,283   |   45.08%   | [redpajama-cc-2019-30-refine.yaml](redpajama-cc-2019-30-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2019-30-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2019-30-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer)  | Redpajama               |
-| Common Crawl 2020-05 |         90,850,492          |   42,612,596   |   46.90%   | [redpajama-cc-2020-05-refine.yaml](redpajama-cc-2020-05-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2020-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2020-05-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2020-05-refined-by-data-juicer)  | Redpajama               |
-| Common Crawl 2021-04 |         98,878,523          |   44,724,752   |   45.23%   | [redpajama-cc-2021-04-refine.yaml](redpajama-cc-2021-04-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2021-04-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2021-04-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer)  | Redpajama               |
-| Common Crawl 2022-05 |         94,058,868          |   42,648,496   |   45.34%   | [redpajama-cc-2022-05-refine.yaml](redpajama-cc-2022-05-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2022-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2022-05-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer)  | Redpajama               |
-| Common Crawl 2023-06 |         111,402,716         |   50,643,699   |   45.46%   | [redpajama-cc-2023-06-refine.yaml](redpajama-cc-2023-06-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2023-06-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2023-06-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2023-06-refined-by-data-juicer) | Redpajama               |
-| Github Code          | 73,208,524 <br>+ 21,387,703 |   49,279,344   |   52.09%   | [redpajama-code-refine.yaml](github_code/redpajama-code-refine.yaml)<br>[stack-code-refine.yaml](github_code/stack-code-refine.yaml)<br>[redpajama-stack-code-deduplicate.yaml](github_code/redpajama-stack-code-deduplicate.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-stack-code-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-stack-code-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-stack-code-refined-by-data-juicer)                             | Redpajama<br>The Stack  |
-| StackExchange        |         45,447,328          |   26,309,203   |   57.89%   | [redpajama-pile-stackexchange-refine.yaml](redpajama-pile-stackexchange-refine.yaml)                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-pile-stackexchange-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-pile-stackexchange-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-pile-stackexchange-refined-by-data-juicer)             | Redpajama<br>The Pile   |
-| EuroParl             |           69,814            |     61,601     |   88.23%   | [pile-europarl-refine.yaml](pile-europarl-refine.yaml)                                                                                                                                                                             | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-europarl-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-europarl-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-europarl-refined-by-data-juicer)                                   | The Pile                |
-| FreeLaw              |          3,562,015          |   2,942,612    |   82.61%   | [pile-freelaw-refine.yaml](pile-freelaw-refine.yaml)                                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-freelaw-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-freelaw-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-freelaw-refined-by-data-juicer)                                     | The Pile                |
-| HackerNews           |           373,027           |    371,331     |   99.55%   | [pile-hackernews-refine.yaml](pile-hackernews-refine.yaml)                                                                                                                                                                         | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hackernews-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-hackernews-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-hackernews-refined-by-data-juicer)                               | The Pile                |
-| NIH ExPorter         |           939,661           |    858,492     |   91.36%   | [pile-nih-refine.yaml](pile-nih-refine.yaml)                                                                                                                                                                                       | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hin-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-nih-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-nih-refined-by-data-juicer)                                             | The Pile                |
-| PhilPapers           |           32,782            |     29,117     |   88.82%   | [pile-philpaper-refine.yaml](pile-philpaper-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-philpaper-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-philpaper-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-philpaper-refined-by-data-juicer)                                 | The Pile                |
-| PubMed Abstracts     |         15,518,009          |   15,009,325   |   96.72%   | [pile-pubmed-abstract-refine.yaml](pile-pubmed-abstract-refine.yaml)                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-abstract-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-abstracts-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-abstracts-refined-by-data-juicer)                    | The Pile                |
-| PubMed Central       |          3,098,930          |   2,694,860    |   86.96%   | [pile-pubmed-central-refine.yaml](pile-pubmed-central-refine.yaml)                                                                                                                                                                 | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-central-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-central-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-central-refined-by-data-juicer)                        | The Pile                |
-| USPTO                |          5,883,024          |   4,516,283    |   76.77%   | [pile-uspto-refine.yaml](pile-uspto-refine.yaml)                                                                                                                                                                                   | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-uspto-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-uspto-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-uspto-refined-by-data-juicer) | The Pile                |
-
-
-## Before and after refining for Alpaca-CoT Dataset
-
-| subset | #samples before     |             #samples after             | keep ratio | config link                                                                                                                                                                                                                        | data link                                                                                                                                                         | source                 |
-|------------------|:-------------------------:|:--------------------------------------:|:----------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
-| Alpaca-Cot EN | 136,219,879               | 72,855,345 |   54.48%   | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml)                                                                                                                                                                         | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer)                          | [39 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info)              |
-| Alpaca-Cot ZH | 21,197,246               |               9,873,214                |   46.58%   | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml)                                                                                                                                                                         | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer)                          | [28 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info)              |
-
-## Before and after refining for Multimodal Dataset
-
-| subset                    |       #samples before       | #samples after | keep ratio | config link                          | data link                                                                                                                                                                                                                                                                                 | source        |
-|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
-| LLaVA pretrain (LCS-558k) |          558,128          |   500,380    |   89.65%   | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer)                                        | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
-| Data-Juicer (T2V, 147k) |          1,217,346          |   147,176    |   12.09%   | [data-juicer-sandbox-optimal.yaml](data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
-| Data-Juicer (DJ, 228k) |          3,408,553          |   227,867    |   8.15%   | [data-juicer-sandbox-self-evolution.yaml](data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
-
-### Evaluation Results
-- LLaVA pretrain (LCS-558k): models **pretrained with the refined dataset** and fine-tuned with the original instruction dataset outperform the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
-
-| model                         | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
-|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| LLaVA-1.5-13B <br> (baseline) | **80.0**  | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
-| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
-- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models **trained with the refined datasets** outperform the baseline ([T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo)) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), which in turn is the teacher model of Data-Juicer (DJ, 228k). Please refer to [Sandbox](../../docs/Sandbox.md) for more details.
-
-| model                         | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
-|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
-| T2V-Turbo               | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
-| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
-| Data-Juicer (DJ, 228k)  | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |
-
-| model                         | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
-|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| T2V-Turbo               | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
-| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
-| Data-Juicer (DJ, 228k)  | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |
-
-## For Video Dataset
-
-We provide an example recipe for processing video datasets, [general-video-refine-example.yaml](general-video-refine-example.yaml), to help users make better use of video-related OPs. It applies three types of OPs:
-- Text-Only: improve the dataset quality based on the video captions.
-- Video-Only: improve the dataset quality based on the video features.
-- Text-Video: improve the dataset quality based on the alignment between text and videos.
-Users can process their own video datasets starting from this recipe.
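The three OP types above can be combined in a single recipe file. Below is a hedged, abridged sketch in the recipe format: the OP names are drawn from Data-Juicer's operator pool, but the thresholds and dataset paths are illustrative assumptions; consult the actual `general-video-refine-example.yaml` for the real configuration.

```yaml
dataset_path: './videos.jsonl'       # illustrative input path
export_path: './videos-refined.jsonl'

process:
  # Text-Only: filter samples by caption language confidence
  - language_id_score_filter:
      lang: 'en'
      min_score: 0.8
  # Video-Only: filter samples by video duration (seconds)
  - video_duration_filter:
      min_duration: 2
      max_duration: 60
  # Text-Video: filter samples by text-frame alignment score
  - video_frames_text_similarity_filter:
      min_score: 0.25
```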
diff --git a/configs/data_juicer_recipes/README_ZH.md b/configs/data_juicer_recipes/README_ZH.md
deleted file mode 100644
index f6767dc60..000000000
--- a/configs/data_juicer_recipes/README_ZH.md
+++ /dev/null
@@ -1,76 +0,0 @@
-# 使用Data-Juicer完善开源数据集
-
-我们发现在现有的已经处理过的数据集(如 Redpajama、The Pile 等)中仍然存在一些“脏”数据样本。所以我们使用我们的 Data-Juicer 来完善这些数据集,并尝试将它们提供给 LLM 以获得更好的性能。
-
-我们使用简单的 3-σ 规则来设置每个数据处理菜谱中的算子的超参数。
-
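The 3-σ rule mentioned above can be sketched as follows: for each per-sample statistic produced by analysis, keep samples whose value lies within three standard deviations of the mean, i.e., set the OP's min/max hyperparameters to `mean ± 3σ`. A minimal self-contained sketch (the helper name and the sample statistics are illustrative, not part of Data-Juicer):

```python
import statistics

def three_sigma_bounds(values):
    """Return (lower, upper) = (mean - 3*std, mean + 3*std) for a stat."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std

# Hypothetical per-sample text lengths collected by an analyzer pass.
lengths = [100, 120, 110, 105, 115, 95, 130, 90]
low, high = three_sigma_bounds(lengths)
# `low` and `high` would then be used as the filter OP's min/max thresholds.
```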
-## 完善前后的预训练数据集
-
-| 数据子集                 |          完善前的样本数目           |    完善后的样本数目    |   样本保留率   | 配置链接                                                                                                                                                                                                                                | 数据链接                                                                                                                                                                                                                                                                                       | 来源                       |
-|----------------------|:---------------------------:|:--------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------|
-| arXiv                |          1,724,497          |   1,655,259    |   95.99%   | [redpajama-arxiv-refine.yaml](redpajama-arxiv-refine.yaml)                                                                                                                                                                         | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-arxiv-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-arxiv-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer)                                        | Redpajama               |
-| Books                |           205,182           |    195,983     |   95.51%   | [redpajama-book-refine.yaml](redpajama-book-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-book-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-book-refined-by-data-juicer/summary)   <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-book-refined-by-data-juicer)                                        | Redpajama               |
-| Wikipedia            |         29,834,171          |   26,990,659   |   90.47%   | [redpajama-wiki-refine.yaml](redpajama-wiki-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-wiki-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-wiki-refined-by-data-juicer/summary)   <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-wiki-refined-by-data-juicer)                                        | Redpajama               |
-| C4                   |         364,868,892         |  344,491,171   |   94.42%   | [redpajama-c4-refine.yaml](redpajama-c4-refine.yaml)                                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-c4-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-c4-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer)                                             | Redpajama               |
-| Common Crawl 2019-30 |         81,085,420          |   36,557,283   |   45.08%   | [redpajama-cc-2019-30-refine.yaml](redpajama-cc-2019-30-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2019-30-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2019-30-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer)  | Redpajama               |
-| Common Crawl 2020-05 |         90,850,492          |   42,612,596   |   46.90%   | [redpajama-cc-2020-05-refine.yaml](redpajama-cc-2020-05-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2020-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2020-05-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2020-05-refined-by-data-juicer)  | Redpajama               |
-| Common Crawl 2021-04 |         98,878,523          |   44,724,752   |   45.23%   | [redpajama-cc-2021-04-refine.yaml](redpajama-cc-2021-04-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2021-04-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2021-04-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer)  | Redpajama               |
-| Common Crawl 2022-05 |         94,058,868          |   42,648,496   |   45.34%   | [redpajama-cc-2022-05-refine.yaml](redpajama-cc-2022-05-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2022-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2022-05-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer)  | Redpajama               |
-| Common Crawl 2023-06 |         111,402,716         |   50,643,699   |   45.46%   | [redpajama-cc-2023-06-refine.yaml](redpajama-cc-2023-06-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2023-06-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2023-06-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2023-06-refined-by-data-juicer) | Redpajama               |
-| Github Code          | 73,208,524 <br>+ 21,387,703 |   49,279,344   |   52.09%   | [redpajama-code-refine.yaml](github_code/redpajama-code-refine.yaml)<br>[stack-code-refine.yaml](github_code/stack-code-refine.yaml)<br>[redpajama-stack-code-deduplicate.yaml](github_code/redpajama-stack-code-deduplicate.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-stack-code-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-stack-code-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-stack-code-refined-by-data-juicer)                             | Redpajama<br>The Stack  |
-| StackExchange        |         45,447,328          |   26,309,203   |   57.89%   | [redpajama-pile-stackexchange-refine.yaml](redpajama-pile-stackexchange-refine.yaml)                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-pile-stackexchange-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-pile-stackexchange-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-pile-stackexchange-refined-by-data-juicer)             | Redpajama<br>The Pile   |
-| EuroParl             |           69,814            |     61,601     |   88.23%   | [pile-europarl-refine.yaml](pile-europarl-refine.yaml)                                                                                                                                                                             | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-europarl-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-europarl-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-europarl-refined-by-data-juicer)                                   | The Pile                |
-| FreeLaw              |          3,562,015          |   2,942,612    |   82.61%   | [pile-freelaw-refine.yaml](pile-freelaw-refine.yaml)                                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-freelaw-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-freelaw-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-freelaw-refined-by-data-juicer)                                     | The Pile                |
-| HackerNews           |           373,027           |    371,331     |   99.55%   | [pile-hackernews-refine.yaml](pile-hackernews-refine.yaml)                                                                                                                                                                         | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hackernews-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-hackernews-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-hackernews-refined-by-data-juicer)                               | The Pile                |
-| NIH ExPorter         |           939,661           |    858,492     |   91.36%   | [pile-nih-refine.yaml](pile-nih-refine.yaml)                                                                                                                                                                                       | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hin-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-nih-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-nih-refined-by-data-juicer)                                             | The Pile                |
-| PhilPapers           |           32,782            |     29,117     |   88.82%   | [pile-philpaper-refine.yaml](pile-philpaper-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-philpaper-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-philpaper-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-philpaper-refined-by-data-juicer)                                 | The Pile                |
-| PubMed Abstracts     |         15,518,009          |   15,009,325   |   96.72%   | [pile-pubmed-abstract-refine.yaml](pile-pubmed-abstract-refine.yaml)                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-abstract-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-abstracts-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-abstracts-refined-by-data-juicer)                    | The Pile                |
-| PubMed Central       |          3,098,930          |   2,694,860    |   86.96%   | [pile-pubmed-central-refine.yaml](pile-pubmed-central-refine.yaml)                                                                                                                                                                 | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-central-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-central-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-central-refined-by-data-juicer)                        | The Pile                |
-| USPTO                |          5,883,024          |   4,516,283    |   76.77%   | [pile-uspto-refine.yaml](pile-uspto-refine.yaml)                                                                                                                                                                                   | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-uspto-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-uspto-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-uspto-refined-by-data-juicer) | The Pile                |
-
-
-## 完善前后的Alpaca-CoT数据集
-
-| 数据子集              |         完善前的样本数目         |              完善后的样本数目              |   样本保留率   | 配置链接                                                                                                                                                                                                                                | 数据链接                                                                                                                                                                                                                                     | 来源                                            |
-|-------------------|:------------------------:|:----------------------------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|
-| Alpaca-Cot EN     |       136,219,879        | 72,855,345 |   54.48%   | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml)                                                                                                                                                                   | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer)   | [来自Alpaca-CoT的39个子集](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
-| Alpaca-Cot ZH     |        21,197,246        |             9,873,214              |  46.58%   | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml)                                                                                                                                                                   | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer)   | [来自Alpaca-CoT的28个子集](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
-
-## 完善前后的多模态数据集
-
-| 数据子集                    |      完善前的样本数目       | 完善后的样本数目 | 样本保留率 | 配置链接                          | 数据链接                                                                                                                                                                                                                                                                                 | 来源            |
-|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
-| LLaVA pretrain (LCS-558k) |          558,128          |   500,380    |   89.65%   | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer)                                        | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
-| Data-Juicer (T2V, 147k) |          1,217,346          |   147,176    |   12.09%   | [data-juicer-sandbox-optimal.yaml](data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
-| Data-Juicer (DJ, 228k) |          3,408,553          |   227,867    |   8.15%   | [data-juicer-sandbox-self-evolution.yaml](data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
-
-### 评测结果
-- LLaVA pretrain (LCS-558k): 使用**完善后的预训练数据集**预训练并使用原始的指令数据集微调后的模型在12个评测集上有10个超过了基线模型LLaVA-1.5-13B。
-
-| 模型                              | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
-|---------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| LLaVA-1.5-13B <br> (基线)         | **80.0**  | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
-| LLaVA-1.5-13B <br> (完善后的预训练数据集) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
-
-- Data-Juicer (T2V, 147k) 和 Data-Juicer (DJ, 228k): 使用**完善后的数据集**在 [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) 全面超过基线模型 [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo)。这里 T2V-Turbo 是 Data-Juicer (T2V, 147k) 的teacher模型,Data-Juicer (T2V, 147k) 是 Data-Juicer (DJ, 228k) 的teacher模型,详情请参考[沙盒实验室](../../docs/Sandbox-ZH.md)。
-
-| model                         | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
-|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
-| T2V-Turbo               | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
-| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
-| Data-Juicer (DJ, 228k)  | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |
-
-| model                         | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
-|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| T2V-Turbo               | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
-| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
-| Data-Juicer (DJ, 228k)  | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |
-
-## 视频数据集
-
-我们为用户提供了一个视频数据集处理菜谱样例以协助更好地使用视频相关的算子: [general-video-refine-example.yaml](general-video-refine-example.yaml) 。这里我们应用了三种类型的算子:
-- 仅文本:根据视频描述提高数据集质量
-- 仅视频:根据视频性质提高数据集质量
-- 文本-视频:根据文本和视频间的对齐提高数据集质量
-用户可以基于这个菜谱开始他们的视频数据集处理流程。
--
diff --git a/docs/DeveloperGuide.md b/docs/DeveloperGuide.md
index e6fa17757..91ad6ffec 100644
--- a/docs/DeveloperGuide.md
+++ b/docs/DeveloperGuide.md
@@ -1,13 +1,16 @@
 # How-to Guide for Developers
 
-- [Coding Style](#coding-style)
-- [Build your own OPs](#build-your-own-ops)
-  - [(Optional) Make your OP fusible](#optional-make-your-op-fusible)
-- [Build your own configs](#build-your-own-configs)
-  - [Fruitful config sources \& Type hints](#fruitful-config-sources--type-hints)
-  - [Hierarchical configs and helps](#hierarchical-configs-and-helps)
-
-## Coding Style
+- [1. Coding Style](#1-coding-style)
+- [2. Build Your Own OPs](#2-build-your-own-ops)
+  - [2.1 Building Illustration](#21-building-illustration)
+    - [2.1.1 Providing Basic OP Functions (alpha version)](#211-providing-basic-op-functions-alpha-version)
+    - [2.1.2 Making the OP More Usable (beta version)](#212-making-the-op-more-usable-beta-version)
+    - [2.1.3 Making OP Faster \& More Complete (stable version)](#213-making-op-faster--more-complete-stable-version)
+- [3. Build Your Own Data Recipes and Configs](#3-build-your-own-data-recipes-and-configs)
+  - [3.1 Fruitful Config Sources \& Type Hints](#31-fruitful-config-sources--type-hints)
+  - [3.2 Hierarchical Configs and Helps](#32-hierarchical-configs-and-helps)
+
+## 1. Coding Style
 
 We define our styles in `.pre-commit-config.yaml`. Before committing,
 please install `pre-commit` tool to automatically check and modify accordingly:
@@ -35,17 +38,24 @@ dependencies of pre-commit are consistent with the project configuration
 (which can be completed through `pre-commit clean` and `pre-commit install`); 
 and ② execute `pre-commit run --all-files` before push.
 
-## Build your own OPs
+## 2. Build Your Own OPs
 
-- Data-Juicer allows everybody to build their own OPs.
-- Before implementing a new OP, please refer to [Operators](Operators.md) to avoid unnecessary duplication.
+- Data-Juicer allows everybody to easily build their own OPs.
+- Before implementing a new OP, please refer to the existing [Operators Zoo](Operators.md) to avoid unnecessary duplication.
 - According to the implementation progress, OP will be categorized into 3 types of versions:
   - ![alpha](https://img.shields.io/badge/alpha-red?style=plastic) version: Only the basic OP implementations are finished.
-  - ![beta](https://img.shields.io/badge/beta-yellow?style=plastic) version: Based on the alpha version, unittests for this OP are added as well.
+  - ![beta](https://img.shields.io/badge/beta-yellow?style=plastic) version: Based on the alpha version, unit tests and a basic docstring for this OP are added as well.
  - ![stable](https://img.shields.io/badge/stable-green?style=plastic) version: Based on the beta version, OP optimizations (e.g., model management, batched processing, OP fusion, ...) are applied.
-- Assuming we want to add a new Filter operator called "TextLengthFilter" to get corpus of expected text length, we can follow these steps to build it.
 
-1. (Optional) Add a new StatsKeys in `data_juicer/utils/constant.py` to store the statistical variable of the new OP.
+- 📣📣📣 Community contributors can submit a PR for a new operator as soon as it reaches the alpha state, then work with the Data-Juicer team to gradually improve it to the beta and stable versions in subsequent PRs. We warmly welcome such contributions and will highlight them in the [acknowledgements](https://github.com/modelscope/data-juicer?tab=readme-ov-file#acknowledgement)!
+
+### 2.1 Building Illustration
+
+Assuming we want to add a new Filter operator called "TextLengthFilter" to get a corpus of the expected text length, we can follow the steps below to build it.
+
+#### 2.1.1 Providing Basic OP Functions (alpha version)
+
+1. (![alpha](https://img.shields.io/badge/alpha-red?style=plastic), Optional) If the new OP defines some statistical variables, please add the corresponding new `StatsKeys` attribute in `data_juicer/utils/constant.py` for unified management.
 
 ```python
 class StatsKeys(object):
@@ -54,8 +64,9 @@ class StatsKeys(object):
 ```
 
 2. (![alpha](https://img.shields.io/badge/alpha-red?style=plastic)) Create a new OP file `text_length_filter.py` in the corresponding `data_juicer/ops/filter/` directory as follows.
-   - It's a Filter OP, so the new OP needs to inherit from the basic `Filter` class in the `base_op.py`, and be decorated with `OPERATORS` to register itself automatically.
-   - For convenience, we can implement the core functions `compute_stats_single` and `process_single` in a single-sample way, whose input and output are a single sample dictionary. If you are very familiar with batched processing in Data-Juicer, you can also implement the batched version directly by overwriting the `compute_stats_batched` and `process_batched` functions, which will be slightly faster than single-sample version. Their input and output are a column-wise dict with multiple samples.
+   - It's a Filter OP, so the new OP needs to inherit from the basic `Filter` class in `base_op.py`, and be decorated with `@OPERATORS.register_module('text_length_filter')` to register itself automatically.
+   - For convenience, we can implement the core functions `compute_stats_single` and `process_single` in a single-sample way, whose input and output are a single sample dictionary. 
+   - [Advanced] If you are familiar with batched processing in Data-Juicer, you can also implement the batched version directly by overriding the `compute_stats_batched` and `process_batched` functions, which will be slightly faster than the single-sample version. Their input and output are column-wise dicts containing multiple samples (detailed in Section 2.1.3 below).
 
     ```python
     import sys
@@ -108,7 +119,86 @@ class StatsKeys(object):
                 return False
     ```
 
-    - (![stable](https://img.shields.io/badge/stable-green?style=plastic)) If Hugging Face models are used within an operator, you might want to leverage GPU acceleration. To achieve this, declare `_accelerator = 'cuda'` in the constructor, and ensure that `compute_stats_single/batched` and `process_single/batched` methods accept an additional positional argument `rank`.
+3. (![alpha](https://img.shields.io/badge/alpha-red?style=plastic)) After implementation, register the new OP in the `__init__.py` file in the `data_juicer/ops/filter/` directory.
+
+```python
+from . import (...,              # other OPs
+               text_length_filter)  # import this new OP module
+# other OPs
+from .text_length_filter import TextLengthFilter  # import this new OP class
+__all__ = [
+    # other Ops
+    'TextLengthFilter',  # add this new OP class to __all__
+]
+```
+
+4. (![alpha](https://img.shields.io/badge/alpha-red?style=plastic)) When an operator has package dependencies listed in `environments/science_requires.txt`, you need to add the corresponding dependency packages to the `OPS_TO_PKG` dictionary in `data_juicer/utils/auto_install_mapping.py` to support dependency installation at the operator level.
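
For illustration, the mapping entry could look like the following sketch; the neighboring entries and the exact dependency lists here are assumptions, not the actual contents of `auto_install_mapping.py`:

```python
# Hypothetical sketch of the OPS_TO_PKG mapping in
# data_juicer/utils/auto_install_mapping.py. The neighboring entries and the
# exact dependency lists are illustrative assumptions only.
OPS_TO_PKG = {
    # ... other OPs mapped to their third-party dependency packages
    'token_num_filter': ['transformers'],  # e.g., an OP that needs extra packages
    'text_length_filter': [],              # this simple OP needs no extra packages
}
```

With such a mapping, dependencies can be installed at operator granularity instead of installing every optional package up front.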
+
+5. Now you can use this new OP with custom arguments in your own config files!
+
+```yaml
+# other configs
+...
+
+# process configs
+process:
+  - text_length_filter:  # add this OP to your process list and set the parameters
+      min_len: 10
+      max_len: 1000
+```
+
+#### 2.1.2 Making the OP More Usable (beta version)
+
+6. (![beta](https://img.shields.io/badge/beta-yellow?style=plastic), strongly recommended) To enhance code robustness, verify correctness, and intuitively demonstrate how the OP works, it is best to add unit tests for newly added operators. For the `TextLengthFilter` operator above, implement a test file such as `test_text_length_filter.py` in `tests/ops/filter/`:
+
+```python
+import unittest
+from data_juicer.ops.filter.text_length_filter import TextLengthFilter
+from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
+
+class TextLengthFilterTest(DataJuicerTestCaseBase):
+
+    def test_func1(self):
+        pass
+
+    def test_func2(self):
+        pass
+
+    def test_func3(self):
+        pass
+
+if __name__ == '__main__':
+    unittest.main()
+```
+
+7. (![beta](https://img.shields.io/badge/beta-yellow?style=plastic), strongly recommended) To help other users understand and use the new OP, it is best to update the corresponding documents with its information, which involves two basic actions:
+   1. Add basic information to the docstring of the operator class and ensure it is complete and readable (including a description of the operator's basic function, input parameters, output parameters, etc.). You do not need to write it in multiple places: our `pre-commit` hooks and Sphinx build scripts automatically extract docstrings to generate the operator pool documents and API documents.
+   2. `configs/config_all.yaml`: this complete configuration file keeps a list of all operators and their parameters; it serves as a source of information for some automated features and as an important reference of available operators for users. Therefore, after adding a new operator, please also add it to the process list in this file (grouped by operator type and sorted alphabetically):
+   
+   ```yaml
+   ...
+   - stopwords_filter:                                       # filter text with stopword ratio smaller than a specific min value
+       lang: en                                                # consider stopwords in what language
+       tokenization: false                                     # whether to use model to tokenize documents
+       min_ratio: 0.3                                          # the min ratio to filter text
+       stopwords_dir: ./assets                                 # directory to store stopwords dictionaries
+       use_words_aug: false                                    # whether to augment words, especially for Chinese and Vietnamese
+       words_aug_group_sizes: [2]                              # the group size of words to augment
+       words_aug_join_char: ""                                 # the join char between words to augment
+   - text_length_filter:                                     # filter text with length out of specific range
+       min_len: 10                                             # the min length of filter range
+       max_len: 10000                                          # the max length of filter range
+   - token_num_filter:                                       # filter text with total token number out of specific range
+       hf_tokenizer: EleutherAI/pythia-6.9b-deduped            # name of used Hugging Face tokenizer
+       min_num: 10                                             # the min number of filter range
+       max_num: 10000                                          # the max number of filter range
+   ...
+   ```
+
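To make the docstring action in step 7.1 concrete, here is a hedged sketch of what such a docstring could look like, shown on a stand-in class so the snippet stays self-contained (in the real OP it would sit on the `TextLengthFilter` class inheriting `Filter`, and the exact field style expected by the Sphinx setup may differ):

```python
# Stand-in class carrying an illustrative docstring; the real class inherits
# Filter from base_op.py and implements compute_stats_single/process_single.
class TextLengthFilter:
    """Filter to keep samples whose text length is within a specified range.

    Samples whose text is shorter than `min_len` or longer than `max_len`
    are filtered out.

    :param min_len: the min text length in the filtering range
    :param max_len: the max text length in the filtering range
    """
```

Sphinx-based tooling can then pick up this description and the parameter fields without any extra hand-written documentation.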
+
+#### 2.1.3 Making OP Faster & More Complete (stable version)
+
+- (![stable](https://img.shields.io/badge/stable-green?style=plastic)) If Hugging Face models are used within an operator, you might want to leverage GPU acceleration. To achieve this, declare `_accelerator = 'cuda'` in the OP's constructor, and ensure that `compute_stats_single/batched` and `process_single/batched` methods accept an additional positional argument `rank`.
 
     ```python
     # ... (same as above)
@@ -132,7 +222,7 @@ class StatsKeys(object):
             # ... (same as above)
     ```
 
-    - (![stable](https://img.shields.io/badge/stable-green?style=plastic)) If the operator processes data in batches rather than a single sample, or you want to enable batched processing, it is necessary to declare `_batched_op = True`.
+- (![stable](https://img.shields.io/badge/stable-green?style=plastic)) If the operator processes data in batches rather than a single sample, or you want to enable batched processing, it is necessary to declare `_batched_op = True`.
      - For the original `compute_stats_single` and `process_single` functions, you can keep them as they are: Data-Juicer provides a default batched wrapper that splits each batch and calls the single-sample versions, so batched processing is still supported. Alternatively, you can implement your own batched version in a more efficient way.
     ```python
     # ... (import some other libraries)
@@ -152,7 +242,7 @@ class StatsKeys(object):
             # ... (some codes)
     ```
 
-    - (![stable](https://img.shields.io/badge/stable-green?style=plastic)) In a mapper operator, to avoid process conflicts and data coverage, we offer an interface to make a saving path for produced extra datas. The format of the saving path is `{ORIGINAL_DATAPATH}/__dj__produced_data__/{OP_NAME}/{ORIGINAL_FILENAME}__dj_hash_#{HASH_VALUE}#.{EXT}`, where the `HASH_VALUE` is hashed from the init parameters of the operator, the related parameters in each sample, the process ID, and the timestamp. For convenience, we can call `self.remove_extra_parameters(locals())` at the beginning of the initiation to get the init parameters. At the same time, we can call `self.add_parameters` to add related parameters with the produced extra datas from each sample. Take the operator which enhances the images with diffusion models as example:
+- (![stable](https://img.shields.io/badge/stable-green?style=plastic)) In a mapper operator, to avoid process conflicts and data overwriting, we offer an interface to generate a saving path for extra produced data. The format of the saving path is `{ORIGINAL_DATAPATH}/__dj__produced_data__/{OP_NAME}/{ORIGINAL_FILENAME}__dj_hash_#{HASH_VALUE}#.{EXT}`, where `HASH_VALUE` is hashed from the init parameters of the operator, the related parameters in each sample, the process ID, and the timestamp. For convenience, we can call `self.remove_extra_parameters(locals())` at the beginning of initialization to get the init parameters, and call `self.add_parameters` to add the parameters related to the extra data produced from each sample. Take as an example the operator that enhances images with diffusion models:
     ```python
     from data_juicer.utils.file_utils import transfer_filename
     # ... (import some other libraries)
@@ -199,84 +289,7 @@ class StatsKeys(object):
             # ... (some codes)
     ```
 
-3. (![alpha](https://img.shields.io/badge/alpha-red?style=plastic)) After implemention, add it to the OP dictionary in the `__init__.py` file in `data_juicer/ops/filter/` directory.
-
-```python
-from . import (...,              # other OPs
-               text_length_filter)  # import this new OP module
-# other OPs
-from text_length_filter import TextLengthFilter  # import this new OP class
-__all__ = [
-    # other Ops
-    text_length_filter,  # add this new Op to __all__
-]
-```
-
-4. (![alpha](https://img.shields.io/badge/alpha-red?style=plastic)) When an operator has package dependencies listed in `environments/science_requires.txt`, you need to add the corresponding dependency packages to the `OPS_TO_PKG` dictionary in `data_juicer/utils/auto_install_mapping.py` to support dependency installation at the operator level.
-
-5. Now you can use this new OP with custom arguments in your own config files!
-
-```yaml
-# other configs
-...
-
-# process configs
-process:
-  - text_length_filter:  # add this OP to your process list and set the parameters
-      min_len: 10
-      max_len: 1000
-```
-
-6. (![beta](https://img.shields.io/badge/beta-yellow?style=plastic) Strongly Recommend) It's better to add corresponding tests for your own OPs. For `TextLengthFilter` above, you would like to add `test_text_length_filter.py` into `tests/ops/filter/` directory as below.
-
-```python
-import unittest
-from data_juicer.ops.filter.text_length_filter import TextLengthFilter
-from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
-
-class TextLengthFilterTest(DataJuicerTestCaseBase):
-
-    def test_func1(self):
-        pass
-
-    def test_func2(self):
-        pass
-
-    def test_func3(self):
-        pass
-
-if __name__ == '__main__':
-    unittest.main()
-```
-
-7. (![stable](https://img.shields.io/badge/stable-green?style=plastic) Strongly Recommend) In order to facilitate the use of other users, we also need to update this new OP information to
-the corresponding documents, including the following docs:
-   1. `configs/config_all.yaml`: this complete config file contains a list of all OPs and their arguments, serving as an
-   important document for users to refer to all available OPs. Therefore, after adding the new OP, we need to add it to the process
-   list (grouped by the OP type and sorted in alphabetical order):
-   
-   ```yaml
-   ...
-   - stopwords_filter:                                       # filter text with stopword ratio smaller than a specific min value
-       lang: en                                                # consider stopwords in what language
-       tokenization: false                                     # whether to use model to tokenize documents
-       min_ratio: 0.3                                          # the min ratio to filter text
-       stopwords_dir: ./assets                                 # directory to store stopwords dictionaries
-       use_words_aug: false                                    # whether to augment words, especially for Chinese and Vietnamese
-       words_aug_group_sizes: [2]                              # the group size of words to augment
-       words_aug_join_char: ""                                 # the join char between words to augment
-   - text_length_filter:                                     # filter text with length out of specific range
-       min_len: 10                                             # the min length of filter range
-       max_len: 10000                                          # the max length of filter range
-   - token_num_filter:                                       # filter text with total token number out of specific range
-       hf_tokenizer: EleutherAI/pythia-6.9b-deduped            # name of used Hugging Face tokenizer
-       min_num: 10                                             # the min number of filter range
-       max_num: 10000                                          # the max number of filter range
-   ...
-   ```
-
-
-### (![stable](https://img.shields.io/badge/stable-green?style=plastic) Optional) Make your OP fusible
+(![stable](https://img.shields.io/badge/stable-green?style=plastic) Optional) **Make your OP fusible**
 
 - If the calculation process of some intermediate variables in the new OP is reused in other existing OPs, this new OP can be
 added to the fusible OPs to accelerate the whole data processing with OP fusion technology. (e.g. both the `words_num_filter`
@@ -386,10 +399,12 @@ class PerplexityFilter(Filter):
         # ... (some codes)
 ```
 
-## Build your own configs
+## 3. Build Your Own Data Recipes and Configs
 - We provide easy configuration based on [jsonargparse](https://github.com/omni-us/jsonargparse/) to reduce cost for boilerplate codes.
+- We provide fruitful examples in the [Data Recipe Gallery](RecipeGallery.md) for reference, reuse, and extension.
+- 📣📣📣 Community contributors can submit PRs to add customized data recipes to the [Data Recipe Gallery](RecipeGallery.md), promoting their dissemination, reuse, and related technical evolution. We warmly welcome such contributions and will highlight them in the [acknowledgements](https://github.com/modelscope/data-juicer?tab=readme-ov-file#acknowledgement)!
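
As a starting point, a minimal data recipe typically pairs a few global arguments with a `process` list of OPs; the concrete paths and values below are illustrative placeholders rather than defaults:

```yaml
# a minimal illustrative recipe; paths and values are placeholders
project_name: 'demo-recipe'
dataset_path: 'path/to/your/dataset.jsonl'   # input dataset
export_path: 'path/to/output/dataset.jsonl'  # where the processed data is written
np: 4                                        # number of worker processes

process:
  - text_length_filter:
      min_len: 10
      max_len: 1000
```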
 
-### Fruitful config sources & Type hints
+### 3.1 Fruitful Config Sources & Type Hints
 - A global config object can be initialized via
 ```
 # core.executor.py
@@ -411,7 +426,7 @@ extended [types](https://jsonargparse.readthedocs.io/en/stable/#type-hints)
 from jsonargparse, such as `restricted types` and `Paths` with customized
 limitations.
 
-### Hierarchical configs and helps
+### 3.2 Hierarchical Configs and Helps
 - You can use dot notation in the argument names freely to define the
 hierarchy, e.g., `maximum_line_length_filter.min`.
 More importantly, by default, we automatically register the configs from
diff --git a/docs/DeveloperGuide_ZH.md b/docs/DeveloperGuide_ZH.md
index 47d333cfc..b3d9ae804 100644
--- a/docs/DeveloperGuide_ZH.md
+++ b/docs/DeveloperGuide_ZH.md
@@ -1,14 +1,16 @@
 # 开发者指南
 
-- [开发者指南](#开发者指南)
-  - [编码规范](#编码规范)
-  - [构建自己的算子](#构建自己的算子)
-    - [(可选)使新算子可以进行算子融合](#可选使新算子可以进行算子融合)
-  - [构建自己的配置](#构建自己的配置)
-    - [丰富的配置源和类型提示](#丰富的配置源和类型提示)
-    - [层次化的配置和帮助](#层次化的配置和帮助)
-
-## 编码规范
+- [1.编码规范](#1编码规范)
+- [2.构建自己的算子](#2构建自己的算子)
+  - [2.1 构建示例](#21-构建示例)
+    - [2.1.1 提供算子基本功能(alpha版本)](#211-提供算子基本功能alpha版本)
+    - [2.1.2 使算子更可用(beta版本)](#212-使算子更可用beta版本)
+    - [2.1.3 使算子更快更完备(stable版本)](#213-使算子更快更完备stable版本)
+- [3. 构建自己的数据菜谱和配置](#3-构建自己的数据菜谱和配置)
+  - [3.1 丰富的配置源和类型提示](#31-丰富的配置源和类型提示)
+  - [3.2 层次化的配置和帮助](#32-层次化的配置和帮助)
+
+## 1.编码规范
 
 我们将编码规范定义在 `.pre-commit-config.yaml` 中。在向仓库贡献代码之前,请使用 `pre-commit` 工具对代码进行自动规范化。
 
@@ -31,17 +33,22 @@ git commit -m "<your_commit_message>"
 
 **注意**:我们在github workflow配置了pre-commit的检查。如果您的PR中该检查没通过,请在本地①确保pre-commit 的相关依赖与项目配置一致(可通过`pre-commit clean`和`pre-commit install`完成);②push前执行了`pre-commit run --all-files`.
 
-## 构建自己的算子
+## 2.构建自己的算子
 
-- Data-Juicer 支持每个人定义自己的算子。
-- 在实现新的算子之前,请参考 [Operators](Operators.md) 以避免不必要的重复。
+- Data-Juicer 支持每个人灵活、便捷地定义自己的算子。
+- 在实现新的算子之前,请参考已有 [算子池](Operators.md) 以避免不必要的重复。
 - 根据实现完整性,算子会被分类为3类:
   - ![alpha](https://img.shields.io/badge/alpha-red?style=plastic) 版本:仅实现了最基本的算子能力
-  - ![beta](https://img.shields.io/badge/beta-yellow?style=plastic) 版本:在 alpha 版本基础上为算子添加了单元测试
+  - ![beta](https://img.shields.io/badge/beta-yellow?style=plastic) 版本:在 alpha 版本基础上为算子添加了单元测试,并补充了基础的文档描述
   - ![stable](https://img.shields.io/badge/stable-green?style=plastic) 版本:在 beta 版本基础上进行了各项算子优化(如模型管理、批处理、算子融合等)
-- 假设要添加一个名为 “TextLengthFilter” 的运算符以过滤仅包含预期文本长度的样本语料,可以按照以下步骤进行构建。
+- 📣📣📣 社区贡献者在算子达到alpha状态后即可提交相应的算子PR。此后,该贡献者可与Data-Juicer团队在后续PR中合作,将其逐步完善到beta和stable版本。我们非常欢迎共建,并会高亮[致谢](https://github.com/modelscope/data-juicer?tab=readme-ov-file#acknowledgement)!
+
+### 2.1 构建示例
+
+下面以名为“TextLengthFilter”的算子(用于筛选出文本长度在预期范围内的语料样本)为例,展示相应的开发构建过程。
+
+#### 2.1.1 提供算子基本功能(alpha版本)
 
-1. (可选) 在 `data_juicer/utils/constant.py` 文件中添加一个新的StatsKeys来保存新算子的统计变量。
+1. (![alpha](https://img.shields.io/badge/alpha-red?style=plastic),可选) 如果该算子定义了某个统计变量,那么请在 `data_juicer/utils/constant.py` 文件中添加一个新的`StatsKeys`属性来统一保存管理。
 
 ```python
 class StatsKeys(object):
@@ -50,8 +57,9 @@ class StatsKeys(object):
 ```
 
 2. (![alpha](https://img.shields.io/badge/alpha-red?style=plastic)) 在 `data_juicer/ops/filter/` 目录下创建一个新的算子文件 `text_length_filter.py`,内容如下:
-    - 因为它是一个 Filter 算子,所以需要继承 `base_op.py` 中的 `Filter` 基类,并用 `OPERATORS` 修饰以实现自动注册。
-    - 为了方便实现,我们可以以单样本处理的方式实现两个核心方法 `compute_stats_single` 和 `process_single`,它们的输入输出均为单个样本的字典结构。如果你比较熟悉 Data-Juicer 中的batch化处理,你也可以通过覆写 `compute_stats_batched` 和 `process_batched` 方法直接实现它们的batch化版本,它的处理会比单样本版本稍快一些。它们的输入和输出则是按列存储的字典结构,其中包括多个样本。
+    - 因为它是一个 Filter 算子,所以需要继承 `base_op.py` 中的 `Filter` 基类,并用 `@OPERATORS.register_module('text_length_filter')` 装饰器标记,以实现自动注册。
+    - 为了方便实现,我们可以按单样本处理的方式实现两个核心方法 `compute_stats_single` 和 `process_single`,它们的输入输出均为单个样本的字典结构。
+    - 【进阶】如果你比较熟悉 Data-Juicer 中的batch化处理,你也可以通过覆写 `compute_stats_batched` 和 `process_batched` 方法直接实现它们的batch化版本,它的处理会比单样本版本稍快一些。它们的输入和输出则是按列存储的字典结构,其中包括多个样本 (详见下方 2.1.3 小节)。
 
     ```python
     import sys
@@ -104,7 +112,88 @@ class StatsKeys(object):
                 return False
     ```
 
-    - (![stable](https://img.shields.io/badge/stable-green?style=plastic)) 如果在算子中使用了 Hugging Face 模型,您可能希望利用 GPU 加速。为了实现这一点,请在构造函数中声明 `_accelerator = 'cuda'`,并确保 `compute_stats_single/batched` 和 `process_single/batched` 方法接受一个额外的位置参数 `rank`。
+
+3. (![alpha](https://img.shields.io/badge/alpha-red?style=plastic)) 实现后,在 `data_juicer/ops/filter` 目录下的 `__init__.py` 文件中注册该算子:
+
+```python
+from . import (...,              # other OPs
+               text_length_filter)  # import this new OP module
+# other OPs
+from .text_length_filter import TextLengthFilter  # import this new OP class
+__all__ = [
+    # other Ops
+    'TextLengthFilter',  # add this new OP class to __all__
+]
+```
+
+4. (![alpha](https://img.shields.io/badge/alpha-red?style=plastic)) 算子有`environments/science_requires.txt`中列举的包依赖时,需要在`data_juicer/utils/auto_install_mapping.py`里的`OPS_TO_PKG`中添加对应的依赖包,以支持算子粒度的依赖安装。
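
作为示意,`OPS_TO_PKG` 中的条目大致形如下面的草图(相邻条目与具体依赖列表均为示意性假设,并非 `auto_install_mapping.py` 的实际内容):

```python
# data_juicer/utils/auto_install_mapping.py 中 OPS_TO_PKG 映射的假设性草图;
# 相邻条目与具体依赖列表仅作示意。
OPS_TO_PKG = {
    # ... 其他算子与其第三方依赖包的映射
    'token_num_filter': ['transformers'],  # 例如:需要额外依赖包的算子
    'text_length_filter': [],              # 该简单算子无需额外依赖包
}
```

借助该映射,依赖可以按算子粒度安装,而无需预先安装全部可选依赖包。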
+
+5. 全部完成!现在您可以在自己的配置文件中使用新添加的算子:
+
+```yaml
+# other configs
+...
+
+# process configs
+process:
+  - text_length_filter:  # add this op to your process list and set the parameters
+      min_len: 10
+      max_len: 1000
+```
+
+#### 2.1.2 使算子更可用(beta版本)
+
+6. (![beta](https://img.shields.io/badge/beta-yellow?style=plastic) 强烈推荐)为了增强代码鲁棒性、验证功能正确性并直观展示用法,最好为新添加的算子编写单元测试。对于上面的 `TextLengthFilter` 算子,可在 `tests/ops/filter/` 中实现如 `test_text_length_filter.py` 的测试文件:
+
+```python
+import unittest
+from data_juicer.ops.filter.text_length_filter import TextLengthFilter
+from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
+
+
+class TextLengthFilterTest(DataJuicerTestCaseBase):
+
+    def test_func1(self):
+        pass
+
+    def test_func2(self):
+        pass
+
+    def test_func3(self):
+        pass
+        
+if __name__ == '__main__':
+    unittest.main()
+```
+
+7. (![beta](https://img.shields.io/badge/beta-yellow?style=plastic) 强烈推荐)为了方便其他用户理解和使用,最好将新增的算子信息更新到相应的文档中,具体包括如下两个基本动作:
+   1. 请在算子类的docstring中补充基础信息,确保其完整可读(包括算子基本功能描述、入参、出参等)。无需在多处重复撰写,我们的`pre-commit`和sphinx构建脚本会自动抽取docstring,形成算子池文档和API文档。
+   2. `configs/config_all.yaml`:该全集配置文件保存了所有算子及其参数的列表,既是一些自动化特性的信息来源,也是用户查阅可用算子的重要文档之一。因此,在新增算子后,请将其也添加到该文件的process列表中(按算子类型分组并按字母序排序):
+   
+   ```yaml
+   ...
+   - stopwords_filter:                                       # filter text with stopword ratio smaller than a specific min value
+       lang: en                                                # consider stopwords in what language
+       tokenization: false                                     # whether to use model to tokenize documents
+       min_ratio: 0.3                                          # the min ratio to filter text
+       stopwords_dir: ./assets                                 # directory to store stopwords dictionaries
+       use_words_aug: false                                    # whether to augment words, especially for Chinese and Vietnamese
+       words_aug_group_sizes: [2]                              # the group size of words to augment
+       words_aug_join_char: ""                                 # the join char between words to augment
+   - text_length_filter:                                     # filter text with length out of specific range
+       min_len: 10                                             # the min length of filter range
+       max_len: 10000                                          # the max length of filter range
+   - token_num_filter:                                       # filter text with total token number out of specific range
+       hf_tokenizer: EleutherAI/pythia-6.9b-deduped            # name of used Hugging Face tokenizer
+       min_num: 10                                             # the min number of filter range
+       max_num: 10000                                          # the max number of filter range
+   ...
+   ```
+
+
+#### 2.1.3 使算子更快更完备(stable版本)
+
+- (![stable](https://img.shields.io/badge/stable-green?style=plastic)) 如果在算子中使用了 Hugging Face 模型,您可能希望利用 GPU 加速。为了实现这一点,请在算子的构造函数中声明 `_accelerator = 'cuda'`,并确保 `compute_stats_single/batched` 和 `process_single/batched` 方法接受一个额外的位置参数 `rank`。
 
     ```python
     # ... (same as above)
@@ -128,7 +217,7 @@ class StatsKeys(object):
             # ... (same as above)
     ```
 
-    - (![stable](https://img.shields.io/badge/stable-green?style=plastic)) 如果算子批量处理数据,输入不是一个样本而是一个batch,或者你想在单样本实现上直接激活batch化处理,需要声明`_batched_op = True`。
+- (![stable](https://img.shields.io/badge/stable-green?style=plastic)) 如果算子批量处理数据,输入不是一个样本而是一个batch,或者你想在单样本实现上直接激活batch化处理,需要声明`_batched_op = True`。
       - 对于单样本实现中原来的 `compute_stats_single` 和 `process_single` 方法,你可以保持它们不变,Data-Juicer 会调用默认的batch化处理版本,它们会自动拆分单个样本以调用单样本版本的两个方法来支持batch化处理。你也可以自行实现更高效的batch化的版本。
     ```python
     # ... (import some other libraries)
@@ -148,7 +237,7 @@ class StatsKeys(object):
             # ... (some codes)
     ```
 
-    - (![stable](https://img.shields.io/badge/stable-green?style=plastic)) 在mapper算子中,我们提供了产生额外数据的存储路径生成接口,避免出现进程冲突和数据覆盖的情况。生成的存储路径格式为`{ORIGINAL_DATAPATH}/__dj__produced_data__/{OP_NAME}/{ORIGINAL_FILENAME}__dj_hash_#{HASH_VALUE}#.{EXT}`,其中`HASH_VALUE`是算子初始化参数、每个样本中相关参数、进程ID和时间戳的哈希值。为了方便,可以在OP类初始化开头调用`self.remove_extra_parameters(locals())`获取算子初始化参数,同时可以调用`self.add_parameters`添加每个样本与生成额外数据相关的参数。例如,利用diffusion模型对图像进行增强的算子:
+- (![stable](https://img.shields.io/badge/stable-green?style=plastic)) 在mapper算子中,我们提供了产生额外数据的存储路径生成接口,避免出现进程冲突和数据覆盖的情况。生成的存储路径格式为`{ORIGINAL_DATAPATH}/__dj__produced_data__/{OP_NAME}/{ORIGINAL_FILENAME}__dj_hash_#{HASH_VALUE}#.{EXT}`,其中`HASH_VALUE`是算子初始化参数、每个样本中相关参数、进程ID和时间戳的哈希值。为了方便,可以在OP类初始化开头调用`self.remove_extra_parameters(locals())`获取算子初始化参数,同时可以调用`self.add_parameters`添加每个样本与生成额外数据相关的参数。例如,利用diffusion模型对图像进行增强的算子:
     ```python
     # ... (import some library)
     OP_NAME = 'image_diffusion_mapper'
@@ -193,82 +282,8 @@ class StatsKeys(object):
             # ... (some codes)
     ```
 
-3. (![alpha](https://img.shields.io/badge/alpha-red?style=plastic)) 实现后,将其添加到 `data_juicer/ops/filter` 目录下 `__init__.py` 文件中的算子字典中:
-
-```python
-from . import (...,              # other OPs
-               text_length_filter)  # import this new OP module
-# other OPs
-from text_length_filter import TextLengthFilter  # import this new OP class
-__all__ = [
-    # other Ops
-    text_length_filter,  # add this new Op to __all__
-]
-```
-
-4. (![alpha](https://img.shields.io/badge/alpha-red?style=plastic)) 算子有`environments/science_requires.txt`中列举的包依赖时,需要在`data_juicer/utils/auto_install_mapping.py`里的`OPS_TO_PKG`中添加对应的依赖包,以支持算子粒度的依赖安装。
-
-5. 全部完成!现在您可以在自己的配置文件中使用新添加的算子:
-
-```yaml
-# other configs
-...
-
-# process configs
-process:
-  - text_length_filter:  # add this op to your process list and set the parameters
-      min_len: 10
-      max_len: 1000
-```
-
-6. (![beta](https://img.shields.io/badge/beta-yellow?style=plastic) 强烈推荐)最好为新添加的算子进行单元测试。对于上面的 `TextLengthFilter` 算子,建议在 `tests/ops/filter/` 中实现如 `test_text_length_filter.py` 的测试文件:
-
-```python
-import unittest
-from data_juicer.ops.filter.text_length_filter import TextLengthFilter
-from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
-
-
-class TextLengthFilterTest(DataJuicerTestCaseBase):
-
-    def test_func1(self):
-        pass
-
-    def test_func2(self):
-        pass
-
-    def test_func3(self):
-        pass
-        
-if __name__ == '__main__':
-    unittest.main()
-```
-
-7. (![stable](https://img.shields.io/badge/stable-green?style=plastic) 强烈推荐)为了方便其他用户使用,我们还需要将新增的算子信息更新到相应的文档中,具体包括如下文档:
-   1. `configs/config_all.yaml`:该全集配置文件保存了所有算子及参数的一个列表,作为用户参考可用算子的一个重要文档。因此,在新增算子后,需要将其添加到该文档process列表里(按算子类型分组并按字母序排序):
-   
-   ```yaml
-   ...
-   - stopwords_filter:                                       # filter text with stopword ratio smaller than a specific min value
-       lang: en                                                # consider stopwords in what language
-       tokenization: false                                     # whether to use model to tokenize documents
-       min_ratio: 0.3                                          # the min ratio to filter text
-       stopwords_dir: ./assets                                 # directory to store stopwords dictionaries
-       use_words_aug: false                                    # whether to augment words, especially for Chinese and Vietnamese
-       words_aug_group_sizes: [2]                              # the group size of words to augment
-       words_aug_join_char: ""                                 # the join char between words to augment
-   - text_length_filter:                                     # filter text with length out of specific range
-       min_len: 10                                             # the min length of filter range
-       max_len: 10000                                          # the max length of filter range
-   - token_num_filter:                                       # filter text with total token number out of specific range
-       hf_tokenizer: EleutherAI/pythia-6.9b-deduped            # name of used Hugging Face tokenizer
-       min_num: 10                                             # the min number of filter range
-       max_num: 10000                                          # the max number of filter range
-   ...
-   ```
-
 
-### (![stable](https://img.shields.io/badge/stable-green?style=plastic) Optional) Make your new OP fusible
+(![stable](https://img.shields.io/badge/stable-green?style=plastic) Optional) **Make your new OP fusible**
 
 - If some intermediate variables in the new OP are computed in the same way as in existing OPs, the new OP can be added to the fusible OPs, so that OP fusion can be leveraged to speed up data processing. (e.g. both `words_num_filter` and `word_repetition_filter` need to tokenize the input text)
 - When OP fusion is enabled, these duplicated computation processes and intermediate variables can be shared through the `context` between OPs, thus reducing repeated computation.
@@ -369,11 +384,13 @@ class PerplexityFilter(Filter):
 
 - At this point, once OP fusion is enabled, this OP can be automatically fused with other OPs and share common intermediate variables, reducing repeated computation and speeding up the overall data processing.
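The idea of sharing intermediate variables through the `context` can be sketched as follows. This is a minimal, self-contained illustration, not the actual Data-Juicer implementation; the function and field names here are hypothetical. Both filters need the tokenized words of a sample, so whichever runs first caches them in the shared `context`, and the second reuses the cached result:

```python
from typing import Dict, List

# Count tokenizer invocations to show the expensive work happens only once.
calls = {"tokenize": 0}


def tokenize(text: str) -> List[str]:
    # Stand-in for a real (expensive) tokenizer.
    calls["tokenize"] += 1
    return text.split()


def get_words(sample: Dict, context: Dict) -> List[str]:
    # Compute the intermediate variable only if it is not cached yet.
    if "words" not in context:
        context["words"] = tokenize(sample["text"])
    return context["words"]


def words_num_filter(sample: Dict, context: Dict, min_num: int = 3) -> bool:
    return len(get_words(sample, context)) >= min_num


def word_repetition_filter(sample: Dict, context: Dict, max_ratio: float = 0.5) -> bool:
    words = get_words(sample, context)  # reuses the cached tokenization
    top_freq = max(words.count(w) for w in set(words))
    return top_freq / len(words) <= max_ratio


sample = {"text": "the quick brown fox jumps over the lazy dog"}
context = {}
keep = words_num_filter(sample, context) and word_repetition_filter(sample, context)
print(keep, calls["tokenize"])  # -> True 1
```

With explicit caching like this, the tokenization runs once per sample no matter how many fused OPs consume its result.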
 
-## Build your own configs
+## 3. Build Your Own Data Recipes and Configs
 
 - We provide easy configuration based on [jsonargparse](https://github.com/omni-us/jsonargparse/) to reduce the cost of boilerplate code.
+- We provide plenty of example recipes for reference, reuse, and extension in the [Data Recipe Gallery](../docs/RecipeGallery_ZH.md).
+- 📣📣📣 Community contributors can submit PRs to add customized data recipes to the *Data Recipe Gallery*, promoting dissemination, reuse, and related technology evolution. We warmly welcome co-construction and will highlight your contribution in the [acknowledgements](https://github.com/modelscope/data-juicer?tab=readme-ov-file#acknowledgement)!
 
-### Rich Config Sources and Type Hints
+### 3.1 Rich Config Sources and Type Hints
 
 - The global config object can be initialized via
 
@@ -393,7 +410,7 @@ self.cfg = init_configs()
 In addition, many argument types and corresponding validations are supported,
 including Python built-in types, types from [Lib/typing](https://docs.python.org/3/library/typing.html), and [extended types](https://jsonargparse.readthedocs.io/en/stable/#type-hints) from jsonargparse, such as `restricted types` and `Paths` with customized limitations.
 
-### Hierarchical Configs and Help
+### 3.2 Hierarchical Configs and Help
 
 - You can freely use dot notation in argument names to define the hierarchy, e.g. `maximum_line_length_filter.min`.
 More importantly, by default we automatically register the docstrings of the implemented OPs, so all structured configs always stay in sync with the code.
diff --git a/docs/Distributed.md b/docs/Distributed.md
index 21314a6f5..07abdfcf4 100644
--- a/docs/Distributed.md
+++ b/docs/Distributed.md
@@ -10,8 +10,7 @@ For reference, in our experiments with 25 to 100 Alibaba Cloud nodes, Data-Juice
 
 More details can be found in our paper, [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](arXiv_link_coming_soon).
 
-![Arch-Overview](
-https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps-4034-4146.png)
+<img src="https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps-4034-4146.png" align="center" width="600" />
 
 ## Implementation and Optimizations
 
@@ -24,7 +23,7 @@ https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps
 
 When dealing with tens of thousands of nodes but only a few dataset files, Ray would split the dataset files according to the available resources and distribute the blocks across all nodes, incurring huge network communication costs and reducing CPU utilization. For more details, see [Ray's autodetect_parallelism](https://github.com/ray-project/ray/blob/2dbd08a46f7f08ea614d8dd20fd0bca5682a3078/python/ray/data/_internal/util.py#L201-L205) and [tuning output blocks for Ray](https://docs.ray.io/en/latest/data/performance-tips.html#tuning-output-blocks-for-read).
 
-This default execution plan can be quite inefficient, especially in scenarios with a large number of nodes. To optimize performance in such cases, we automatically split the original dataset into smaller files in advance, taking into account the characteristics of Ray and Arrow. When users encounter such performance issues, they can use this feature or split the dataset according to their own preferences. In our auto-split strategy, the single file size is set to 128MB, and the number of sub-files after splitting should be at least twice the total number of CPU cores available in the cluster.
+This default execution plan can be quite inefficient, especially in scenarios with a large number of nodes. To optimize performance in such cases, we automatically split the original dataset into smaller files in advance, taking into account the characteristics of Ray and Arrow. When users encounter such performance issues, they can use this feature or split the dataset according to their own preferences. In our auto-split strategy, the single file size is set to 128MB, and the number of sub-files after splitting should be at least twice the total number of CPU cores available in the cluster. The corresponding tool is available at `tools/data_resplit.py`.
 
 ### Streaming Reading of JSON Files
 
diff --git a/docs/Distributed_ZH.md b/docs/Distributed_ZH.md
index c1c6c24ce..25535796f 100644
--- a/docs/Distributed_ZH.md
+++ b/docs/Distributed_ZH.md
@@ -10,8 +10,7 @@ Data-Juicer 支持基于 [Ray](https://github.com/ray-project/ray) 和阿里巴
 
 More details can be found in our paper: [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](arXiv_link_coming_soon).
 
-![Arch-Overview](
-https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps-4034-4146.png)
+<img src="https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps-4034-4146.png" align="center" width="600" />
 
 ## Implementation and Optimizations
 
@@ -24,7 +23,7 @@ https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps
 
 When processing a dataset that consists of only a few files on tens of thousands of nodes, Ray would split the dataset files according to the available resources and distribute them across all nodes, which may incur huge network communication costs and reduce CPU utilization. For more details, see [Ray's autodetect_parallelism](https://github.com/ray-project/ray/blob/2dbd08a46f7f08ea614d8dd20fd0bca5682a3078/python/ray/data/_internal/util.py#L201-L205) and [tuning output blocks for Ray](https://docs.ray.io/en/latest/data/performance-tips.html#tuning-output-blocks-for-read).
 
-This default execution plan can be quite inefficient, especially in scenarios with a large number of nodes. To optimize performance in such cases, we automatically split the original dataset into smaller files in advance, taking into account the characteristics of Ray and Arrow. When users encounter such performance issues, they can use this feature or split the dataset according to their own preferences. In our auto-split strategy, the single file size is set to 128MB, and the number of sub-files after splitting should be at least twice the total number of CPU cores available in the cluster.
+This default execution plan can be quite inefficient, especially in scenarios with a large number of nodes. To optimize performance in such cases, we automatically split the original dataset into smaller files in advance, taking into account the characteristics of Ray and Arrow. When users encounter such performance issues, they can use this feature or split the dataset according to their own preferences. In our auto-split strategy, the single file size is set to 128MB, and the number of sub-files after splitting should be at least twice the total number of CPU cores available in the cluster. The corresponding tool is available at `tools/data_resplit.py`.
 
 
 ### Streaming Reading of JSON Files
diff --git a/docs/RecipeGallery.md b/docs/RecipeGallery.md
new file mode 100644
index 000000000..290201ffe
--- /dev/null
+++ b/docs/RecipeGallery.md
@@ -0,0 +1,128 @@
+# Data Recipe Gallery
+
+- The recipe [folder](../configs) contains a rich set of sample configuration files for Data-Juicer data recipes, helping users easily understand, reuse, and extend the configurations in various functional scenarios.
+- 📣📣📣 Community contributors can submit PRs to add customized data recipes to promote dissemination, reuse and related technology evolution. We welcome co-construction and will highlight [acknowledgements](https://github.com/modelscope/data-juicer?tab=readme-ov-file#acknowledgement)!
+
+Table of Contents
+- [1. Data-Juicer Minimal Example Recipe](#1-data-juicer-minimal-example-recipe)
+- [2. Reproduce Open Source Text Datasets](#2-reproduce-open-source-text-datasets)
+- [3. Improved Open Source Text Pre-training Datasets](#3-improved-open-source-text-pre-training-datasets)
+- [4. Improved Open Source Text Post-Processing Dataset](#4-improved-open-source-text-post-processing-dataset)
+- [5. Synthetic Contrastive Learning Image and Text Datasets](#5-synthetic-contrastive-learning-image-and-text-datasets)
+- [6. Improved Open Source Image and Text Datasets](#6-improved-open-source-image-and-text-datasets)
+  - [6.1. Evaluation and Verification](#61-evaluation-and-verification)
+- [7. Basic Example Recipes for Video Data](#7-basic-example-recipes-for-video-data)
+- [8. Synthesize a Human-Centric Video Review Set](#8-synthesize-a-human-centric-video-review-set)
+- [9. Improve Existing Open Source Video Datasets](#9-improve-existing-open-source-video-datasets)
+  - [9.1. Evaluation and Verification](#91-evaluation-and-verification)
+
+
+## 1. Data-Juicer Minimal Example Recipe
+Some basic configuration files are placed in the [Demo](../configs/demo/) folder to help users quickly familiarize themselves with the basic functions of Data-Juicer. Please refer to the folder for a detailed description.
+
+## 2. Reproduce Open Source Text Datasets
+- We reproduced the processing flow of part of the Redpajama dataset. Please refer to the [reproduced_redpajama](../configs/reproduced_redpajama) folder for a detailed description.
+- We reproduced the processing flow of part of the BLOOM dataset. Please refer to the [reproduced_bloom](../configs/reproduced_bloom) folder for a detailed description.
+
+## 3. Improved Open Source Text Pre-training Datasets
+
+We found that there are still some "bad" data samples in existing processed datasets (such as Redpajama, The Pile, etc.), so we use Data-Juicer to refine these datasets and feed the refined data to LLMs for better performance.
+
+We use a simple 3-σ rule to set the hyperparameters of the operators in each data processing recipe.
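The 3-σ rule can be sketched as follows. This is a toy illustration with made-up numbers, not the actual recipe-tuning code: compute the mean μ and standard deviation σ of a per-sample statistic over the dataset, keep samples within [μ − 3σ, μ + 3σ], and use the bounds as the operator's hyperparameters (e.g. `min_len`/`max_len` of `text_length_filter`):

```python
from statistics import mean, stdev

# Toy per-sample text lengths; in practice these come from analyzing the real dataset.
lengths = [150 + 5 * i for i in range(30)] + [20000]  # 30 normal samples + 1 outlier

mu, sigma = mean(lengths), stdev(lengths)
# The 3-sigma bounds become the operator's hyperparameters,
# e.g. text_length_filter's min_len / max_len.
min_len = max(0, mu - 3 * sigma)
max_len = mu + 3 * sigma
kept = [x for x in lengths if min_len <= x <= max_len]
```

With this rule, extreme samples (like the length-20000 outlier above) fall outside the bounds and are filtered out, while the bulk of the distribution is retained.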
+
+| Data subset | Number of samples before refinement | Number of samples after refinement | Sample retention rate | Config link | Data link | Source |
+|----------------------|:---------------------------:|:--------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------|
+| arXiv                |          1,724,497          |   1,655,259    |   95.99%   | [redpajama-arxiv-refine.yaml](../configs/data_juicer_recipes/redpajama-arxiv-refine.yaml)                                                                                                                                                                         | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-arxiv-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-arxiv-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer)                                        | Redpajama               |
+| Books                |           205,182           |    195,983     |   95.51%   | [redpajama-book-refine.yaml](../configs/data_juicer_recipes/redpajama-book-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-book-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-book-refined-by-data-juicer/summary)   <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-book-refined-by-data-juicer)                                        | Redpajama               |
+| Wikipedia            |         29,834,171          |   26,990,659   |   90.47%   | [redpajama-wiki-refine.yaml](../configs/data_juicer_recipes/redpajama-wiki-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-wiki-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-wiki-refined-by-data-juicer/summary)   <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-wiki-refined-by-data-juicer)                                        | Redpajama               |
+| C4                   |         364,868,892         |  344,491,171   |   94.42%   | [redpajama-c4-refine.yaml](../configs/data_juicer_recipes/redpajama-c4-refine.yaml)                                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-c4-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-c4-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer)                                             | Redpajama               |
+| Common Crawl 2019-30 |         81,085,420          |   36,557,283   |   45.08%   | [redpajama-cc-2019-30-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2019-30-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2019-30-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2019-30-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer)  | Redpajama               |
+| Common Crawl 2020-05 |         90,850,492          |   42,612,596   |   46.90%   | [redpajama-cc-2020-05-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2020-05-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2020-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2020-05-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2020-05-refined-by-data-juicer)  | Redpajama               |
+| Common Crawl 2021-04 |         98,878,523          |   44,724,752   |   45.23%   | [redpajama-cc-2021-04-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2021-04-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2021-04-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2021-04-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer)  | Redpajama               |
+| Common Crawl 2022-05 |         94,058,868          |   42,648,496   |   45.34%   | [redpajama-cc-2022-05-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2022-05-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2022-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2022-05-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer)  | Redpajama               |
+| Common Crawl 2023-06 |         111,402,716         |   50,643,699   |   45.46%   | [redpajama-cc-2023-06-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2023-06-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2023-06-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2023-06-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2023-06-refined-by-data-juicer) | Redpajama               |
+| Github Code          | 73,208,524 <br>+ 21,387,703 |   49,279,344   |   52.09%   | [redpajama-code-refine.yaml](../configs/data_juicer_recipes/github_code/redpajama-code-refine.yaml)<br>[stack-code-refine.yaml](../configs/data_juicer_recipes/github_code/stack-code-refine.yaml)<br>[redpajama-stack-code-deduplicate.yaml](../configs/data_juicer_recipes/github_code/redpajama-stack-code-deduplicate.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-stack-code-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-stack-code-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-stack-code-refined-by-data-juicer)                             | Redpajama<br>The Stack  |
+| StackExchange        |         45,447,328          |   26,309,203   |   57.89%   | [redpajama-pile-stackexchange-refine.yaml](../configs/data_juicer_recipes/redpajama-pile-stackexchange-refine.yaml)                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-pile-stackexchange-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-pile-stackexchange-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-pile-stackexchange-refined-by-data-juicer)             | Redpajama<br>The Pile   |
+| EuroParl             |           69,814            |     61,601     |   88.23%   | [pile-europarl-refine.yaml](../configs/data_juicer_recipes/pile-europarl-refine.yaml)                                                                                                                                                                             | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-europarl-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-europarl-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-europarl-refined-by-data-juicer)                                   | The Pile                |
+| FreeLaw              |          3,562,015          |   2,942,612    |   82.61%   | [pile-freelaw-refine.yaml](../configs/data_juicer_recipes/pile-freelaw-refine.yaml)                                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-freelaw-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-freelaw-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-freelaw-refined-by-data-juicer)                                     | The Pile                |
+| HackerNews           |           373,027           |    371,331     |   99.55%   | [pile-hackernews-refine.yaml](../configs/data_juicer_recipes/pile-hackernews-refine.yaml)                                                                                                                                                                         | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hackernews-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-hackernews-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-hackernews-refined-by-data-juicer)                               | The Pile                |
+| NIH ExPorter         |           939,661           |    858,492     |   91.36%   | [pile-nih-refine.yaml](../configs/data_juicer_recipes/pile-nih-refine.yaml)                                                                                                                                                                                       | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hin-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-nih-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-nih-refined-by-data-juicer)                                             | The Pile                |
+| PhilPapers           |           32,782            |     29,117     |   88.82%   | [pile-philpaper-refine.yaml](../configs/data_juicer_recipes/pile-philpaper-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-philpaper-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-philpaper-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-philpaper-refined-by-data-juicer)                                 | The Pile                |
+| PubMed Abstracts     |         15,518,009          |   15,009,325   |   96.72%   | [pile-pubmed-abstract-refine.yaml](../configs/data_juicer_recipes/pile-pubmed-abstract-refine.yaml)                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-abstract-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-abstracts-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-abstracts-refined-by-data-juicer)                    | The Pile                |
+| PubMed Central       |          3,098,930          |   2,694,860    |   86.96%   | [pile-pubmed-central-refine.yaml](../configs/data_juicer_recipes/pile-pubmed-central-refine.yaml)                                                                                                                                                                 | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-central-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-central-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-central-refined-by-data-juicer)                        | The Pile                |
+| USPTO                |          5,883,024          |   4,516,283    |   76.77%   | [pile-uspto-refine.yaml](../configs/data_juicer_recipes/pile-uspto-refine.yaml)                                                                                                                                                                                   | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-uspto-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-uspto-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-uspto-refined-by-data-juicer) | The Pile                |
+
+
+## 4. Improved Open Source Text Post-Processing Dataset
+Take the Alpaca-CoT dataset as an example:
+
+| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
+|-------------------|:------------------------:|:----------------------------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|
+| Alpaca-Cot EN     |       136,219,879        | 72,855,345 |   54.48%   | [alpaca-cot-en-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml)                                                                                                                                                                   | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer)   | [39 subsets from Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README.md) |
+| Alpaca-Cot ZH     |        21,197,246        |             9,873,214              |  46.58%   | [alpaca-cot-zh-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml)                                                                                                                                                                   | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer)   | [28 subsets from Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README.md) |
+
+## 5. Synthetic Contrastive Learning Image and Text Datasets
+Data-Juicer provides rich built-in operators to support image-text multimodal data synthesis, such as the Img-Diff dataset, whose synthetic data brings a 12-point performance improvement on the MMVP benchmark. For more details, see the Img-Diff [paper](https://arxiv.org/abs/2408.04594); the corresponding recipe implementation can be found in [ImgDiff-Dev](https://github.com/modelscope/data-juicer/tree/ImgDiff).
+
+## 6. Improved Open Source Image and Text Datasets
+
+| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
+|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
+| LLaVA pretrain (LCS-558k) |          558,128          |   500,380    |   89.65%   | [llava-pretrain-refine.yaml](../configs/data_juicer_recipes/llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer)                                        | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
+| Data-Juicer (T2V, 147k) |          1,217,346          |   147,176    |   12.09%   | [data-juicer-sandbox-optimal.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
+| Data-Juicer (DJ, 228k) |          3,408,553          |   227,867    |   8.15%   | [data-juicer-sandbox-self-evolution.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
+
+### 6.1. Evaluation and Verification
+- LLaVA pretrain (LCS-558k): The model pre-trained with **the improved pre-training dataset** and fine-tuned with the original instruction dataset outperformed the baseline model LLaVA-1.5-13B on 10 of the 12 evaluation sets.
+
+| Models | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
+|---------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| LLaVA-1.5-13B <br> (Baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
+| LLaVA-1.5-13B <br> (Rectified Pretraining Dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
+
+- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models trained on the **refined datasets** outperform the baseline model [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). Here T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). For details, please refer to the [Sandbox Laboratory](./Sandbox.md).
+
+
+| model                         | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
+|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
+| T2V-Turbo               | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
+| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
+| Data-Juicer (DJ, 228k)  | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |
+
+| model                         | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
+|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| T2V-Turbo               | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
+| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
+| Data-Juicer (DJ, 228k)  | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |
+
+## 7. Basic Example Recipes for Video Data
+We provide a sample recipe for processing video datasets to help users make better use of video-related operators: [general-video-refine-example.yaml](../configs/data_juicer_recipes/general-video-refine-example.yaml). It applies three types of operators:
+- Text-only: improve dataset quality based on the video descriptions
+- Video-only: improve dataset quality based on video properties
+- Text-Video: improve dataset quality based on the alignment between text and video
+
+Users can start their own video dataset processing workflows from this recipe.
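The three operator types above can be combined in one recipe file. Below is a minimal, illustrative YAML sketch of such a recipe; the operator names follow Data-Juicer's naming conventions, but the specific operators, paths, and thresholds here are assumptions for illustration only — see the linked recipe for the actual configuration.

```yaml
# Illustrative video-refinement recipe sketch (paths and thresholds are hypothetical)
dataset_path: ./videos/input.jsonl
export_path: ./videos/output.jsonl

process:
  # Text-only: filter samples by the language of the video description
  - language_id_score_filter:
      lang: en
      min_score: 0.8
  # Video-only: keep videos within a reasonable duration range (seconds)
  - video_duration_filter:
      min_duration: 2
      max_duration: 60
  # Text-Video: keep samples whose frames align well with the description
  - video_frames_text_similarity_filter:
      min_score: 0.25
```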
+
+## 8. Synthesize a human-centric video review set
+Data-Juicer can also support video review set synthesis, such as [HumanVBench](https://arxiv.org/abs/2412.17574), which converts in-the-wild videos into human-centric video benchmarks. The corresponding data recipes and construction process can be found in [HumanVBench-dev](https://github.com/modelscope/data-juicer/tree/HumanVBench).
+
+## 9. Improve existing open source video datasets
+
+| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
+|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
+| Data-Juicer (T2V, 147k) |          1,217,346          |   147,176    |   12.09%   | [data-juicer-sandbox-optimal.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
+| Data-Juicer (DJ, 228k) |          3,408,553          |   227,867    |   8.15%   | [data-juicer-sandbox-self-evolution.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
+
+### 9.1. Evaluation and Verification
+- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): Trained on the **refined datasets**, they comprehensively surpass the baseline model [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). Here, T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). For details, please refer to [Sandbox Laboratory](./Sandbox.md).
+
+| model                         | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
+|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
+| T2V-Turbo               | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
+| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
+| Data-Juicer (DJ, 228k)  | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |
+
+| model                         | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
+|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| T2V-Turbo               | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
+| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
+| Data-Juicer (DJ, 228k)  | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |
diff --git a/docs/RecipeGallery_ZH.md b/docs/RecipeGallery_ZH.md
new file mode 100644
index 000000000..1f2868b48
--- /dev/null
+++ b/docs/RecipeGallery_ZH.md
@@ -0,0 +1,127 @@
+# Data Recipe Gallery
+
+- The recipe [folder](../configs) contains abundant example recipe files for Data-Juicer, helping users easily understand, reuse, and extend configurations for various scenarios.
+- 📣📣📣 Community contributors can submit PRs to add their own data recipes, promoting sharing, reuse, and the evolution of related techniques. Contributions are warmly welcomed and will be highlighted in the [acknowledgements](https://github.com/modelscope/data-juicer?tab=readme-ov-file#acknowledgement)!
+
+Table of Contents
+- [1. Data-Juicer Minimal Example Recipes](#1-data-juicer-minimal-example-recipes)
+- [2. Reproduce Open Source Text Datasets](#2-reproduce-open-source-text-datasets)
+- [3. Improve Open Source Pre-training Text Datasets](#3-improve-open-source-pre-training-text-datasets)
+- [4. Improve Open Source Post-tuning Text Datasets](#4-improve-open-source-post-tuning-text-datasets)
+- [5. Synthesize Contrastive Learning Image-Text Datasets](#5-synthesize-contrastive-learning-image-text-datasets)
+- [6. Improve Open Source Image-Text Datasets](#6-improve-open-source-image-text-datasets)
+  - [6.1. Evaluation and Verification](#61-evaluation-and-verification)
+- [7. Basic Example Recipes for Video Data](#7-basic-example-recipes-for-video-data)
+- [8. Synthesize a human-centric video review set](#8-synthesize-a-human-centric-video-review-set)
+- [9. Improve existing open source video datasets](#9-improve-existing-open-source-video-datasets)
+  - [9.1. Evaluation and Verification](#91-evaluation-and-verification)
+
+## 1. Data-Juicer Minimal Example Recipes
+The [demo](../configs/demo/) folder contains some basic configuration files that help users quickly get familiar with the basic functions of Data-Juicer; please refer to it for detailed descriptions.
+
+## 2. Reproduce Open Source Text Datasets
+- We reproduced the processing flow of part of the Redpajama dataset; please refer to the [reproduced_redpajama](../configs/reproduced_redpajama) folder for detailed descriptions.
+- We reproduced the processing flow of part of the BLOOM dataset; please refer to the [reproduced_bloom](../configs/reproduced_bloom) folder for detailed descriptions.
+
+## 3. Improve Open Source Pre-training Text Datasets
+
+We found that some "dirty" data samples still remain in existing processed datasets (such as Redpajama and The Pile). So we used Data-Juicer to refine these datasets and tried feeding them to LLMs for better performance.
+
+We applied the simple 3-σ rule to set the hyperparameters of the operators in each data-processing recipe.
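The 3-σ rule mentioned above works as follows: for each per-sample statistic an operator filters on (e.g. text length), compute its mean and standard deviation over the dataset, then set the operator's thresholds to mean ± 3σ so that only extreme outliers are dropped. A minimal sketch (the sample values and the operator they would feed are hypothetical):

```python
import statistics

def three_sigma_bounds(values):
    """Return (lower, upper) = mean -/+ 3 * std for a list of per-sample stats."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std

# e.g. hypothetical per-sample character counts from an analysis pass
text_lengths = [120, 150, 90, 3000, 130, 110, 95, 140, 105, 125]
low, high = three_sigma_bounds(text_lengths)
# low/high (clamped to valid ranges) would become the min/max thresholds
# of a length-based filter operator in the recipe
```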
+
+| Data subset | Number of samples before refinement | Number of samples after refinement | Sample retention rate | Configuration link | Data link | Source |
+|----------------------|:---------------------------:|:--------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------|
+| arXiv                |          1,724,497          |   1,655,259    |   95.99%   | [redpajama-arxiv-refine.yaml](../configs/data_juicer_recipes/redpajama-arxiv-refine.yaml)                                                                                                                                                                         | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-arxiv-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-arxiv-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer)                                        | Redpajama               |
+| Books                |           205,182           |    195,983     |   95.51%   | [redpajama-book-refine.yaml](../configs/data_juicer_recipes/redpajama-book-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-book-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-book-refined-by-data-juicer/summary)   <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-book-refined-by-data-juicer)                                        | Redpajama               |
+| Wikipedia            |         29,834,171          |   26,990,659   |   90.47%   | [redpajama-wiki-refine.yaml](../configs/data_juicer_recipes/redpajama-wiki-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-wiki-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-wiki-refined-by-data-juicer/summary)   <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-wiki-refined-by-data-juicer)                                        | Redpajama               |
+| C4                   |         364,868,892         |  344,491,171   |   94.42%   | [redpajama-c4-refine.yaml](../configs/data_juicer_recipes/redpajama-c4-refine.yaml)                                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-c4-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-c4-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer)                                             | Redpajama               |
+| Common Crawl 2019-30 |         81,085,420          |   36,557,283   |   45.08%   | [redpajama-cc-2019-30-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2019-30-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2019-30-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2019-30-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer)  | Redpajama               |
+| Common Crawl 2020-05 |         90,850,492          |   42,612,596   |   46.90%   | [redpajama-cc-2020-05-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2020-05-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2020-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2020-05-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2020-05-refined-by-data-juicer)  | Redpajama               |
+| Common Crawl 2021-04 |         98,878,523          |   44,724,752   |   45.23%   | [redpajama-cc-2021-04-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2021-04-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2021-04-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2021-04-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer)  | Redpajama               |
+| Common Crawl 2022-05 |         94,058,868          |   42,648,496   |   45.34%   | [redpajama-cc-2022-05-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2022-05-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2022-05-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2022-05-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer)  | Redpajama               |
+| Common Crawl 2023-06 |         111,402,716         |   50,643,699   |   45.46%   | [redpajama-cc-2023-06-refine.yaml](../configs/data_juicer_recipes/redpajama-cc-2023-06-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2023-06-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-cc-2023-06-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-cc-2023-06-refined-by-data-juicer) | Redpajama               |
+| Github Code          | 73,208,524 <br>+ 21,387,703 |   49,279,344   |   52.09%   | [redpajama-code-refine.yaml](../configs/data_juicer_recipes/github_code/redpajama-code-refine.yaml)<br>[stack-code-refine.yaml](../configs/data_juicer_recipes/github_code/stack-code-refine.yaml)<br>[redpajama-stack-code-deduplicate.yaml](../configs/data_juicer_recipes/github_code/redpajama-stack-code-deduplicate.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-stack-code-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-stack-code-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-stack-code-refined-by-data-juicer)                             | Redpajama<br>The Stack  |
+| StackExchange        |         45,447,328          |   26,309,203   |   57.89%   | [redpajama-pile-stackexchange-refine.yaml](../configs/data_juicer_recipes/redpajama-pile-stackexchange-refine.yaml)                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-pile-stackexchange-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/redpajama-pile-stackexchange-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/redpajama-pile-stackexchange-refined-by-data-juicer)             | Redpajama<br>The Pile   |
+| EuroParl             |           69,814            |     61,601     |   88.23%   | [pile-europarl-refine.yaml](../configs/data_juicer_recipes/pile-europarl-refine.yaml)                                                                                                                                                                             | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-europarl-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-europarl-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-europarl-refined-by-data-juicer)                                   | The Pile                |
+| FreeLaw              |          3,562,015          |   2,942,612    |   82.61%   | [pile-freelaw-refine.yaml](../configs/data_juicer_recipes/pile-freelaw-refine.yaml)                                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-freelaw-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-freelaw-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-freelaw-refined-by-data-juicer)                                     | The Pile                |
+| HackerNews           |           373,027           |    371,331     |   99.55%   | [pile-hackernews-refine.yaml](../configs/data_juicer_recipes/pile-hackernews-refine.yaml)                                                                                                                                                                         | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hackernews-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-hackernews-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-hackernews-refined-by-data-juicer)                               | The Pile                |
+| NIH ExPorter         |           939,661           |    858,492     |   91.36%   | [pile-nih-refine.yaml](../configs/data_juicer_recipes/pile-nih-refine.yaml)                                                                                                                                                                                       | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-hin-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-nih-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-nih-refined-by-data-juicer)                                             | The Pile                |
+| PhilPapers           |           32,782            |     29,117     |   88.82%   | [pile-philpaper-refine.yaml](../configs/data_juicer_recipes/pile-philpaper-refine.yaml)                                                                                                                                                                           | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-philpaper-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-philpaper-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-philpaper-refined-by-data-juicer)                                 | The Pile                |
+| PubMed Abstracts     |         15,518,009          |   15,009,325   |   96.72%   | [pile-pubmed-abstract-refine.yaml](../configs/data_juicer_recipes/pile-pubmed-abstract-refine.yaml)                                                                                                                                                               | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-abstract-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-abstracts-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-abstracts-refined-by-data-juicer)                    | The Pile                |
+| PubMed Central       |          3,098,930          |   2,694,860    |   86.96%   | [pile-pubmed-central-refine.yaml](../configs/data_juicer_recipes/pile-pubmed-central-refine.yaml)                                                                                                                                                                 | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-pubmed-central-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-pubmed-central-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-pubmed-central-refined-by-data-juicer)                        | The Pile                |
+| USPTO                |          5,883,024          |   4,516,283    |   76.77%   | [pile-uspto-refine.yaml](../configs/data_juicer_recipes/pile-uspto-refine.yaml)                                                                                                                                                                                   | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-uspto-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-uspto-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-uspto-refined-by-data-juicer) | The Pile                |
+
+
+## 4. Improve Open Source Post-tuning Text Datasets
+Take the Alpaca-CoT dataset as an example:
+
+| Data subset | Number of samples before refinement | Number of samples after refinement | Sample retention rate | Configuration link | Data link | Source |
+|-------------------|:------------------------:|:----------------------------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|
+| Alpaca-Cot EN     |       136,219,879        | 72,855,345 |   54.48%   | [alpaca-cot-en-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml)                                                                                                                                                                   | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer)   | [39 subsets of Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
+| Alpaca-Cot ZH     |        21,197,246        |             9,873,214              |  46.58%   | [alpaca-cot-zh-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml)                                                                                                                                                                   | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer)   | [28 subsets of Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
+
+## 5. Synthesize Contrastive Learning Image-Text Datasets
+Data-Juicer provides rich built-in operators to support image-text multimodal data synthesis, such as the Img-Diff dataset. This synthetic data brought a 12-point model improvement on the MMVP benchmark. For more details, please refer to the Img-Diff [paper](https://arxiv.org/abs/2408.04594); the corresponding recipe implementation is available at [ImgDiff-Dev](https://github.com/modelscope/data-juicer/tree/ImgDiff).
+
+
+## 6. Improve Open Source Image-Text Datasets
+
+| Data subset | Number of samples before refinement | Number of samples after refinement | Sample retention rate | Configuration link | Data link | Source |
+|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
+| LLaVA pretrain (LCS-558k) |          558,128          |   500,380    |   89.65%   | [llava-pretrain-refine.yaml](../configs/data_juicer_recipes/llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer)                                        | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
+| Data-Juicer (T2V, 147k) |          1,217,346          |   147,176    |   12.09%   | [data-juicer-sandbox-optimal.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
+| Data-Juicer (DJ, 228k) |          3,408,553          |   227,867    |   8.15%   | [data-juicer-sandbox-self-evolution.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
+
+### 6.1. Evaluation and Verification
+- LLaVA pretrain (LCS-558k): The model pre-trained with **the refined pre-training dataset** and fine-tuned with the original instruction dataset outperformed the baseline model LLaVA-1.5-13B on 10 of the 12 evaluation sets.
+
+| Models | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
+|---------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| LLaVA-1.5-13B <br> (Baseline)         | **80.0**  | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
+| LLaVA-1.5-13B <br> (Rectified Pretraining Dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
+
+- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): Trained on the **refined datasets**, they comprehensively surpass the baseline model [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). Here, T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). For details, please refer to [Sandbox Laboratory](./Sandbox-ZH.md).
+
+| model                         | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
+|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
+| T2V-Turbo               | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
+| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
+| Data-Juicer (DJ, 228k)  | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |
+
+| model                         | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
+|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| T2V-Turbo               | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
+| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
+| Data-Juicer (DJ, 228k)  | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |
+
+## 7. Basic Example Recipes for Video Data
+We provide a sample recipe for processing video datasets to help users make better use of video-related operators: [general-video-refine-example.yaml](../configs/data_juicer_recipes/general-video-refine-example.yaml). It applies three types of operators:
+- Text-only: improve dataset quality based on the video descriptions
+- Video-only: improve dataset quality based on video properties
+- Text-Video: improve dataset quality based on the alignment between text and video
+
+Users can start their own video dataset processing workflows from this recipe.
+
+## 8. Synthesize a human-centric video review set
+Data-Juicer can also support video benchmark synthesis, such as [HumanVBench](https://arxiv.org/abs/2412.17574), which converts in-the-wild videos into human-centric video benchmarks. The corresponding data recipes and construction process can be found in [HumanVBench-dev](https://github.com/modelscope/data-juicer/tree/HumanVBench).
+
+## 9. Improve existing open source video datasets
+
+| Data subset | Number of samples before refinement | Number of samples after refinement | Sample retention rate | Configuration link | Data link | Source |
+|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
+| Data-Juicer (T2V, 147k) |          1,217,346          |   147,176    |   12.09%   | [data-juicer-sandbox-optimal.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool)  <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
+| Data-Juicer (DJ, 228k) |          3,408,553          |   227,867    |   6.69%   | [data-juicer-sandbox-self-evolution.yaml](../configs/data_juicer_recipes/data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool)                                        | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
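+
+The retention ratios can be re-derived from the sample counts in the table; a quick sanity check in plain Python (independent of Data-Juicer):
+
+```python
+# Recompute sample retention ratios (#after / #before) for each subset.
+pools = {
+    "Data-Juicer (T2V, 147k)": (1_217_346, 147_176),
+    "Data-Juicer (DJ, 228k)": (3_408_553, 227_867),
+}
+for name, (before, after) in pools.items():
+    print(f"{name}: {after / before:.2%}")
+# Data-Juicer (T2V, 147k): 12.09%
+# Data-Juicer (DJ, 228k): 6.69%
+```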
+
+### 9.1. Evaluation Results
+- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models trained on the **refined datasets** comprehensively outperform the baseline model [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). Here, T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), which in turn is the teacher model of Data-Juicer (DJ, 228k); see the [Sandbox Lab](./Sandbox-ZH.md) for details.
+
+| model                         | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
+|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
+| T2V-Turbo               | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
+| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
+| Data-Juicer (DJ, 228k)  | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |
+
+| model                         | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
+|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| T2V-Turbo               | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
+| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
+| Data-Juicer (DJ, 228k)  | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |