Refactor and improve doc for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions (#561)

* 1. refactor doc for RecipeGallery; 2. improve the doc for developer guide 3. some typo fix, and suitable overview fig size; 4. add link to the added data resplit tool
* add use cases for DJ related competitions
* in use case, add agentscope
* remove []
* unify commas
* fix TOC rendering error
* fix spaces and en version
* fix bad link
* suitable overview fig size in homepage
modelscope · Jan 22, 2025 · 84dbc78
1 parent bef0644
Showing 12 changed files with 564 additions and 465 deletions.
diff --git a/README.md b/README.md
@@ -28,7 +28,7 @@
 
 
 Data-Juicer is a one-stop system to process text and multimodal data for and with foundation models (typically LLMs).
-We provide a [playground](http://8.138.149.181/) with a managed JupyterLab. [Try Data-Juicer](http://8.138.149.181/) straight away in your browser! If you find Data-Juicer useful for your research or development, please kindly support us by starring it (then be instantly notified of our new releases) and citing our [work](#references).
+We provide a [playground](http://8.138.149.181/) with a managed JupyterLab. [Try Data-Juicer](http://8.138.149.181/) straight away in your browser! If you find Data-Juicer useful for your research or development, please kindly support us by starring it (then be instantly notified of our new releases) and citing our [works](#references).
 
 [Platform for AI of Alibaba Cloud (PAI)](https://www.aliyun.com/product/bigdata/learn) has cited our work and integrated Data-Juicer into its data processing products. PAI is an AI Native large model and AIGC engineering platform that provides dataset management, computing power management, model tool chain, model development, model training, model deployment, and AI asset management. For documentation on data processing, please refer to: [PAI-Data Processing for Large Models](https://help.aliyun.com/zh/pai/user-guide/components-related-to-data-processing-for-foundation-models/?spm=a2c4g.11186623.0.0.3e9821a69kWdvX).
 
@@ -101,7 +101,7 @@ Table of Contents
 
 ## Why Data-Juicer?
 
-![Overview](https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps-4034-4146.png)
+<img src="https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps-4034-4146.png" align="center" width="600" />
 
 - **Systematic & Reusable**:
   Empowering users with a systematic library of 100+ core [OPs](docs/Operators.md), and 50+ reusable config recipes and 
@@ -121,35 +121,51 @@ Table of Contents
 
 
 ## DJ-Cookbook
+
 ### Curated Resources
 - [KDD-Tutorial](https://modelscope.github.io/data-juicer/_static/tutorial_kdd24.html)
 - [Awesome LLM-Data](docs/awesome_llm_data.md)
 - ["Bad" Data Exhibition](docs/BadDataExhibition.md)
 
+
 ### Coding with Data-Juicer (DJ)
-- [Overview of DJ](README.md)
-- [Operator Zoo](docs/Operators.md)
-- [Quick Start](#quick-start)
-- [Configuration](configs/README.md)
-- [Developer Guide](docs/DeveloperGuide.md)
-- [API references](https://modelscope.github.io/data-juicer/)
-- [Preprocess Tools](tools/preprocess/README.md)
-- [Postprocess Tools](tools/postprocess/README.md)
-- [Format Conversion](tools/fmt_conversion/README.md)
-- [Sandbox](docs/Sandbox.md)
-- [Quality Classifier](tools/quality_classifier/README.md)
-- [Auto Evaluation](tools/evaluator/README.md)
-- [Third-parties Integration](thirdparty/LLM_ecosystems/README.md)
+- Basics
+  - [Overview of DJ](README.md)
+  - [Quick Start](#quick-start)
+  - [Configuration](docs/RecipeGallery.md)
+  - [Data Format Conversion](tools/fmt_conversion/README.md)
+- Lookup Materials
+  - [DJ OperatorZoo](docs/Operators.md)
+  - [API references](https://modelscope.github.io/data-juicer/)
+- Advanced
+  - [Developer Guide](docs/DeveloperGuide.md)
+  - [Preprocess Tools](tools/preprocess/README.md)
+  - [Postprocess Tools](tools/postprocess/README.md)
+  - [Sandbox](docs/Sandbox.md)
+  - [Quality Classifier](tools/quality_classifier/README.md)
+  - [Auto Evaluation](tools/evaluator/README.md)
+  - [Third-parties Integration](thirdparty/LLM_ecosystems/README.md)
+
 
 ### Use Cases & Data Recipes
-- [Recipes for data process in BLOOM](configs/reproduced_bloom/README.md)
-- [Recipes for data process in RedPajama](configs/reproduced_redpajama/README.md)
-- [Refined recipes for pre-training text data](configs/data_juicer_recipes/README.md)
-- [Refined recipes for fine-tuning text data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-alpaca-cot-dataset)
-- [Refined recipes for pre-training multi-modal data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-multimodal-dataset)
+- [Data Recipe Gallery](docs/RecipeGallery.md)
+  - Data-Juicer Minimal Example Recipe
+  - Reproducing Open Source Text Datasets
+  - Improving Open Source Text Pre-training Datasets
+  - Improving Open Source Text Post-processing Datasets
+  - Synthetic Contrastive Learning Image-text Datasets
+  - Improving Open Source Image-text Datasets
+  - Basic Example Recipes for Video Data
+  - Synthesizing Human-centered Video Evaluation Sets
+  - Improving Existing Open Source Video Datasets
+- Data-Juicer related Competitions
+  - [Better Synth](https://tianchi.aliyun.com/competition/entrance/532251), explore the impact of large model synthetic data on image understanding ability with DJ-Sandbox Lab and multimodal large models
+  - [Modelscope-Sora Challenge](https://tianchi.aliyun.com/competition/entrance/532219), based on Data-Juicer and [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) framework, optimize data and train SORA-like small models to generate better videos
+  - [Better Mixture](https://tianchi.aliyun.com/competition/entrance/532174), only adjust data mixing and sampling strategies for given multiple candidate datasets
+  - FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for a specified candidate dataset, only adjust the data filtering and enhancement strategies
+  - [Kolors-LoRA Stylized Story Challenge](https://tianchi.aliyun.com/competition/entrance/532254), based on Data-Juicer and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) framework, explore Diffusion model fine-tuning
 - [DJ-SORA](docs/DJ_SORA.md)
-- [Agentic Filters of DJ](./demos/api_service/react_data_filter_process.ipynb)
-- [Agentic Mappers of DJ](./demos/api_service/react_data_mapper_process.ipynb)
+- Based on Data-Juicer and [AgentScope](https://github.com/modelscope/agentscope) framework, leverage [agents to call DJ Filters](./demos/api_service/react_data_filter_process.ipynb) and [call DJ Mappers](./demos/api_service/react_data_mapper_process.ipynb)
 
 
 ### Interactive Examples
@@ -466,7 +482,7 @@ features, bug fixes, and better documentation. Please refer to
 Data-Juicer is used across various foundation model applications and research initiatives, such as industrial scenarios in Alibaba Tongyi and Alibaba Cloud's platform for AI (PAI).
 We look forward to more of your experience, suggestions, and discussions for collaboration!
 
-Data-Juicer thanks many community [contributers](https://github.com/modelscope/data-juicer/graphs/contributors) and open-source projects, such as
+Data-Juicer thanks many community [contributors](https://github.com/modelscope/data-juicer/graphs/contributors) and open-source projects, such as
 [Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), ....
 
 

diff --git a/README_ZH.md b/README_ZH.md
@@ -94,7 +94,7 @@ Data-Juicer is being actively updated and maintained, and we will periodically enhance and add more
 
 ## Why Data-Juicer?
 
-![Overview](https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps-4034-4146.png)
+<img src="https://img.alicdn.com/imgextra/i2/O1CN01EteoQ31taUweAW1UE_!!6000000005918-2-tps-4034-4146.png" align="center" width="600" />
 
 - **Systematic & Reusable**:
 Systematically providing users with 100+ core [OPs](docs/Operators.md), and 50+ reusable data recipes and
@@ -116,30 +116,43 @@ Data-Juicer is being actively updated and maintained, and we will periodically enhance and add more
 - ["Bad" Data Exhibition](docs/BadDataExhibition_ZH.md)
 
 ### Coding with Data-Juicer (DJ)
-- [Overview of DJ](README_ZH.md)
-- [Operator Zoo](docs/Operators.md)
-- [Quick Start](#快速上手)
-- [Configuration](configs/README_ZH.md)
-- [Developer Guide](docs/DeveloperGuide_ZH.md)
-- [API references](https://modelscope.github.io/data-juicer/)
-- [Preprocess Tools](tools/preprocess/README_ZH.md)
-- [Postprocess Tools](tools/postprocess/README_ZH.md)
-- [Format Conversion](tools/fmt_conversion/README_ZH.md)
-- [Sandbox](docs/Sandbox-ZH.md)
-- [Quality Classifier](tools/quality_classifier/README_ZH.md)
-- [Auto Evaluation](tools/evaluator/README_ZH.md)
-- [Third-parties Integration](thirdparty/LLM_ecosystems/README_ZH.md)
+- Basics
+  - [Overview of DJ](README_ZH.md)
+  - [Quick Start](#快速上手)
+  - [Configuration](docs/RecipeGallery_ZH.md)
+  - [Data Format Conversion](tools/fmt_conversion/README_ZH.md)
+- Lookup Materials
+  - [Operator Zoo](docs/Operators.md)
+  - [API references](https://modelscope.github.io/data-juicer/)
+- Advanced
+  - [Developer Guide](docs/DeveloperGuide_ZH.md)
+  - [Preprocess Tools](tools/preprocess/README_ZH.md)
+  - [Postprocess Tools](tools/postprocess/README_ZH.md)
+  - [Sandbox](docs/Sandbox-ZH.md)
+  - [Quality Classifier](tools/quality_classifier/README_ZH.md)
+  - [Auto Evaluation](tools/evaluator/README_ZH.md)
+  - [Third-parties Integration](thirdparty/LLM_ecosystems/README_ZH.md)
 
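 ### Use Cases & Data Recipes
-
-* [Recipes for data process in BLOOM](configs/reproduced_bloom/README_ZH.md)
-* [Recipes for data process in RedPajama](configs/reproduced_redpajama/README_ZH.md)
-* [Refined recipes for pre-training text data](configs/data_juicer_recipes/README_ZH.md)
-* [Refined recipes for fine-tuning text data](configs/data_juicer_recipes/README_ZH.md#完善前后的alpaca-cot数据集)
-* [Refined recipes for pre-training multi-modal data](configs/data_juicer_recipes/README_ZH.md#before-and-after-refining-for-multimodal-dataset)
+* [Data Recipe Gallery](docs/RecipeGallery.md)
+  - Data-Juicer Minimal Example Recipe
+  - Reproducing Open Source Text Datasets
+  - Improving Open Source Text Pre-training Datasets
+  - Improving Open Source Text Post-processing Datasets
+  - Synthetic Contrastive Learning Image-text Datasets
+  - Improving Open Source Image-text Datasets
+  - Basic Example Recipes for Video Data
+  - Synthesizing Human-centered Video Evaluation Sets
+  - Improving Existing Open Source Video Datasets
+* Data-Juicer related Competitions
+  - [Better Synth](https://tianchi.aliyun.com/competition/entrance/532251), explore the impact of large model synthetic data on image understanding ability with DJ-Sandbox Lab and multimodal large models
+  - [Modelscope-Sora Challenge](https://tianchi.aliyun.com/competition/entrance/532219), based on Data-Juicer and [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) framework, tune text-video datasets and train SORA-like small models to generate better videos
+  - [Better Mixture](https://tianchi.aliyun.com/competition/entrance/532174), only adjust data mixing and sampling strategies for given multiple candidate datasets
+  - FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for a specified candidate dataset, only adjust the data filtering and enhancement strategies
+  - [Kolors-LoRA Stylized Story Challenge](https://tianchi.aliyun.com/competition/entrance/532254), based on Data-Juicer and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) framework, explore Diffusion model fine-tuning
 * [DJ-SORA](docs/DJ_SORA_ZH.md)
-* [Agents calling DJ Filters](./demos/api_service/react_data_filter_process.ipynb)
-* [Agents calling DJ Mappers](./demos/api_service/react_data_mapper_process.ipynb)
+* Based on Data-Juicer and [AgentScope](https://github.com/modelscope/agentscope) framework, leverage [agents to call DJ Filters](./demos/api_service/react_data_filter_process.ipynb) and [call DJ Mappers](./demos/api_service/react_data_mapper_process.ipynb)
+  
 
 
 ### Interactive Examples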
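@@ -449,7 +462,7 @@ Data-Juicer is released under Apache License 2.0.
 
 ## Acknowledgement
 
-Data-Juicer is used by many foundation model products and research works, example the industrial scenarios built on Alibaba Tongyi and Alibaba Cloud's platform for AI (PAI). We look forward to more of your feedback, suggestions, and collaboration!
+Data-Juicer is used by many foundation model products and research works, such as the industrial scenarios built on Alibaba Tongyi and Alibaba Cloud's platform for AI (PAI). We look forward to more of your feedback, suggestions, and collaboration!
 
 
 Data-Juicer thanks the community [contributors](https://github.com/modelscope/data-juicer/graphs/contributors) and related pioneering open-source projects, such as [Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), ....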

diff --git a/configs/README.md b/configs/README.md
diff --git a/configs/README_ZH.md b/configs/README_ZH.md