28 Feb 07:50

HYLcool

6014bcc

Release v1.2.1 Latest

Major Updates

DJ has been integrated into Ray's official Ecosystem and Example Gallery. In addition, our DJ2.0 patch for the streaming JSON reader has been officially integrated into Apache Arrow.
Our work on contrastive data synthesis, ImgDiff, has been accepted by CVPR 2025!
Unit test optimization:
- split unit tests into partial and regression tests: the partial test is triggered by a PR and runs only the test cases corresponding to the changed files; the regression test runs all cases and is triggered at 7:00 every Friday, Beijing time. #598
- use primitive @unittest.skip and remove SKIPPED_TESTS. #586
- upload test coverage reports to GitHub artifacts. #586
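
As a rough illustration of the weekly regression schedule above: many CI schedulers evaluate cron expressions in UTC, so 7:00 on Friday in Beijing time (UTC+8) falls on Thursday evening in UTC. The sketch below is only an assumption about how such a schedule could be derived; the helper name `weekly_cron_in_utc` is hypothetical and is not taken from the Data-Juicer workflows.

```python
from datetime import datetime, timedelta, timezone

# Beijing time is a fixed UTC+8 offset with no daylight saving time.
BEIJING = timezone(timedelta(hours=8))


def weekly_cron_in_utc(local_dt):
    """Return a 5-field cron expression (in UTC) that fires weekly
    at the weekday and time of the timezone-aware `local_dt`."""
    utc_dt = local_dt.astimezone(timezone.utc)
    # Python: Monday=0 ... Sunday=6; cron: Sunday=0 ... Saturday=6.
    cron_dow = (utc_dt.weekday() + 1) % 7
    return f"{utc_dt.minute} {utc_dt.hour} * * {cron_dow}"


# 28 Feb 2025 is a Friday; 07:00 in Beijing is 23:00 on Thursday in UTC.
friday_7am = datetime(2025, 2, 28, 7, 0, tzinfo=BEIJING)
print(weekly_cron_in_utc(friday_7am))  # 0 23 * * 4
```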

New OPs

image_remove_background_mapper: remove the background of images. #589

Others

add missing LOADED_AUDIOS to ALL_INTER_VARS to enable OP fusion and context sharing. #585
only build doc for py3.10. #586
move dependency on ray to minimal requirements. #586 #594 #595
allow executor and other tool functions to consume a loaded dataset in addition to the config file. #596 #597
fix undefined fileno bug of the logger. #594

Acknowledgement

@liuyuhanalex helps simplify the code logic of OP fusion, add a new OP image_remove_background_mapper, and fix some minor bugs. #581 #585 #589
@co63oc helps to fix typos in code and documents. #582 #583 #588 #591 #593
@danielhjz helps to fix the implicit memory leak problem in image_nsfw_filter. #590

Contributors

co63oc, danielhjz, and liuyuhanalex

14 Feb 09:40

yxdyc

v1.2.0

7820a4d

v1.2.0 Doc refactored; New algorithm proposed

What's New

📚 The DJ doc is refactored and improved, e.g., RecipeGallery, DeveloperGuide, DistributedProcess, DJ-related Competitions, and fixes for typos and bad links.
🔎 More unit-tests added.
🎛 The data pre-split and export are improved.
🔮 A new data selection method, DaaR, is proposed. See Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data.

Detailed PRs

fix export error when export_stats columns is null in #557
Resplit input dataset in ray mode in #549
Refactor and improve doc for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions in #561
Resolve most skipped unit-tests in #559
fix translation error in #562
Add unittest for ray text dedup in #540
[Typo]correct a small typo in #563
update the 2.0 paper link & the DaaR news in #566
Fix typos in #571
Optimization for sdxl_prompt2prompt_mapper dependency importing in #570
Fix typos in #572

Acknowledgment

@liuyuhanalex @co63oc made their first PRs

Full Changelog: v1.1.0...v1.2.0

Contributors

co63oc and liuyuhanalex

17 Jan 09:46

BeachWang

v1.1.0

030e786

Release v1.1.0

Major Updates

🧪 Users can now run Ray-based distributed data processing by following the newly added docs. #523
🧪 The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542
💥 Change Task mode to Actor mode for ray deduplication, allowing users to use these operators without installing Redis. #526
🚀 Append a summary of warnings and errors to the log at the end of a run so that they remain recognizable under the sample fault tolerance mechanism. #534
🚀 Automatically update relevant documents when adding OPs to reduce the development burden. #527
🛝 Add usability tags for OPs:
- alpha tag for OPs in which only the basic OP implementations are finished;
- beta tag for OPs in which unittests are added based on the alpha version;
- stable tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the beta version.

New OPs

image_segment_mapper: Perform segment-anything on images and return the bounding boxes. #550
mllm_mapper: Mapper to use MLLMs to generate texts for images. #550
sdxl_prompt2prompt_mapper: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. #550
sentence_augmentation_mapper: Augment sentences using LLMs. #550
text_pair_similarity_filter: Filter samples according to the similarity score between the text pair. #550

Bug Fixed

Add a global skip_op_error param to enable fault tolerance when executing the Data-Juicer analyzer and executor, but disable fault tolerance for unit tests. #528
Fix model force download bug. #529
Fix IndexError if the number of samples in the result dataset is less than the number of workers when saving dataset to disk. #536
Fix missing field meta tag on ray mode. #538
Update max_tokens or max_new_tokens for vllm-based OPs to avoid too short generation. #544
Fix bug in the role playing data generation demo. #545

Others

Enhance unit test for API calling OPs. #528
Remove sandbox requirements installation from Dockerfile. #530
Update the datasource related APIs to be compatible with the latest version of Ray. #532
Limit the generated qa num for each text in generate_qa_from_text_mapper. #541
Update docs for preparing DJ2.0 release. #542
Update a quick cdn link for arch figure. #543
Add a video demo for role playing data generation. #545
Optimize op doc for global textual search. #552
Use a more stable and fast translator than google translator for automatic OP doc building. #554

Acknowledgement

@Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. #550

Contributors

Qirui-jiao

03 Jan 10:59

HYLcool

v1.0.3

87efd5e

Release v1.0.3: More Powerful Distributed MinHashLSH Deduplicator; Post-Tuning Formats & OPs; Ray Actor for GPU-based OPs

Major Updates

💥 Support a Ray-based MinHashLSH deduplicator, which implements a multi-process Union-Find set based on Ray Actor and the BTS algorithm to complete equivalence class merging. #502
💥 Support post-tuning dataset formats in LLaMA-Factory and ModelScope-Swift.
- Data-Juicer chooses the Query-Response format as the intermediate format for the post-tuning dataset. #514
- Refine the overall intermediate format of Data-Juicer to support various dataset formats better. (meta, stats) #514 #518
- Provide several format conversion tools for converting to Data-Juicer format and vice versa. #514
🚀 Add 10 more post-tuning OPs to process post-tuning datasets better. They are listed in detail in the New OPs section below. #513
🚀 Support Ray Actor mode for GPU-based OPs. #511
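
To give a feel for the equivalence class merging mentioned for the MinHashLSH deduplicator, here is a minimal single-process Union-Find sketch. It is not the Ray Actor and BTS-based implementation from #502; the class, the function `dedup_keep_ids`, and the idea of keeping the smallest ID per class are all assumptions made for illustration.

```python
class UnionFind:
    """Minimal disjoint-set structure over hashable, orderable IDs."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        # First pass: walk up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: path compression, point visited nodes at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the smaller ID as representative so results are deterministic.
            if ra < rb:
                self.parent[rb] = ra
            else:
                self.parent[ra] = rb


def dedup_keep_ids(candidate_pairs, all_ids):
    """Merge near-duplicate pairs into classes and keep one ID per class."""
    uf = UnionFind()
    for a, b in candidate_pairs:
        uf.union(a, b)
    return sorted({uf.find(i) for i in all_ids})


# Samples 1, 2, 3 form one class and 5, 6 another; 4 and 7 are unique.
print(dedup_keep_ids([(1, 2), (2, 3), (5, 6)], range(1, 8)))  # [1, 4, 5, 7]
```

In a distributed setting the parent table would have to be partitioned and synchronized across workers, which this sketch does not attempt.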

New OPs

Post-tuning OPs for fine-grained analysis of dialog data. #513

Mapper

dialog_intent_detection_mapper: Mapper to generate user's intent labels in feed back dialog data.
dialog_sentiment_detection_mapper: Mapper to generate user's sentiment labels in feed back dialog data.
dialog_sentiment_intensity_mapper: Mapper to predict user's sentiment intensity (from -5 to 5 in default prompt) in feed back dialog data.
dialog_topic_detection_mapper: Mapper to generate user's topic labels in feed back dialog data.
query_intent_detection_mapper: Mapper to predict user's intent label in a query.
query_sentiment_detection_mapper: Mapper to predict user's sentiment label ('negative', 'neutral' and 'positive') in a query.
query_topic_detection_mapper: Mapper to predict user's topic label in a query.

Aggregator

meta_tags_aggregator: Merge similar meta tags to one tag.

Selector

tags_specified_field_selector: Select samples based on the tags of specified field.

Grouper

naive_reverse_grouper: Split batched samples into individual samples.

Bug Fixed

Fix the wrong argument passing in generate_qa_from_example_mapper. #517
Update the out-of-date Dingding QR code on the main page. #513

Acknowledgement

@jackylee-ch made their first contribution to help fix several invalid links in the document. #521

Full Changelog: v1.0.2...v1.0.3

Contributors

jackylee-ch

20 Dec 12:15

yxdyc

v1.0.2

a26dcc7

Release v1.0.2

Major Updates

Added more mapper/grouper/aggregator OPs for post-tuning scenarios.
Optimized the distributed mode performance and usability with more automatic features.

DJ-Operators

extract_support_text_mapper, relation_identity_mapper, python_file_mapper, #500
naive_grouper, key_value_grouper, #500
nested_aggregator, entity_attribute_aggregator, most_relavant_entities_aggregator, #500
video_extract_frames_mapper, #507

Performance

Optimize ray mode performance, #442
Patch for Performance Benchmark in CI/CD workflows, #506
DJ Ray mode supports streaming loading of jsonl files, #515

Usability and Analysis

support dj-install in recipe-level, #508
support dj-analyze with --auto mode, #512
support op-wise insight auto mining, #516

Acknowledgment

Thanks to Data-Juicer users and contributors for their helpful feedback, issues and PRs!

06 Dec 09:09

BeachWang

v1.0.1

9f1b0c8

Release v1.0.1

Major Updates

🚀 Supports automatically arranging operators from fastest to slowest based on their execution speed, and also supports automating the operator batch size according to the execution speed. #464
🚀 [UnitTest] Performance benchmark for efficiency tests of 4 modalities. Reports will be uploaded to internal wandb server. #483
💥 Added some useful OPs, including the construction of DPO training data and a lightweight user-customizable OP interface. See more details below~ #491 #492 #493
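
The fastest-to-slowest arrangement from #464 can be pictured with the small sketch below. It assumes each operator is a plain callable and that speeds are measured as samples per second on a probe set; `measure_speed` and `reorder_fastest_first` are hypothetical helpers, not Data-Juicer APIs, and the sketch ignores any ordering constraints between operators.

```python
import time


def measure_speed(op, probe_samples):
    """Estimate throughput (samples per second) of `op` on a small probe set."""
    start = time.perf_counter()
    for sample in probe_samples:
        op(sample)
    elapsed = time.perf_counter() - start
    return len(probe_samples) / max(elapsed, 1e-9)


def reorder_fastest_first(ops, speeds):
    """Return `ops` sorted by descending speed; ties keep their original order."""
    paired = sorted(zip(ops, speeds), key=lambda pair: pair[1], reverse=True)
    return [op for op, _ in paired]


# With precomputed speeds, the fastest operator is moved to the front.
print(reorder_fastest_first(["op_a", "op_b", "op_c"], [10.0, 300.0, 50.0]))
# ['op_b', 'op_c', 'op_a']
```

Running cheap filters first means slower operators see fewer samples, which is the usual motivation for this kind of reordering.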

OPs

Text OPs

pair_preference_mapper: Mapper to construct preference answers for QA pairs. #491

Script OPs

python_lambda_mapper: Mapper for executing customized Python lambda functions on data samples. #492
python_file_mapper: Mapper for executing customized Python functions on data samples. #493

Bugs Fixed

Add an argument to control whether to open Monitor for data processing. It's True by default. #483
For the mp start method of monitor, set it to "spawn" for Windows systems and "fork" for others. #483
Update transformers version to >=4.47.0 to avoid "shape mismatch" bug from older version 4.46.3. #483
Fix the logic errors in Turbo acceleration and batch processing, and ensure that map and filter are consistent in this part of the logic. #504
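
The start-method rule for the monitor described above ("spawn" on Windows, "fork" elsewhere) can be sketched with the standard library as follows. The function names are hypothetical, and this is only an illustration of the rule, not the code from #483.

```python
import multiprocessing as mp
import platform


def pick_start_method(system_name=None):
    """Return "spawn" on Windows and "fork" on every other system."""
    system_name = system_name or platform.system()
    return "spawn" if system_name == "Windows" else "fork"


def make_monitor_context():
    """Build a multiprocessing context that uses the chosen start method."""
    return mp.get_context(pick_start_method())


print(pick_start_method("Windows"))  # spawn
print(pick_start_method("Linux"))    # fork
```

Using `mp.get_context` instead of `mp.set_start_method` keeps the choice local to the monitor rather than changing the default for the whole program.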

Others

Pin the PyAV version to prevent inconsistent updates. #504
Skip some unit test for audio OPs to avoid lazy_loader failure during multiprocessing. #503
Remove unnecessary UNFORKABLE marks for some OPs. #491
Refine the docker image building. Add a new self-hosted runner for docker image building, optimize the building logic for auto docker image building on release, change the default full image to a GPU-version image. #494 #501

Acknowledgment

Here we thank public contributors for their PRs and issues to make Data-Juicer better!

22 Nov 02:50

yxdyc

v1.0.0

9caaaa9

Release v1.0.0: Refactor DJ-Dataset & DJ-Operator, Sandbox, and more exciting features!

Major Updates

🚀 Refactor Data-Juicer Operator & Dataset for better usability! We combine our two backends, HuggingFace Dataset and Ray Dataset, into a unified DJ-Dataset, and unify and introduce new invoking interfaces. Based on this, we add a fault-tolerant strategy during the data processing, helping users to know the actual reasons for processing failure. #359 #366
🧪 [Experimental] Data-Juicer Sandbox toolkit is now available! Users are allowed to develop datasets and models in a co-development way with the highly customizable Sandbox to obtain better performance. For more details, please refer to the docs. #273 #291 #312 #332 #364
🚀 Basic API server based on FastAPI is now available in Data-Juicer! Now users can make use of the capabilities of OPs with API service. #468
🚀 Support adaptive resource management:
- Adaptive number of processors for model-based OPs according to the GPU memory and other types of resource utilization. #270 #329 #354
- Adaptive batch size for batched OPs according to their resource utilization to maximize the OP speed. #429
💥 We presented a tutorial of Multi-modal Data Processing for Foundation Models: Practical Guidance and Use Cases at KDD'24. #310
💥 A lot of additions and improvements were made to OPs, DJ-Engine, and CI/CD. See more details below~
🛝 A playground for Data-Juicer is opened for user trial. #277 #368

OPs

Text

ray_document_deduplicator: supports Ray-based distributed exact-match deduplication for text-only datasets. #263
Support sentencepiece tokenizer for MinHash deduplicators. #269
generate_qa_from_text_mapper: generates question and answer pairs from input texts. #333 #454
generate_qa_from_examples_mapper: generates question and answer pairs based on examples. #338 #454
optimize_qa_mapper: optimizes the question-answer pairs in question-answering samples. #338 #454
optimize_query_mapper: optimizes the query in question-answering samples. #338 #454
optimize_response_mapper: optimizes the response in question-answering samples. #454
calibrate_qa_mapper: calibrates question-answer pairs based on reference text. #463
calibrate_query_mapper: calibrates query in question-answer pairs based on reference text. #463
calibrate_response_mapper: calibrates response in question-answer pairs based on reference text. #463
text_chunk_mapper: splits input text to chunks. #481
extract_entity_attribute_mapper: extracts attributes for given entities from the text. #481
extract_entity_relation_mapper: extracts entities and relations in the text for knowledge graph. #481
extract_event_mapper: extracts events and relevant characters in the text. #481
extract_keyword_mapper: generates keywords for the text. #481
extract_nickname_mapper: extracts nickname relationship in the text. #481

Image

image_face_blur_mapper: blurs faces detected in images. #249
image_nsfw_filter: keeps samples containing images with NSFW scores below the threshold. #252
image_watermark_filter: keeps samples containing images with predicted watermark probabilities below the threshold. #256
ray_image_deduplicator: supports Ray-based distributed exact-match deduplication for image or image-text datasets. #263
image_pair_similarity_filter: keeps image pairs with image feature cosine similarity within the specified range based on a CLIP model. #393
image_tagging_mapper: generates image tags from the input images. #423
image_face_count_filter: keeps samples containing images with face counts within the specified range. #446
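
Several of the filters above keep a sample only when a score falls within a specified range, for example the cosine similarity used by image_pair_similarity_filter. The sketch below shows that range check on plain Python lists; in the real OP the feature vectors come from a CLIP model, which is omitted here, and the function names and default thresholds are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_pair(features_a, features_b, min_score=0.1, max_score=1.0):
    """Keep the pair only if its similarity lies within [min_score, max_score]."""
    return min_score <= cosine_similarity(features_a, features_b) <= max_score


print(keep_pair([1.0, 0.0], [1.0, 0.0]))  # True: identical directions
print(keep_pair([1.0, 0.0], [0.0, 1.0]))  # False: orthogonal vectors score 0.0
```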

Video

video_face_blur_mapper: blurs faces detected in videos. #253
video_remove_watermark_mapper: removes the watermarks in given regions from the videos. #236
video_nsfw_filter: keeps samples containing videos with NSFW scores below the threshold. #252
video_watermark_filter: keeps samples containing videos with predicted watermark probabilities below the threshold. #256
ray_video_deduplicator: supports Ray-based distributed exact-match deduplication for video or video-text datasets. #263
video_tagging_from_frames_filter: keeps samples containing videos with given tags. #260
video_captioning_from_frames_mapper: generates samples whose captions are generated based on an image-to-text model and sampled video frames. Captions from different frames will be concatenated into a single string. #257
video_captioning_from_summarizer_mapper: generates video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...). #250
video_motion_score_raft_filter: keeps samples with video motion scores (based on RAFT model) within a specific range. #478
Enhance the video_motion_score_filter to support float sampling FPS, frame resizing, optical flow magnitude normalization, and so on. #361

Misc.

Switch face detection used in 3 OPs (image_face_ratio_filter, image_face_blur_mapper, video_face_blur_mapper) from dlib to OpenCV to avoid dependency problems. #320
Deduplicators for multimodal datasets are allowed to consider text information as well. #313
Support batched processing for some OPs. #406 #435

Others (Engine, Job Control and Tools)

Support more multimodal (video) dataset conversion tools: MSR-VTT #248
Support distributed processing script for Slurm. #242
Support Minhash-LSH deduplication tools based on Spark. #290
Enable GPU usage for Ray executor. #274
Add debug mode for Data-Juicer. #303
Add video generation tools for several metrics. #273 #312
Deploy a self-hosted runner for unit tests and enable unit tests for Ray mode. #304
Add sampled frames from videos for video OPs to support OP fusion. #271
Allow to save stats for each OP respectively by specifying the exporting paths for them. #309
Add a new field to record the source files of multimodal data when they are augmented or regenerated by some OPs, so it's convenient to trace back. #317
Support turbo mode to disable some processing-unrelated functions to maximize the processing speed and save resource utilization. #402
Update type annotations from jsonargparse to Pydantic. #422
Add a Monitor module to monitor the resource utilization during data processing for each OP. #429
Allow lazy importing for third-party libraries and installing dependencies if they are not installed. #414 #443
Allow batched processing for all OPs based on the single-sample version of compute_stats/process methods to avoid modifying them to a batched version manually. #448
Enable unit test coverage report. #460
Support invoking API models for interaction with OpenAI-compatible APIs. #463 #479
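
The item above about deriving batched processing from the single-sample methods can be illustrated with a small wrapper. It assumes a batch is a dict of equal-length lists, as in columnar dataset formats; `batched_from_single` and `lowercase_text` are hypothetical names, not the actual Data-Juicer methods.

```python
def batched_from_single(process_single):
    """Wrap a per-sample function so it accepts a dict-of-lists batch."""

    def process_batched(batch):
        keys = list(batch)
        size = len(batch[keys[0]]) if keys else 0
        # Unpack the columnar batch into one dict per sample.
        rows = [{k: batch[k][i] for k in keys} for i in range(size)]
        results = [process_single(row) for row in rows]
        # Repack the per-sample results into columns.
        out_keys = list(results[0]) if results else keys
        return {k: [r[k] for r in results] for k in out_keys}

    return process_batched


def lowercase_text(sample):
    """Example single-sample mapper: lower-case the "text" field."""
    return {**sample, "text": sample["text"].lower()}


process = batched_from_single(lowercase_text)
print(process({"text": ["Hello", "WORLD"], "id": [1, 2]}))
# {'text': ['hello', 'world'], 'id': [1, 2]}
```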

Document Updates

Refine documentation system based on Sphinx. #245
Regular document updates. #234 #246
Update the class importing and document building logics for better automation. #299
Reorganize the operator documents for better reading. #472

Bugs Fixed

Fix the bug of non-existent videos returned by the video splitting function given a short duration. #243
Fix the bug that the produced multimodal data would be stored in nested dirs in different ops. #247
Fix some problems in demos. #244
Fix "Undefined punctuation_pattern" error in two OPs. #301
Exceptions and errors can be reraised to the upper level and the status code can be returned to the system correctly. #287
Fix the bug where type hint checking for config files did not work. #302
Fix the bug of parameters in the base classes that can not be parsed in some OPs. #311
Fix the memory leaking of video OPs. #374
Fix the bug of two OPs (video_aesthetics_filter and image_diffusion_mapper) that can not make use of GPUs. #389
Fix the bug of checkpoints not being restored correctly when the current process list has fewer OPs than the previous one. #391

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!

@chg0901 helps to fix typos in documents. #237
@lingzhq helps to update the paper list in Awesome Data-Model Co-Development of MLLMs. #289
@shiweijiezero helps fix the bugs in updating the data keys. #300
@seanzhang-zhichen helps to support multiple patterns for replace_content_mapper. #319
@simplaj helps to fix a bug of a non-predefined attribute for video_captioning_from_summarizer_mapper. #343
@zhenqincn helps to reorganize the paper list and add more papers from our survey in Awesome Data-Model Co-Development of MLLMs. #352 #381 #456 #461
@2108038773 helps to add trust_remote_code argument for some public models on HuggingFace. #382 #385
@TobyJasper helps to fix typos in documents and contribute a new OP image_face_count_filter. #392 #452
@co63oc helps to fix some typos in documents and code. #427

Contributors

co63oc, chg0901, and 7 other contributors

07 Mar 12:24

HYLcool

v0.2.0

156ed20

Release v0.2.0: Multimodal Support & DJ-SORA

New Features

🚀 We introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models. #227
🚀 We introduce hundreds of dedicated video, image, audio, text, and other multi-modal data processing operators and tools.
💥 Our paper has been accepted by SIGMOD'24 industrial track! #211
💥 "BetterMixture" — Our second data-centric LLM competition has kicked off and is about to end soon. #174

New OPs

Multimodal

video_frames_text_similarity_filter: keeps samples whose similarities between sampled video frame images and text are within a specific range. #227
video_tagging_from_frames_mapper: generates video tags from frames extracted from the video. #227
video_tagging_from_audio_mapper: generates video tags from audio streams extracted from videos. #227
video_captioning_from_video_mapper: generates captions from frame images extracted from video to augment datasets. #227
video_captioning_from_audio_mapper: captions a video according to its audio streams. #227
image_captioning_mapper: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
image_captioning_from_gpt4v_mapper: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
image_diffusion_mapper: generates and augments the images based on the Stable Diffusion model and their original images and texts. This OP will increase the number of samples in the dataset. #200

Video

Filter

video_duration_filter: keeps samples whose videos' durations are within a specified range. #227
video_aspect_ratio_filter: filters samples according to the aspect ratios of videos (a fraction of width by height, r=w/h) in them. #227
video_resolution_filter: filters samples according to the resolution of videos in them. #227
video_ocr_area_ratio_filter: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227
video_aesthetics_filter: filters samples according to the aesthetics score of frame images extracted from videos. #227
video_motion_score_filter: keeps samples with video motion scores within a specific range. #227

Mapper

video_split_by_scene_mapper: splits videos into scene clips. #227
video_split_by_duration_mapper: splits videos by specified duration interval. #227
video_split_by_key_frame_mapper: splits videos by their keyframes. #227
video_resize_aspect_ratio_mapper: resizes aspect ratios of videos (a fraction of width by height, r=w/h) to a specified range. #227
video_resize_resolution_mapper: maps videos to ones with a given resolution range. #227
video_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to video data more conveniently. #227

Deduplicator

video_deduplicator: deduplicates samples at document-level using exact matching of videos between documents. #227

Audio

audio_duration_filter: keeps samples whose audios' durations are within a specified range. #177
audio_size_filter: keeps samples whose audios' sizes are within a specified range. #184
audio_nmf_snr_filter: keeps samples whose audios' Signal Noise Ratios (computed based on Non-Negative Matrix Factorization algorithm) are within a specified range. #189
audio_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to audio data more conveniently. #227

Image

image_blur_mapper: adds random noises to images to blur them. #180
image_aesthetics_filter: filters samples according to the aesthetics scores of images. #227

Document Updates

"Bad" Data Exhibition EN ZH: shows how Data-Juicer finds those "bad" data and what they look like.
Awesome LLM Data EN: a collection of awesome LLM datasets with fine-grained tags.
Developer Guide enhancement EN ZH: adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 #220
OP Insight Visualization Demo code: adds a demo to visualize how each OP works.

Bugs Fixed

Fix stats computation error in the ray mode due to the inappropriate initialization method. #173
Fix the bug that some images will be lost when converting their paths to absolute paths. #178
Fix the dependency problems of OPs that depend on other OPs. #181
Fix the bug that the predict.py tool gets stuck on the help page. #183
Fix face_area_filter: constrains the detection coordinates within the image. #202
Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. #195
Fix or update invalid links in Data-Juicer. #201 #219

Others

Optimize the model management module. #196 #227
Optimize the unit test actions. #195 #196 #216 #227
Optimize the multiprocessing strategy; model inference efficiency can be improved thanks to GPU support. #203 #217 #222 #227
Update the docker image with JDK. #208
Support more multimodal (video) dataset conversion tools: #227
- InternVid: 234M video-caption data
- Youku-mPLUG: 36TB video-caption data
- Video-ChatGPT: 100k video-instruction data
Optimize the generated multimodal data storage. #227
Support running data-juicer process jobs on Aliyun PAI-DLC. #227
Better support for multi-machine distributed data processing in Ray mode. #227

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!

@liuyanyi helps to fix a bug in quality classifier tools. #183
@co63oc helps to fix some typos. #215
@liuyanyi helps to provide the solution to add JDK in the docker image. #182 #208
@zhenqincn helps to add more papers to the Awesome LLM Data doc. #226

Contributors

co63oc, liuyanyi, and zhenqincn

05 Jan 09:31

HYLcool

v0.1.3

a3c8310

Release v0.1.3: support more Python versions; support multimodal data; more OPs; bugs fixed

New Features

Data-Juicer now supports Python3.7-3.10!
- We released a pybind version of simhash-py library named simhash-pybind to solve the Python version limitation problem.
- We test several version-dependent third-party libraries (e.g. dill, kenlm, ...) and validate their availability on different Python versions.
Multimodal dataset analysis and processing are now supported. #64 #91 #95 #106
- A novel intermediate multimodal sample format: using some special tokens to split text chunks and represent non-text information.
- Several dataset format conversion tools for popular multimodal datasets: LLaVA, MMC4, WavCaps, and more.
- Lots of multimodal OPs are also released: see categories Image and Multimodal in the section New OPs below.
Auto-HPO tools are now available, which can help users find better hyperparameters for OPs according to specified objective functions or with simple 3-sigma rules only. #65 #140
Some content cleaning mappers (e.g. email, IP, ...) now support replacing regex patterns with specified strings, not just with empty ones. Additionally, a general version OP is implemented as a new OP replace_content_mapper. #143
Some collectors, metrics, and drawing functions are added to the analysis module to help users measure the token distribution of a single dataset or distribution difference between different datasets. #160
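
For the 3-sigma rule mentioned in the Auto-HPO item above, one common reading is to set a filter's lower and upper thresholds to the mean of a stat minus and plus three standard deviations. The sketch below follows that reading; it is an assumption about the general technique, and `three_sigma_bounds` is a hypothetical helper rather than the Auto-HPO tool itself.

```python
import statistics


def three_sigma_bounds(values):
    """Return (lower, upper) thresholds as mean -/+ 3 population std deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std


# Mean is 5 and population standard deviation is 2 for this toy stat column.
lower, upper = three_sigma_bounds([2, 4, 4, 4, 5, 5, 7, 9])
print(lower, upper)  # -1.0 11.0
```

Samples whose stat falls outside the returned range would then be candidates for filtering.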

New OPs

Text

chinese_convert_mapper: converts Chinese between Traditional Chinese, Simplified Chinese, and Japanese Kanji (by opencc) #51
remove_non_chinese_character_mapper: removes non-Chinese characters in text samples. #51
text_action_filter: keeps samples containing action verbs in their texts. #122
text_entity_dependency_filter: keeps samples containing entity nouns related to other tokens in the dependency tree of the texts. #122
replace_content_mapper: replaces all content in the text that matches a specific regular expression pattern with a designated replacement string. #143
remove_repeat_sentences_mapper: Remove repeated sentences in the text. #149
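
The behavior described for replace_content_mapper, replacing every match of a regular expression with a designated string, maps directly onto `re.sub`. The sketch below is only an illustration: the function name, its parameters, and the example email pattern are assumptions and do not reflect the OP's actual signature.

```python
import re


def replace_content(text, pattern, repl=""):
    """Replace every match of `pattern` in `text` with `repl`.

    With the default empty `repl`, matches are simply removed.
    """
    return re.sub(pattern, repl, text)


print(replace_content("mail me at a@b.com", r"\S+@\S+", "<EMAIL>"))
# mail me at <EMAIL>
print(replace_content("call 12345 now", r"\d+"))
# call  now
```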

Image

image_shape_filter: keeps samples containing images with widths and heights within the specified ranges. #74
image_aspect_ratio_filter: keeps samples containing images with aspect ratios (w/h) within the specified range. #64
image_size_filter: keeps samples containing images whose sizes in bytes are within the specified range. #73
face_area_filter: keeps samples containing images with face area ratios within the specified range. #110
image_deduplicator: deduplicates samples at document-level using exact matching of images between documents. #72

Multimodal

image_text_similarity_filter: keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model. #69
image_text_matching_filter: keeps samples with image-text classification matching scores within the specified range based on a BLIP model. #100
phrase_grounding_recall_filter: keeps samples whose locating/grounding recalls of phrases extracted from text in the images are within a specified range. #139

Bugs fixed

Pin pandas==2.0.0 and fsspec==2023.3.0 to avoid unexpected errors from third-party dependencies. #38 #42
Fix the bug when OPs nlpaug_en_mapper and nlpcda_zh_mapper generate indefinite numbers of augmented samples. #76
Fix the bug where maximum_line_length_filter might generate stats of mismatched types (int vs. float), which leads to an error when processing datasets. #147
Fix the bug of missing attribute dataset_dir when the input dataset path is remote or a mixture of several datasets. #155 #157
Fix the bug of commandline arguments parsing error in some cases. #108 #165
Store simhash value as string type to avoid errors from PyArrow. #168 #170

Others

Dependency importing optimization: only require and import some dependencies when using. #35 #82
Release demos and datasets on HuggingFace, and release models trained with our refined datasets on both ModelScope and HuggingFace. #42 #54
Optimize the cache directory selection logic. #43
Support limiting the number of samples when mixing datasets. #86
Avoid extra unnecessary model preparation when enabling tokenization in some OPs. #99
OP language_id_score_filter supports keeping samples in multiple languages now. #125 #151

Acknowledgement

Here we thank public contributors for their PRs to make Data-Juicer better!

@JONGSKY helps to remove some unnecessary code. #85
@xuruidong helps to fix several broken links in the README doc. #142

Contributors

xuruidong and JONGSKY

28 Sep 06:32

HYLcool

v0.1.2

5bd715d

Release v0.1.2: more core functions are available now.

New OPs

nlpaug_en_mapper: simple data augmentation using nlpaug library for English corpus. #17
nlpcda_zh_mapper: simple data augmentation using nlpcda library for Chinese corpus. #17
token_num_filter: filter out samples by the number of tokens in them. HF tokenizers are supported. #24

New features

OP Fusion #14
- Now Filters that share the same contextual variables can be fused into one OP, saving at most 25% time when processing datasets.
Cache management #19
- Cache management now works in Data-Juicer thanks to the new serialization method.
- Cache compression is supported: it automatically compresses caches when they are no longer in use and decompresses them when needed, which saves at most 50% disk space.
Distributed data processing with Ray is supported now. #21
Config sys optimization:
- Only keep text_keys and remove previous misleading arg text_key(s)_to_process/load. #13
- A new argument export_in_parallel is added to control whether to export the result datasets in parallel. #17
- Display the config table after config parsing is ready. #17
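
The OP Fusion idea above, where Filters that share the same contextual variables are fused into one OP, can be pictured as several filters reading from one per-sample context so that the shared intermediate result is computed once. The sketch below is a simplified assumption of that idea; all names are hypothetical and it is not the Data-Juicer fusion code.

```python
def get_words(sample, context):
    """Compute the word list once per sample and cache it in the shared context."""
    if "words" not in context:
        context["words"] = sample["text"].split()
    return context["words"]


def min_words_filter(sample, context, min_num=3):
    """Keep samples with at least `min_num` words."""
    return len(get_words(sample, context)) >= min_num


def max_avg_word_len_filter(sample, context, max_len=8):
    """Keep samples whose average word length does not exceed `max_len`."""
    words = get_words(sample, context)
    return bool(words) and sum(map(len, words)) / len(words) <= max_len


def fused_filter(sample, filters):
    """Run all filters on one sample with a single shared context."""
    context = {}
    return all(f(sample, context) for f in filters)


filters = [min_words_filter, max_avg_word_len_filter]
print(fused_filter({"text": "a quick brown fox"}, filters))  # True
print(fused_filter({"text": "hi"}, filters))                 # False
```

Without fusion, each filter would split the text again; with the shared context the split happens once per sample.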

Others

Replace original string constants with constant enums. #13
Expand the checkpoint protection range to cover the exporting process. #14
Remove extra intermediate variables storage in document_simhash_deduplicator to save more memory. #14
Docs updates. #15 #16
PyPI package is available. You can now install data-juicer with pip install py-data-juicer. #23
Docker building is available now. The official docker image for Docker Hub is in progress. #23
Deploy the unit tests for Data-Juicer. #29

Releases: modelscope/data-juicer
