Releases: modelscope/data-juicer
Release v1.2.1
Major Updates
DJ has been integrated in Ray's official Ecosystem and Example Gallery. Besides, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by Apache Arrow.
Our work on contrastive data synthesis, ImgDiff, has been accepted by CVPR 2025!
- Unit test optimization:
- split unit tests into partial and regression tests: partial tests are triggered by PRs and only run the test cases corresponding to the changed files; regression tests run all cases and are triggered at 7:00 every Friday (Beijing time). #598
- use the primitive `@unittest.skip` and remove `SKIPPED_TESTS`. #586
- upload test coverage reports to GitHub artifacts. #586
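As a minimal sketch of what skipping with the standard-library decorator looks like (the test class, method names, and skip reason are hypothetical and not taken from the Data-Juicer test suite):

```python
import unittest


class ImageOpTest(unittest.TestCase):  # hypothetical test case

    @unittest.skip("requires a GPU runner")  # illustrative reason string
    def test_needs_gpu(self):
        self.fail("never executed because the test is skipped")

    def test_always_runs(self):
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the test case and return the collected unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ImageOpTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

The skip is declared next to the test it affects, so no separate list of skipped tests has to be kept in sync.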
New OPs
- `image_remove_background_mapper`: remove the background of images. #589
Others
- add missing LOADED_AUDIOS to ALL_INTER_VARS to enable OP fusion and context sharing. #585
- only build doc for py3.10. #586
- move dependency on `ray` to minimal requirements. #586 #594 #595
- allow executor and other tool functions to consume a loaded dataset in addition to the config file. #596 #597
- fix undefined `fileno` bug of the logger. #594
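The context-sharing idea behind OP fusion mentioned in the list above can be illustrated with a small sketch. Everything here is hypothetical (the loader, the filter functions, and the context layout are not Data-Juicer's internals); it only shows two fused filters reusing one loaded resource instead of loading it twice:

```python
class CountingLoader:
    """Stand-in for an expensive decode step; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def load(self, path):
        self.calls += 1
        return {"path": path, "duration": 3.0}  # fake decoded audio


def get_audio(path, context, loader):
    # Load once per sample, then serve later requests from the shared context.
    if path not in context:
        context[path] = loader.load(path)
    return context[path]


def duration_filter(sample, context, loader):
    return get_audio(sample["audio"], context, loader)["duration"] >= 1.0


def extension_filter(sample, context, loader):
    return get_audio(sample["audio"], context, loader)["path"].endswith(".wav")


def run_fused(sample, filters, loader):
    context = {}  # shared by all fused filters for this sample
    return all(f(sample, context, loader) for f in filters)
```

Running both filters through `run_fused` triggers a single `load` call, which is where the time saving of fusion would come from under these assumptions.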
Acknowledgement
- @liuyuhanalex helps simplify the code logic of OP fusion, add a new OP `image_remove_background_mapper`, and fix some minor bugs. #581 #585 #589
- @co63oc helps to fix typos in code and documents. #582 #583 #588 #591 #593
- @danielhjz helps to fix the implicit memory leak problem in `image_nsfw_filter`. #590
v1.2.0 Doc refactored; New algorithm proposed
What's New
- 📚 The DJ doc is refactored and improved, e.g., RecipeGallery, DeveloperGuide, DistributedProcess, DJ-related Competitions, typos, and bad links.
- 🔎 More unit-tests added.
- 🎛 The data pre-split and export are improved.
- 🔮 A new data selection method, DaaR, is proposed. See Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data.
Detailed PRs
- fix export error when export_stats columns is null in #557
- Resplit input dataset in ray mode in #549
- Refactor and improve doc for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions in #561
- Resolve most skipped unit-tests in #559
- fix translation error in #562
- Add unittest for ray text dedup in #540
- [Typo]correct a small typo in #563
- update the 2.0 paper link & the DaaR news in #566
- Fix typos in #571
- Optimization for sdxl_prompt2prompt_mapper dependency importing in #570
- Fix typos in #572
Acknowledgment
- @liuyuhanalex @co63oc made their first PRs
Full Changelog: v1.1.0...v1.2.0
Release v1.1.0
Major Updates
- 🧪 Users can now run ray-based distributed data processing under the guidance of the added docs. #523
- 🧪 The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542
- 💥 Change Task mode to Actor mode for ray deduplication, allowing users to use these operators without installing Redis. #526
- 🚀 Append a log summary of warnings and errors at the end of a run to make them recognizable under the sample fault tolerance mechanism. #534
- 🚀 Automatically update relevant documents when adding OPs to reduce the development burden. #527
- 🛝 Add usability tags for OPs:
- `alpha` tag for OPs in which only the basic OP implementations are finished;
- `beta` tag for OPs in which unittests are added based on the `alpha` version;
- `stable` tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the `beta` version.
New OPs
- `image_segment_mapper`: Perform segment-anything on images and return the bounding boxes. #550
- `mllm_mapper`: Mapper to use MLLMs to generate texts for images. #550
- `sdxl_prompt2prompt_mapper`: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. #550
- `sentence_augmentation_mapper`: Augment sentences using LLMs. #550
- `text_pair_similarity_filter`: Filter samples according to the similarity score between the text pair. #550
Bug Fixed
- Add global `skip_op_error` param to enable fault tolerance when executing the Data-Juicer analyzer and executor, but disable fault tolerance for unit tests. #528
- Fix model force download bug. #529
- Fix `IndexError` if the number of samples in the result dataset is less than the number of workers when saving dataset to disk. #536
- Fix missing field meta tag on ray mode. #538
- Update `max_tokens` or `max_new_tokens` for vllm-based OPs to avoid too short generation. #544
- Fix bug in the role playing data generation demo. #545
Others
- Enhance unit test for API calling OPs. #528
- Remove sandbox requirements installation from Dockerfile. #530
- Update the `datasource` related APIs to be compatible with the latest version of Ray. #532
- Limit the generated qa num for each text in `generate_qa_from_text_mapper`. #541
- Update docs for preparing DJ2.0 release. #542
- Update a quick cdn link for arch figure. #543
- Add a video demo for role playing data generation. #545
- Optimize op doc for global textual search. #552
- Use a more stable and fast translator than google translator for automatic OP doc building. #554
Acknowledgement
- @Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. #550
Release v1.0.3: More Powerful Distributed MinHashLSH Deduplicator; Post-Tuning Formats & OPs; Ray Actor for GPU-based OPs
Major Updates
- 💥 Support Ray-based MinHashLSH deduplicator, which implemented a multi-process Union-Find set based on Ray Actor and BTS algorithm to complete equivalence class merging. #502
- 💥 Support post-tuning dataset formats in LLaMA-Factory and ModelScope-Swift.
- Data-Juicer chooses the Query-Response format as the intermediate format for the post-tuning dataset. #514
- Refine the overall intermediate format of Data-Juicer to support various dataset formats better. (`meta`, `stats`) #514 #518
- Provide several format conversion tools for converting to Data-Juicer format and vice versa. #514
- 🚀 Add 10 more post-tuning OPs to process post-tuning datasets better. They are listed in detail in the New OPs section below. #513
- 🚀 Support Ray Actor mode for GPU-based OPs. #511
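To make the "equivalence class merging" step of the MinHashLSH deduplicator concrete, here is a single-process Union-Find sketch. It is not the multi-process, Ray Actor based implementation described above, and it does not implement the BTS algorithm; the class and the sample pairs are illustrative assumptions only:

```python
class UnionFind:
    """Plain in-memory Union-Find with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the smaller id as the representative of the merged class.
            self.parent[max(ra, rb)] = min(ra, rb)


def merge_duplicates(pairs):
    """Given near-duplicate pairs, return {sample_id: representative_id}."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    return {x: uf.find(x) for x in list(uf.parent)}
```

Under this sketch, a deduplicator would keep one sample per representative id and drop the rest.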
New OPs
Post-tuning OPs for fine-grained analysis of dialog data. #513
Mapper
- `dialog_intent_detection_mapper`: Mapper to generate user's intent labels in feed back dialog data.
- `dialog_sentiment_detection_mapper`: Mapper to generate user's sentiment labels in feed back dialog data.
- `dialog_sentiment_intensity_mapper`: Mapper to predict user's sentiment intensity (from -5 to 5 in default prompt) in feed back dialog data.
- `dialog_topic_detection_mapper`: Mapper to generate user's topic labels in feed back dialog data.
- `query_intent_detection_mapper`: Mapper to predict user's intent label in a query.
- `query_sentiment_detection_mapper`: Mapper to predict user's sentiment label ('negative', 'neutral' and 'positive') in a query.
- `query_topic_detection_mapper`: Mapper to predict user's topic label in a query.
Aggregator
- `meta_tags_aggregator`: Merge similar meta tags into one tag.
Selector
- `tags_specified_field_selector`: Select samples based on the tags of the specified field.
Grouper
- `naive_reverse_grouper`: Split batched samples into individual samples.
Bug Fixed
- Fix the wrong argument passing in `generate_qa_from_example_mapper`. #517
- Update the out-of-date Dingding QR code on the main page. #513
Acknowledgement
- @jackylee-ch made their first contribution to help fix several invalid links in the document. #521
Full Changelog: v1.0.2...v1.0.3
Release v1.0.2
Major Updates
- Added more mapper/grouper/aggregator OPs for post-tuning scenarios.
- Optimized the distributed mode performance and usability with more automatic features.
DJ-Operators
- `extract_support_text_mapper`, `relation_identity_mapper`, `python_file_mapper`, #500
- `naive_grouper`, `key_value_grouper`, #500
- `nested_aggregator`, `entity_attribute_aggregator`, `most_relavant_entities_aggregator`, #500
- `video_extract_frames_mapper`, #507
Performance
- Optimize ray mode performance, #442
- Patch for Performance Benchmark in CI/CD workflows, #506
- DJ Ray mode supports streaming loading of `jsonl` files, #515
Usability and Analysis
- support dj-install in recipe-level, #508
- support dj-analyze with --auto mode, #512
- support op-wise insight auto mining, #516
Acknowledgment
Thanks to Data-Juicer users and contributors for their helpful feedback, issues and PRs!
Release v1.0.1
Major Updates
- 🚀 Supports automatically arranging operators from fastest to slowest based on their execution speed, and also supports automating the operator batch size according to the execution speed. #464
- 🚀 [UnitTest] Performance benchmark for efficiency tests of 4 modalities. Reports will be uploaded to internal wandb server. #483
- 💥 Added some useful OPs, including the construction of DPO training data and a lightweight user-customizable OP interface. See more details below~ #491 #492 #493
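The operator reordering described above can be pictured as "probe each OP, then sort by throughput". The sketch below is an assumption about the general idea, not Data-Juicer's implementation; `measure_speed` and `order_fastest_first` are hypothetical helper names, and the sketch does not check whether reordering is safe for a given pipeline:

```python
import time


def measure_speed(op, probe_samples):
    """Return samples per second for `op` on a small probe set."""
    start = time.perf_counter()
    for sample in probe_samples:
        op(sample)
    elapsed = time.perf_counter() - start
    return len(probe_samples) / max(elapsed, 1e-9)


def order_fastest_first(speeds):
    """Sort OP names by measured speed (samples/sec), fastest first."""
    return sorted(speeds, key=speeds.get, reverse=True)
```

For example, `order_fastest_first({"a": 10.0, "b": 500.0})` puts `"b"` first, so cheap OPs can shrink the dataset before slower ones run.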
OPs
Text OPs
- `pair_preference_mapper`: Mapper to construct preference answers for QA pairs. #491
Script OPs
- `python_lambda_mapper`: Mapper for executing customized Python lambda functions on data samples. #492
- `python_file_mapper`: Mapper for executing customized Python functions on data samples. #493
Bugs Fixed
- Add an argument to control whether to open `Monitor` for data processing. It's True by default. #483
- For the mp start method of monitor, set it to "spawn" for Windows systems and "fork" for others. #483
- Update transformers version to >=4.47.0 to avoid "shape mismatch" bug from older version 4.46.3. #483
- Fix the logic errors in Turbo acceleration and batch processing, and ensure that map and filter are consistent in this part of the logic. #504
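The start-method choice for the monitor described above can be sketched with the standard `multiprocessing` API. The helper names are hypothetical and this is not the project's code; it only shows the platform-dependent selection:

```python
import multiprocessing as mp
import sys


def pick_start_method(platform=sys.platform):
    """Return "spawn" on Windows and "fork" elsewhere."""
    return "spawn" if platform.startswith("win") else "fork"


def monitor_context(platform=sys.platform):
    """Build a multiprocessing context using the selected start method."""
    return mp.get_context(pick_start_method(platform))
```

Using a context object instead of `mp.set_start_method` keeps the choice local to the monitor and leaves the global default untouched.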
Others
- Pin the PyAV version to prevent inconsistent updates. #504
- Skip some unit test for audio OPs to avoid lazy_loader failure during multiprocessing. #503
- Remove unnecessary UNFORKABLE marks for some OPs. #491
- Refine the docker image building. Add a new self-hosted runner for docker image building, optimize the building logic for auto docker image building on release, change the default full image to a GPU-version image. #494 #501
Acknowledgment
Here we thank public contributors for their PRs and issues to make Data-Juicer better!
Release v1.0.0: Refactor DJ-Dataset & DJ-Operator, Sandbox, and more exciting features!
Major Updates
- 🚀 Refactor Data-Juicer Operator & Dataset for better usability! We combine our two backends, HuggingFace Dataset and Ray Dataset, into a unified DJ-Dataset, and unify and introduce new invoking interfaces. Based on this, we add a fault-tolerant strategy during the data processing, helping users to know the actual reasons for processing failure. #359 #366
- 🧪 [Experimental] Data-Juicer Sandbox toolkit is now available! Users are allowed to develop datasets and models in a co-development way with the highly customizable Sandbox to obtain better performance. For more details, please refer to the docs. #273 #291 #312 #332 #364
- 🚀 Basic API server based on FastAPI is now available in Data-Juicer! Now users can make use of the capabilities of OPs with API service. #468
- 🚀 Support adaptive resource management:
- 💥 We presented a tutorial of Multi-modal Data Processing for Foundation Models: Practical Guidance and Use Cases on KDD'24. #310
- 💥 A lot of additions and improvements were made to OPs, DJ-Engine, and CI/CD. See more details below~
- 🛝 A playground for Data-Juicer is opened for user trial. #277 #368
OPs
Text
- `ray_document_deduplicator`: supports Ray-based distributed exact-match deduplication for text-only datasets. #263
- Support sentencepiece tokenizer for MinHash deduplicators. #269
- `generate_qa_from_text_mapper`: generates question and answer pairs from input texts. #333 #454
- `generate_qa_from_examples_mapper`: generates question and answer pairs based on examples. #338 #454
- `optimize_qa_mapper`: optimizes the question-answer pairs in question-answering samples. #338 #454
- `optimize_query_mapper`: optimizes the query in question-answering samples. #338 #454
- `optimize_response_mapper`: optimizes the response in question-answering samples. #454
- `calibrate_qa_mapper`: calibrates question-answer pairs based on reference text. #463
- `calibrate_query_mapper`: calibrates query in question-answer pairs based on reference text. #463
- `calibrate_response_mapper`: calibrates response in question-answer pairs based on reference text. #463
- `text_chunk_mapper`: splits input text to chunks. #481
- `extract_entity_attribute_mapper`: extracts attributes for given entities from the text. #481
- `extract_entity_relation_mapper`: extracts entities and relations in the text for knowledge graph. #481
- `extract_event_mapper`: extracts events and relevant characters in the text. #481
- `extract_keyword_mapper`: generates keywords for the text. #481
- `extract_nickname_mapper`: extracts nickname relationship in the text. #481
Image
- `image_face_blur_mapper`: blurs faces detected in images. #249
- `image_nsfw_filter`: keeps samples containing images with NSFW scores below the threshold. #252
- `image_watermark_filter`: keeps samples containing images with predicted watermark probabilities below the threshold. #256
- `ray_image_deduplicator`: supports Ray-based distributed exact-match deduplication for image or image-text datasets. #263
- `image_pair_similarity_filter`: keeps image pairs with image feature cosine similarity within the specified range based on a CLIP model. #393
- `image_tagging_mapper`: generates image tags from the input images. #423
- `image_face_count_filter`: keeps samples containing images with face counts within the specified range. #446
Video
- `video_face_blur_mapper`: blurs faces detected in videos. #253
- `video_remove_watermark_mapper`: removes the watermarks in given regions from the videos. #236
- `video_nsfw_filter`: keeps samples containing videos with NSFW scores below the threshold. #252
- `video_watermark_filter`: keeps samples containing videos with predicted watermark probabilities below the threshold. #256
- `ray_video_deduplicator`: supports Ray-based distributed exact-match deduplication for video or video-text datasets. #263
- `video_tagging_from_frames_filter`: keeps samples containing videos with given tags. #260
- `video_captioning_from_frames_mapper`: generates samples whose captions are generated based on an image-to-text model and sampled video frames. Captions from different frames will be concatenated into a single string. #257
- `video_captioning_from_summarizer_mapper`: generates video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...). #250
- `video_motion_score_raft_filter`: keeps samples with video motion scores (based on RAFT model) within a specific range. #478
- Enhance the `video_motion_score_filter` to support float sampling FPS, frame resizing, optical flow magnitude normalization, and so on. #361
Misc.
- Switch face detection used in 3 OPs (`image_face_ratio_filter`, `image_face_blur_mapper`, `video_face_blur_mapper`) from `dlib` to `OpenCV` to avoid dependency problems. #320
- Deduplicators for multimodal datasets are allowed to consider text information as well. #313
- Support batched processing for some OPs. #406 #435
Others (Engine, Job Control and Tools)
- Support more multimodal (video) dataset conversion tools: MSR-VTT #248
- Support distributed processing script for Slurm. #242
- Support Minhash-LSH deduplication tools based on Spark. #290
- Enable GPU usage for Ray executor. #274
- Add debug mode for Data-Juicer. #303
- Add video generation tools for several metrics. #273 #312
- Deploy a self-hosted runner for unit tests and enable unit tests for Ray mode. #304
- Add sampled frames from videos for video OPs to support OP fusion. #271
- Allow to save stats for each OP respectively by specifying the exporting paths for them. #309
- Add a new field to record the source files of multimodal data when they are augmented or regenerated by some OPs, so it's convenient to trace back. #317
- Support `turbo` mode to disable some processing-unrelated functions to maximize the processing speed and save resource utilization. #402
- Update type annotations from `jsonargparse` to `Pydantic`. #422
- Add a Monitor module to monitor the resource utilization during data processing for each OP. #429
- Allow lazy importing for third-party libraries and installing dependencies if they are not installed. #414 #443
- Allow batched processing for all OPs based on the single-sample version of compute_stats/process methods to avoid modifying them to a batched version manually. #448
- Enable unit test coverage report. #460
- Support invoking API models for interaction with OpenAI-compatible APIs. #463 #479
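The item above about batched processing built on single-sample methods can be sketched as a wrapper that converts a column-oriented batch (a dict of lists) into per-sample dicts and back. The wrapper name and the example OP are hypothetical; this illustrates the idea and is not the Data-Juicer implementation:

```python
def to_batched(process_single):
    """Wrap a single-sample function so it accepts a dict-of-lists batch."""

    def process_batched(batch):
        keys = list(batch)
        n = len(batch[keys[0]]) if keys else 0
        # Rebuild one dict per sample, apply the single-sample function.
        outputs = [
            process_single({k: batch[k][i] for k in keys}) for i in range(n)
        ]
        out_keys = list(outputs[0]) if outputs else keys
        # Transpose the per-sample results back into columns.
        return {k: [o[k] for o in outputs] for k in out_keys}

    return process_batched


def lowercase(sample):  # example single-sample OP
    return {**sample, "text": sample["text"].lower()}
```

With this shape, an OP author only writes `lowercase`, and `to_batched(lowercase)` handles the batch layout.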
Document Updates
- Refine documentation system based on Sphinx. #245
- Regular document updates. #234 #246
- Update the class importing and document building logics for better automation. #299
- Reorganize the operator documents for better reading. #472
Bugs Fixed
- Fix the bug of non-existent videos returned by the video splitting function given a short duration. #243
- Fix the bug that the produced multimodal data would be stored in nested dirs in different ops. #247
- Fix some problems in demos. #244
- Fix "Undefined punctuation_pattern" error in two OPs. #301
- Exceptions and errors can be reraised to the upper level and the status code can be returned to the system correctly. #287
- Fix the bug of out-of-work type hint checking for config files. #302
- Fix the bug of parameters in the base classes that can not be parsed in some OPs. #311
- Fix the memory leaking of video OPs. #374
- Fix the bug of two OPs (`video_aesthetics_filter` and `image_diffusion_mapper`) that can not make use of GPUs. #389
- Fix the bug of checkpoints not being restored correctly when the current process list has fewer OPs than the previous one. #391
Acknowledgment
Here we thank public contributors for their PRs to make Data-Juicer better!
- @chg0901 helps to fix typos in documents. #237
- @lingzhq helps to update the paper list in Awesome Data-Model Co-Development of MLLMs. #289
- @shiweijiezero helps fix the bugs in updating the data keys. #300
- @seanzhang-zhichen helps to support multiple patterns for `replace_content_mapper`. #319
- @simplaj helps to fix a bug of a non-predefined attribute for `video_captioning_from_summarizer_mapper`. #343
- @zhenqincn helps to reorganize the paper list and add more papers from our survey in Awesome Data-Model Co-Development of MLLMs. #352 #381 #456 #461
- @2108038773 helps to add `trust_remote_code` argument for some public models on HuggingFace. #382 #385
- @TobyJasper helps to fix typos in documents and contribute a new OP `image_face_count_filter`. #392 #452
- @co63oc helps to fix some typos in documents and code. #427
Release v0.2.0: Multimodal Support & DJ-SORA
New Features
- 🚀 We introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models. #227
- 🚀 We introduce hundreds of dedicated video, image, audio, text, and other multi-modal data processing operators and tools.
- 💥 Our paper has been accepted by SIGMOD'24 industrial track! #211
- 💥 "BetterMixture" — Our second data-centric LLM competition has kicked off and is about to end soon. #174
New OPs
Multimodal
- `video_frames_text_similarity_filter`: keeps samples whose similarities between sampled video frame images and text are within a specific range. #227
- `video_tagging_from_frames_mapper`: generates video tags from frames extracted from the video. #227
- `video_tagging_from_audio_mapper`: generates video tags from audio streams extracted from videos. #227
- `video_captioning_from_video_mapper`: generates captions from frame images extracted from video to augment datasets. #227
- `video_captioning_from_audio_mapper`: captions a video according to its audio streams. #227
- `image_captioning_mapper`: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
- `image_captioning_from_gpt4v_mapper`: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
- `image_diffusion_mapper`: generates and augments the images based on the Stable Diffusion model and their original images and texts. This OP will increase the number of samples in the dataset. #200
Video
Filter
- `video_duration_filter`: keeps samples whose videos' durations are within a specified range. #227
- `video_aspect_ratio_filter`: filters samples according to the aspect ratios of videos (a fraction of width by height, r=w/h) in them. #227
- `video_resolution_filter`: filters samples according to the resolution of videos in them. #227
- `video_ocr_area_ratio_filter`: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227
- `video_aesthetics_filter`: filters samples according to the aesthetics score of frame images extracted from videos. #227
- `video_motion_score_filter`: keeps samples with video motion scores within a specific range. #227
Mapper
- `video_split_by_scene_mapper`: splits videos into scene clips. #227
- `video_split_by_duration_mapper`: splits videos by specified duration interval. #227
- `video_split_by_key_frame_mapper`: splits videos by their keyframes. #227
- `video_resize_aspect_ratio_mapper`: resizes aspect ratios of videos (a fraction of width by height, r=w/h) to a specified range. #227
- `video_resize_resolution_mapper`: maps videos to ones with a given resolution range. #227
- `video_ffmpeg_wrapped_mapper`: a wrapper to apply ffmpeg to video data more conveniently. #227
Deduplicator
- `video_deduplicator`: deduplicates samples at document-level using exact matching of videos between documents. #227
Audio
- `audio_duration_filter`: keeps samples whose audios' durations are within a specified range. #177
- `audio_size_filter`: keeps samples whose audios' sizes are within a specified range. #184
- `audio_nmf_snr_filter`: keeps samples whose audios' Signal Noise Ratios (computed based on Non-Negative Matrix Factorization algorithm) are within a specified range. #189
- `audio_ffmpeg_wrapped_mapper`: a wrapper to apply ffmpeg to audio data more conveniently. #227
Image
- `image_blur_mapper`: adds random noises to images to blur them. #180
- `image_aesthetics_filter`: filters samples according to the aesthetics scores of images. #227
Document Updates
- "Bad" Data Exhibition EN ZH: shows how Data-Juicer finds those "bad" data and how they look like.
- Awesome LLM Data EN: a collection of awesome LLM datasets with fine-grained tags.
- Developer Guide enhancement EN ZH: adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 #220
- OP Insight Visualization Demo code: adds a demo to visualize how each OP works.
Bugs Fixed
- Fix stats computation error in the ray mode due to the inappropriate initialization method. #173
- Fix the bug that some images will be lost when converting their paths to absolute paths. #178
- Fix the dependency problems of OPs who depend on other OPs. #181
- Fix the bug that the `predict.py` tool gets stuck on the help page. #183
- Fix `face_area_filter`: constrains the detection coordinates within the image. #202
- Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. #195
- Fix or update invalid links in Data-Juicer. #201 #219
Others
- Optimize the model management module. #196 #227
- Optimize the unit test actions. #195 #196 #216 #227
- Optimize the multiprocessing strategy; model inference efficiency can also increase thanks to GPU support. #203 #217 #222 #227
- Update the docker image with JDK. #208
- Support more multimodal (video) dataset conversion tools: #227
- InternVid: 234M video-caption data
- Youku-mPLUG: 36TB video-caption data
- Video-ChatGPT: 100k video-instruction data
- Optimize the generated multimodal data storage. #227
- Support running data-juicer process jobs on Aliyun PAI-DLC. #227
- Better support for multi-machine distributed data processing in Ray mode. #227
Acknowledgment
Here we thank public contributors for their PRs to make Data-Juicer better!
Release v0.1.3: support more Python versions; support multimodal data; more OPs; bugs fixed
New Features
- Data-Juicer now supports Python3.7-3.10!
- We released a pybind version of simhash-py library named `simhash-pybind` to solve the Python version limitation problem.
- We test several version-dependent third-party libraries (e.g. dill, kenlm, ...) and validate their availability on different Python versions.
- Multimodal dataset analysis and processing are now supported. #64 #91 #95 #106
- A novel intermediate multimodal sample format: using some special tokens to split text chunks and represent non-text information.
- Several dataset format conversion tools for popular multimodal datasets: LLaVA, MMC4, WavCaps, ......
- Lots of multimodal OPs are also released: see categories Image and Multimodal in the section New OPs below.
- Auto-HPO tools are now available, which can help users find better hyperparameters for OPs according to specified objective functions or with simple 3-sigma rules only. #65 #140
- Some content cleaning mappers (e.g. email, IP, ...) now support replacing regex patterns with specified strings, not just with empty ones. Additionally, a general version OP is implemented as a new OP `replace_content_mapper`. #143
- Some collectors, metrics, and drawing functions are added to the analysis module to help users measure the token distribution of a single dataset or distribution difference between different datasets. #160
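The "simple 3-sigma rules" mentioned for the Auto-HPO tools can be read as deriving filter thresholds from the mean and standard deviation of a stat. The sketch below is one plausible reading; the helper name is hypothetical, and whether the tools use population or sample standard deviation is an assumption here:

```python
import statistics


def three_sigma_bounds(values):
    """Return (mean - 3*std, mean + 3*std) using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std
```

The returned pair could then serve as the min/max hyperparameters of a filter on that stat, keeping samples that fall inside the interval.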
New OPs
Text
- `chinese_convert_mapper`: converts Chinese between Traditional Chinese, Simplified Chinese, and Japanese Kanji (by opencc). #51
- `remove_non_chinese_character_mapper`: removes non-Chinese characters in text samples. #51
- `text_action_filter`: keeps samples containing action verbs in their texts. #122
- `text_entity_dependency_filter`: keeps samples containing entity nouns related to other tokens in the dependency tree of the texts. #122
- `replace_content_mapper`: replaces all content in the text that matches a specific regular expression pattern with a designated replacement string. #143
- `remove_repeat_sentences_mapper`: removes repeated sentences in the text. #149
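The behaviour described for `replace_content_mapper` (replace every regex match with a designated string, defaulting to removal) can be sketched with the standard `re` module. The function name, the email pattern, and the `<EMAIL>` placeholder are illustrative assumptions, not the OP's actual parameters:

```python
import re

# Simplified email pattern used only for illustration.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def replace_content(text, pattern, repl=""):
    """Replace every match of `pattern` in `text` with `repl`."""
    return re.sub(pattern, repl, text)
```

Passing a placeholder such as `"<EMAIL>"` keeps the sentence structure intact, while the default empty string reproduces the older "remove only" behaviour.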
Image
- `image_shape_filter`: keeps samples containing images with widths and heights within the specified ranges. #74
- `image_aspect_ratio_filter`: keeps samples containing images with aspect ratios (w/h) within the specified range. #64
- `image_size_filter`: keeps samples containing images whose sizes in bytes are within the specified range. #73
- `face_area_filter`: keeps samples containing images with face area ratios within the specified range. #110
- `image_deduplicator`: deduplicates samples at document-level using exact matching of images between documents. #72
Multimodal
- `image_text_similarity_filter`: keeps samples with image-text feature cosine similarity within the specified range based on a CLIP model. #69
- `image_text_matching_filter`: keeps samples with image-text classification matching scores within the specified range based on a BLIP model. #100
- `phrase_grounding_recall_filter`: keeps samples whose locating/grounding recalls of phrases extracted from text in the images are within a specified range. #139
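The "cosine similarity within the specified range" criterion used by the similarity filters above can be sketched as follows. The embeddings are assumed to come from a model such as CLIP and are plain lists here; the function names and default thresholds are hypothetical:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keep_sample(similarity, min_score=0.1, max_score=1.0):
    """Keep the sample only if its similarity lies inside the closed range."""
    return min_score <= similarity <= max_score
```

A filter built this way would compute the similarity once per image-text pair and store it as a stat before applying `keep_sample`.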
Bugs fixed
- Pin `pandas==2.0.0` and `fsspec==2023.3.0` to avoid unexpected errors from third-party dependencies. #38 #42
- Fix the bug where OPs `nlpaug_en_mapper` and `nlpcda_zh_mapper` generate indefinite numbers of augmented samples. #76
- Fix the bug where `maximum_line_length_filter` might generate unaligned types of stats (int vs. float), which leads to an error when processing datasets. #147
- Fix the bug of missing attribute dataset_dir when the input dataset path is remote or a mixture of several datasets. #155 #157
- Fix the bug of commandline arguments parsing error in some cases. #108 #165
- Store simhash value as string type to avoid errors from PyArrow. #168 #170
Others
- Dependency importing optimization: only require and import some dependencies when using. #35 #82
- Release demos and datasets on HuggingFace, and release models trained with our refined datasets on both ModelScope and HuggingFace. #42 #54
- Optimize the cache directory selection logic. #43
- Support limiting the number of samples when mixing datasets. #86
- Avoid extra unnecessary model preparation when enabling tokenization in some OPs. #99
- OP `language_id_score_filter` supports keeping samples in multiple languages now. #125 #151
Acknowledgement
Here we thank public contributors for their PRs to make Data-Juicer better!
- @JONGSKY helps to remove some unnecessary code. #85
- @xuruidong helps to fix several broken links in the README doc. #142
Release v0.1.2: more core functions are available now.
New OPs
- `nlpaug_en_mapper`: simple data augmentation using nlpaug library for English corpus. #17
- `nlpcda_zh_mapper`: simple data augmentation using nlpcda library for Chinese corpus. #17
- `token_num_filter`: filter out samples by the number of tokens in them. HF tokenizers are supported. #24
New features
- OP Fusion #14
- Now Filters that share the same contextual variables can be fused into one OP, saving at most 25% time when processing datasets.
- Cache management #19
- Cache management works now for our Data-Juicer due to the new serialization method being applied.
- Cache compression is supported: it will automatically compress caches when they are useless and decompress them if needed, which saves at most 50% disk space.
- Distributed data processing with Ray is supported now. #21
- Config sys optimization:
Others
- Replace original string constants with constant enums. #13
- Expand the checkpoint protection range to cover the exporting process. #14
- Remove extra intermediate variables storage in `document_simhash_deduplicator` to save more memory. #14
- Docs updates. #15 #16
- PyPi package is available. You can install data-juicer by `pip install py-data-juicer` now. #23
- Docker building is available now. The official docker image for Docker Hub is in progress. #23
- Deploy the unit tests for Data-Juicer. #29