v0.3.0
What's Changed
- Update README by @ryantwolf in #6
- [Tutorials] Add a readme file for the TinyStories tutorial by @Maghoumi in #5
- Add workflow for running cpu pytests by @ayushdg in #13
- Add pre-commit style checks by @ayushdg in #14
- Add citation by @ryantwolf in #15
- Fix Noisy CUDA Shutdown by @ryantwolf in #20
- Bump Python and RAPIDS versions by @ryantwolf in #16
- Add batched decorator by @ryantwolf in #18
- Add issue templates by @ayushdg in #22
- Add dependency to fix justext by @ryantwolf in #24
- Fix metadata inference with pandas and dask by @ryantwolf in #35
- Disable PyTorch Compile Multiprocessing by @ryantwolf in #34
- Improve speed of AddId module by @ryantwolf in #36
- Make GPU dependencies optional by @ayushdg in #27
- Fix failing GPU tests with latest pandas bump by @ayushdg in #41
- Adds Nemo Curator K8s example by @terrykong in #40
- Move common dedup utils and remove unused code by @ayushdg in #42
- Fix lang id example by @ryantwolf in #37
- Add dataset blending tool by @ryantwolf in #32
- High level fuzzy duplicates module by @ayushdg in #46
- Fix indexing in PII Modifier by @ryantwolf in #55
- Disable string conversion globally by @ryantwolf in #56
- Fix issue #43 (empty files creation) and improve reading/writing speed by @miguelusque in #57
- [Tutorials] Add a tutorial for PEFT data curation by @Maghoumi in #45
- Only import PII constants during Curator import by @ayushdg in #61
- Align extract_partitioning_index logic with upstream shuffling by @rjzamora in #60
New Contributors
- @Maghoumi made their first contribution in #5
- @terrykong made their first contribution in #40
- @miguelusque made their first contribution in #57
- @rjzamora made their first contribution in #60
Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.3.0