v0.1.8
🎯 Highlights of v0.1.8
Version 0.1.8 marks the initial release of Curator, introducing core functionalities for managing and processing LLM completions for synthetic data generation. This release establishes a foundation with two main components: a completions module for efficient batch processing with OpenAI models, and a dataset viewer for visualizing and managing completion results. Key features include batch processing support, configurable model parameters, streaming capabilities, and metadata management through SQLite integration. The release also prioritizes developer experience with Colab compatibility and robust documentation.
⚡ Completions Module
- Reorganized prompting logic (#2, #3) and improved OpenAI integration (#4, #28)
- Added configurable temperature and top-p parameters (#77)
- Implemented batch size configuration (#70)
- Added fallback token counting with tiktoken (#59)
- Improved dataset management with List objects (#9)
- Added configurable working directory support (#53)
- Fixed Colab compatibility issues (#69, #72)
- Enhanced request/response handling (#65)
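The fallback token counting mentioned above (#59) follows a catch-and-estimate pattern. A minimal sketch of that idea, not Curator's actual implementation: the `count_tokens` name and the ~4-characters-per-token heuristic are illustrative assumptions.

```python
def count_tokens(text, encoder=None):
    """Count tokens with an exact tokenizer, falling back to a heuristic.

    `encoder` stands in for a tiktoken-style encoding object. When it is
    missing or raises (the TypeError case fixed in #59), estimate roughly
    4 characters per token, a common rough heuristic for English text.
    """
    try:
        return len(encoder.encode(text))
    except (TypeError, AttributeError):
        return max(1, len(text) // 4)
```

The key design point is that token counting is advisory (for rate limiting and batching), so a cheap estimate is an acceptable degradation when the exact tokenizer is unavailable.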
🎨 Curator Viewer
- Reorganized dataset viewer architecture (#8)
- Added streaming dataset UI functionality (#14)
- Implemented file streaming for batch mode (#79)
- Added metadata SQLite database integration (#10)
- Fixed compilation errors (#23)
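The metadata SQLite integration (#10), together with run_hash deduplication (#17) and the batch_mode field (#56) from the changelog below, suggests a small per-run metadata table. A sketch under assumed column names; the real schema is not part of these notes and may differ:

```python
import sqlite3

# Hypothetical schema: run_hash and batch_mode come from PR titles
# (#17, #56); the remaining columns are illustrative assumptions.
conn = sqlite3.connect(":memory:")  # the viewer would use a file on disk
conn.execute(
    """CREATE TABLE IF NOT EXISTS runs (
        run_hash   TEXT PRIMARY KEY,
        model_name TEXT,
        batch_mode INTEGER
    )"""
)
# INSERT OR IGNORE makes re-runs with an existing run_hash idempotent.
conn.execute(
    "INSERT OR IGNORE INTO runs VALUES (?, ?, ?)",
    ("abc123", "gpt-4o-mini", 1),
)
conn.commit()
row = conn.execute(
    "SELECT batch_mode FROM runs WHERE run_hash = ?", ("abc123",)
).fetchone()
```

Keying the table on a run hash lets the viewer recognize an existing run and reuse its cached results rather than recording a duplicate entry.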
📚 Documentation & Packaging
- Added Apache 2.0 license (#12)
- Improved documentation and README (#1, #26)
- Properly packaged as bespokelabs-curator (#11)
- Added repository logo (#63)
- Updated API key documentation (#32)
What's Changed
- Add and update documentation. by @madiator in #1
- Refactor prompting logic to a class and add a test. by @madiator in #2
- Rename prompt_caller to prompter. by @madiator in #3
- Online processing with OpenAI by @RyanMarten in #4
- rename cache with name first, fingerprint second by @RyanMarten in #5
- Minor refactoring and cleanups. by @madiator in #6
- init commit on bespoke-dataset-viewer by @CharlieJCJ in #8
- Use List of objects instead of HF dataset and remove flatten by @vutrung96 in #9
- add metadata sqlite db by @CharlieJCJ in #10
- Various small fixes by @vutrung96 in #13
- Add apache 2.0 license by @vutrung96 in #12
- Properly package the repo into bespokelabs-curator by @vutrung96 in #11
- Fix file dependency poetry lock by @CharlieJCJ in #15
- Metadata DB existing run_hash by @CharlieJCJ in #17
- Update some references to bella and update readme. by @madiator in #16
- Merge in Ryan's abstraction by @vutrung96 in #18
- Remove OH v3 by @vutrung96 in #19
- Add request payload to GenericResponse by @vutrung96 in #20
- Add an option to keep the dataset in memory for to_huggingface() by @vutrung96 in #21
- Streaming dataset UI by @CharlieJCJ in #14
- dataset viewer compile error fix by @CharlieJCJ in #23
- Fix broken pytest by @vutrung96 in #24
- Explicitly print out data points in camel.py by @vutrung96 in #25
- [add] build for dataset viewer before releasing by @CharlieJCJ in #22
- update README documentation by @CharlieJCJ in #26
- Update README.md by @CharlieJCJ in #27
- Add OpenAIBatch backend and refactor RequestProcessor to be compatible by @RyanMarten in #28
- Add configurable logging (bespokelabs.curator) by @RyanMarten in #29
- improved readme for supplying api_key by @CharlieJCJ in #32
- Fix issues with no dataset passed to batch and logging by @RyanMarten in #36
- Remove unused dependencies: litellm and ipython. by @madiator in #37
- Rename from poetry.py to poem.py to reduce confusion with the poetry tool by @madiator in #38
- Catch TypeError when using tiktoken and fall back to heuristic token counting by @RyanMarten in #59
- Better example for generating poems. by @madiator in #43
- Allow specifying working directory in case users want to use a different working directory for their cache by @vutrung96 in #53
- Add batch_mode as a field in metadata db by @CharlieJCJ in #56
- Merge main to dev by @CharlieJCJ in #66
- Set max line length to 80 for black by @vutrung96 in #64
- Remove the use of asyncio.run to make asyncio work in colab by @vutrung96 in #69
- Package versioning downgrade for colab by @CharlieJCJ in #67
- Add Prompter arg for batch size by @RyanMarten in #70
- GenericRequest and GenericResponse refactor by @RyanMarten in #65
- Prevent .arrow file getting in an invalid state and generate different .arrow based on parse_func by @RyanMarten in #61
- Add logo by @madiator in #63
- Fix types in jsonl files by @RyanMarten in #75
- Add temperature and top-p by @RyanMarten in #77
- Fix asyncio with nest_asyncio by @vutrung96 in #72
- [curator-viewer] file streaming when `batch=True`, new response format adaptation by @CharlieJCJ in #79
- Bump version to 0.1.7 by @vutrung96 in #83
- 0.1.7 by @vutrung96 in #81
- Pre-emptively remove invalid dataset file when prompt_func detected as invalid by @RyanMarten in #84
- Fix JSON parsing from model by @vutrung96 in #85
- Set RLIMIT_NOFILE to avoid "too many files open" errors by @RyanMarten in #73
- Followup fixing pydantic validation by @RyanMarten in #89
- Fixed and refactored sort and filter by @CharlieJCJ in #91
- better no data view and error handling in dataset viewer by @CharlieJCJ in #95
- Cleanup build script and .gitignore for build artifacts by @CharlieJCJ in #71
- Response format for batch bug fix by @RyanMarten in #97
- UI minor type fix when npm run build by @CharlieJCJ in #98
- Fix a confusing error due to asyncio.run in except block. by @vutrung96 in #99
- Explicitly close AsyncClient to avoid getting asyncio event loop is closed issues by @vutrung96 in #101
- 0.1.8 by @CharlieJCJ in #100
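The RLIMIT_NOFILE fix above (#73) corresponds to a standard pattern using the stdlib `resource` module (POSIX-only). A sketch of the general technique, not Curator's exact code:

```python
import resource

# Raise the soft limit on open file descriptors up to the hard limit,
# avoiding "too many open files" errors when many requests and cache
# files are in flight at once (the situation addressed by #73).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
new_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
```

An unprivileged process may always raise its soft limit up to the hard limit, so this needs no special permissions.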
New Contributors
- @madiator made their first contribution in #1
- @vutrung96 made their first contribution in #9
Full Changelog: https://github.com/bespokelabsai/curator/commits/v0.1.8