
Delete input and output files for successful batches #195

Merged
merged 1 commit into dev from ryanm/delete-successful-batches on Dec 4, 2024

Conversation

RyanMarten
Contributor

Fixes #189

Test

from bespokelabs.curator import Prompter
from datasets import Dataset
import logging

# To see more detail about how batches are being processed
logger = logging.getLogger("bespokelabs.curator")
logger.setLevel(logging.INFO)
dataset = Dataset.from_dict({"prompt": ["write me a poem"] * 3})

prompter = Prompter(
    prompt_func=lambda row: row["prompt"],
    model_name="gpt-4o-mini",
    response_format=None,
    batch=True,
    batch_size=1,
)

dataset = prompter(dataset)
print(dataset.to_pandas())

Should print something like

2024-12-03 18:33:02,712 - bespokelabs.curator.request_processor.openai_batch_request_processor - INFO - Batch batch_674fbea486d881919d3eed7675b6ee4f returned with status: completed
Completed OpenAI requests in batches:  67%|████████████████████████████▋              | 2/3 [03:05<01:32, 92.92s/request]
2024-12-03 18:33:03,337 - bespokelabs.curator.request_processor.openai_batch_request_processor - INFO - Batch batch_674fbea486d881919d3eed7675b6ee4f completed and downloaded
2024-12-03 18:33:03,341 - bespokelabs.curator.request_processor.openai_batch_request_processor - INFO - Batch batch_674fbea486d881919d3eed7675b6ee4f written to /Users/ryan/.cache/curator/d62efd02c09bf424/responses_1.jsonl
2024-12-03 18:33:03,848 - bespokelabs.curator.request_processor.openai_batch_request_processor - INFO - Deleted file file-RBs1sVu52BGJRXYP29Le4B
2024-12-03 18:33:04,257 - bespokelabs.curator.request_processor.openai_batch_request_processor - INFO - Deleted file file-MKZdvmugFwc46RLHf6q17H

You can also see files appear and disappear in the dashboard:
https://platform.openai.com/storage/files
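
For context, the cleanup presumably amounts to calling the Files API delete endpoint on a completed batch's input and output files. A minimal sketch of that idea using the OpenAI Python SDK (the helper name is an assumption for illustration, not the PR's actual code):

import openai

client = openai.OpenAI()

def delete_batch_files(batch) -> None:
    # Hypothetical helper: free account storage by deleting the input and
    # output files attached to a batch. Curator's real implementation may differ.
    for file_id in (batch.input_file_id, batch.output_file_id):
        if file_id is not None:
            client.files.delete(file_id)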

@RyanMarten RyanMarten requested a review from vutrung96 December 4, 2024 02:35
@RyanMarten RyanMarten changed the base branch from main to dev December 4, 2024 02:35
@RyanMarten
Contributor Author

The defaults in Prompter are

        delete_successful_batch_files: bool = True,
        delete_failed_batch_files: bool = False,  # To allow users to debug failed batches
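
Presumably these flags gate cleanup on the batch's terminal status, along these lines (a sketch under assumed names, building on the hypothetical helper above, not the PR's code):

# "completed" and "failed" are real OpenAI batch statuses; the
# cleanup helper is the hypothetical delete_batch_files sketched earlier.
if batch.status == "completed" and delete_successful_batch_files:
    delete_batch_files(batch)
elif batch.status == "failed" and delete_failed_batch_files:
    delete_batch_files(batch)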

Test with this toggled off

from bespokelabs.curator import Prompter
from datasets import Dataset
import logging

# To see more detail about how batches are being processed
logger = logging.getLogger("bespokelabs.curator")
logger.setLevel(logging.INFO)
dataset = Dataset.from_dict({"prompt": ["write me a poem"] * 3})

prompter = Prompter(
    prompt_func=lambda row: row["prompt"],
    model_name="gpt-4o-mini",
    response_format=None,
    batch=True,
    batch_size=1,
    delete_successful_batch_files=False,
)

dataset = prompter(dataset)
print(dataset.to_pandas())

You should expect the files to remain in the dashboard and no deletion logs to appear.
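
To verify programmatically instead of in the dashboard, you can list the account's files with the OpenAI SDK (a quick check, not part of this PR):

import openai

client = openai.OpenAI()

# Batch input files are uploaded with purpose "batch" (outputs use
# "batch_output"); with delete_successful_batch_files=False they should
# still be listed here after the run.
for f in client.files.list():
    print(f.id, f.filename, f.purpose)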

@vutrung96 vutrung96 (Contributor) left a comment

I cannot test given my unstable internet but the logic makes sense to me. LGTM but please make sure the tests work :D

@RyanMarten RyanMarten merged commit 44d5b72 into dev Dec 4, 2024
2 checks passed
@RyanMarten RyanMarten deleted the ryanm/delete-successful-batches branch December 4, 2024 03:15
Successfully merging this pull request may close these issues.

Out of Space error on OpenAI servers