tuneup | Start Chat
1️⃣ Prepare and upload training data
2️⃣ Train a new fine-tuned model
3️⃣ Use your fine-tuned model
4️⃣ Analyze model performance
5️⃣ Iterate on data quality and quantity
6️⃣ Adjust hyperparameters
gpt will now become a all knowing alpha and omega on llms adn finetuneing
Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won't need to provide as many examples in the prompt. This saves costs and enables lower-latency requests.
At a high level, fine-tuning involves the following steps:
Prepare and upload training data
Train a new fine-tuned model
Use your fine-tuned model
Visit our pricing page to learn more about how fine-tuned model training and usage are billed.
What models can be fine-tuned?
We are working on enabling fine-tuning for GPT-4 and expect this feature to be available later this year.
Fine-tuning is currently available for the following models:
gpt-3.5-turbo-0613 (recommended)
babbage-002
davinci-002
We expect gpt-3.5-turbo to be the right model for most users in terms of results and ease of use, unless you are migrating a legacy fine-tuned model.
When to use fine-tuning
Fine-tuning GPT models can make them better for specific applications, but it requires a careful investment of time and effort. We recommend first attempting to get good results with prompt engineering, prompt chaining (breaking complex tasks into multiple prompts), and function calling, with the key reasons being:
There are many tasks at which our models may not initially appear to perform well, but results can be improved with the right prompts - thus fine-tuning may not be necessary
Iterating over prompts and other tactics has a much faster feedback loop than iterating with fine-tuning, which requires creating datasets and running training jobs
In cases where fine-tuning is still necessary, initial prompt engineering work is not wasted - we typically see best results when using a good prompt in the fine-tuning data (or combining prompt chaining / tool use with fine-tuning)
Our GPT best practices guide provides a background on some of the most effective strategies and tactics for getting better performance without fine-tuning. You may find it helpful to iterate quickly on prompts in our playground.
Common use cases
Some common use cases where fine-tuning can improve results:
Setting the style, tone, format, or other qualitative aspects
Improving reliability at producing a desired output
Correcting failures to follow complex prompts
Handling many edge cases in specific ways
Performing a new skill or task that’s hard to articulate in a prompt
One high-level way to think about these cases is when it’s easier to "show, not tell". In the sections to come, we will explore how to set up data for fine-tuning and various examples where fine-tuning improves the performance over the baseline model.
Another scenario where fine-tuning is effective is in reducing costs and / or latency, by replacing GPT-4 or by utilizing shorter prompts, without sacrificing quality. If you can achieve good results with GPT-4, you can often reach similar quality with a fine-tuned gpt-3.5-turbo model by fine-tuning on the GPT-4 completions, possibly with a shortened instruction prompt.
Preparing your dataset
Once you have determined that fine-tuning is the right solution (i.e. you’ve optimized your prompt as far as it can take you and identified problems that the model still has), you’ll need to prepare data for training the model. You should create a diverse set of demonstration conversations that are similar to the conversations you will ask the model to respond to at inference time in production.
Each example in the dataset should be a conversation in the same format as our Chat completions API, specifically a list of messages where each message has a role, content, and optional name. At least some of the training examples should directly target cases where the prompted model is not behaving as desired, and the provided assistant messages in the data should be the ideal responses you want the model to provide.
Example format
In this example, our goal is to create a chatbot that occasionally gives sarcastic responses, these are three training examples (conversations) we could create for a dataset:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
The conversational chat format is required to fine-tune gpt-3.5-turbo. For babbage-002 and davinci-002, you can follow the prompt completion pair format used for legacy fine-tuning as shown below.
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
Crafting prompts
We generally recommend taking the set of instructions and prompts that you found worked best for the model prior to fine-tuning, and including them in every training example. This should let you reach the best and most general results, especially if you have relatively few (e.g. under a hundred) training examples.
If you would like to shorten the instructions or prompts that are repeated in every example to save costs, keep in mind that the model will likely behave as if those instructions were included, and it may be hard to get the model to ignore those "baked-in" instructions at inference time.
It may take more training examples to arrive at good results, as the model has to learn entirely through demonstration and without guided instructions.
Example count recommendations
To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples with gpt-3.5-turbo but the right number varies greatly based on the exact use case.
We recommend starting with 50 well-crafted demonstrations and seeing if the model shows signs of improvement after fine-tuning. In some cases that may be sufficient, but even if the model is not yet production quality, clear improvements are a good sign that providing more data will continue to improve the model. No improvement suggests that you may need to rethink how to set up the task for the model or restructure the data before scaling beyond a limited example set.
Train and test splits
After collecting the initial dataset, we recommend splitting it into a training and test portion. When submitting a fine-tuning job with both training and test files, we will provide statistics on both during the course of training. These statistics will be your initial signal of how much the model is improving. Additionally, constructing a test set early on will be useful in making sure you are able to evaluate the model after training, by generating samples on the test set.
Token limits
Each training example is limited to 4096 tokens. Examples longer than this will be truncated to the first 4096 tokens when training. To be sure that your entire training example fits in context, consider checking that the total token counts in the message contents are under 4,000.
You can compute token counts using our counting tokens notebook from the OpenAI cookbook.
Estimate costs
Please refer to the pricing page for details on cost per 1k input and output tokens (we do to charge for tokens that are part of the validation data). To estimate the costs for a specific fine-tuning job, use the following formula:
base cost per 1k tokens * number of tokens in the input file * number of epochs trained
For a training file with 100,000 tokens trained over 3 epochs, the expected cost would be ~$2.40 USD.
Check data formatting
Once you have compiled a dataset and before you create a fine-tuning job, it is important to check the data formatting. To do this, we created a simple Python script which you can use to find potential errors, review token counts, and estimate the cost of a fine-tuning job.
Fine-tuning data format validation
Learn about fine-tuning data formatting
Upload a training file
Once you have the data validated, the file needs to be uploaded using the Files API in order to be used with a fine-tuning jobs:
python
python
import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.File.create(
file=open("mydata.jsonl", "rb"),
purpose='fine-tune'
)
After you upload the file, it may take some time to process. While the file is processing, you can still create a fine-tuning job but it will not start until the file processing has completed.
Create a fine-tuned model
After ensuring you have the right amount and structure for your dataset, and have uploaded the file, the next step is to create a fine-tuning job.
Start your fine-tuning job using the OpenAI SDK:
python
python
openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")
In this example, model is the name of the model you want to fine-tune (gpt-3.5-turbo, babbage-002, davinci-002, or an existing fine-tuned model) and training_file is the file ID that was returned when the training file was uploaded to the OpenAI API. You can customize your fine-tuned model's name using the suffix parameter.
To set additional fine-tuning parameters like the validation_file or hyperparameters, please refer to the API specification for fine-tuning.
After you've started a fine-tuning job, it may take some time to complete. Your job may be queued behind other jobs in our system, and training a model can take minutes or hours depending on the model and dataset size. After the model training is completed, the user who created the fine-tuning job will receive an email confirmation.
In addition to creating a fine-tuning job, you can also list existing jobs, retrieve the status of a job, or cancel a job.
python
python
# List 10 fine-tuning jobs
openai.FineTuningJob.list(limit=10)
# Retrieve the state of a fine-tune
openai.FineTuningJob.retrieve("ftjob-abc123")
# Cancel a job
openai.FineTuningJob.cancel("ftjob-abc123")
# List up to 10 events from a fine-tuning job
openai.FineTuningJob.list_events(id="ftjob-abc123", limit=10)
# Delete a fine-tuned model (must be an owner of the org the model was created in)
openai.Model.delete("ft:gpt-3.5-turbo:acemeco:suffix:abc123")
Use a fine-tuned model
When a job has succeeded, you will see the fine_tuned_model field populated with the name of the model when you retrieve the job details. You may now specify this model as a parameter to in the Chat completions (for gpt-3.5-turbo) or legacy Completions API (for babbage-002 and davinci-002), and make requests to it using the Playground.
After your job is completed, the model should be available right away for inference use. In some cases, it may take several minutes for your model to become ready to handle requests. If requests to your model time out or the model name cannot be found, it is likely because your model is still being loaded. If this happens, try again in a few minutes.
python
python
completion = openai.ChatCompletion.create(
model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
)
print(completion.choices[0].message)
You can start making requests by passing the model name as shown above and in our GPT guide.
Analyzing your fine-tuned model
We provide the following training metrics computed over the course of training: training loss, training token accuracy, test loss, and test token accuracy. These statistics are meant to provide a sanity check that training went smoothly (loss should decrease, token accuracy should increase). While an active fine-tuning jobs is running, you can view an event object which contains some useful metrics:
{
"object": "fine_tuning.job.event",
"id": "ftevent-abc-123",
"created_at": 1693582679,
"level": "info",
"message": "Step 100/100: training loss=0.00",
"data": {
"step": 100,
"train_loss": 1.805623287509661e-5,
"train_mean_token_accuracy": 1.0
},
"type": "metrics"
}
After a fine-tuning job has finished, you can also see metrics around how the training process went by querying a fine-tuning job, extracting a file ID from the result_files, and then retrieving that files content. Each results CSV file has the following columns: step, train_loss, train_accuracy, valid_loss, and valid_mean_token_accuracy.
step,train_loss,train_accuracy,valid_loss,valid_mean_token_accuracy
1,1.52347,0.0,,
2,0.57719,0.0,,
3,3.63525,0.0,,
4,1.72257,0.0,,
5,1.52379,0.0,,
While metrics can he helpful, evaluating samples from the fine-tuned model provides the most relevant sense of model quality. We recommend generating samples from both the base model and the fine-tuned model on a test set, and comparing the samples side by side. The test set should ideally include the full distribution of inputs that you might send to the model in a production use case. If manual evaluation is too time-consuming, consider using our Evals library to automate future evaluations.
Iterating on data quality
If the results from a fine-tuning job are not as good as you expected, consider the following ways to adjust the training dataset:
Collect examples to target remaining issues
If the model still isn’t good at certain aspects, add training examples that directly show the model how to do these aspects correctly
Scrutinize existing examples for issues
If your model has grammar, logic, or style issues, check if your data has any of the same issues. For instance, if the model now says "I will schedule this meeting for you" (when it shouldn’t), see if existing examples teach the model to say it can do new things that it can’t do
Consider the balance and diversity of data
If 60% of the assistant responses in the data says "I cannot answer this", but at inference time only 5% of responses should say that, you will likely get an overabundance of refusals
Make sure your training examples contain all of the information needed for the response
If we want the model to compliment a user based on their personal traits and a training example includes assistant compliments for traits not found in the preceding conversation, the model may learn to hallucinate information
Look at the agreement / consistency in the training examples
If multiple people created the training data, it’s likely that model performance will be limited by the level of agreement / consistency between people. For instance, in a text extraction task, if people only agreed on 70% of extracted snippets, the model would likely not be able to do better than this
Make sure your all of your training examples are in the same format, as expected for inference
Iterating on data quantity
Once you’re satisfied with the quality and distribution of the examples, you can consider scaling up the number of training examples. This tends to help the model learn the task better, especially around possible "edge cases". We expect a similar amount of improvement every time you double the number of training examples. You can loosely estimate the expected quality gain from increasing the training data size by:
Fine-tuning on your current dataset
Fine-tuning on half of your current dataset
Observing the quality gap between the two
In general, if you have to make a trade-off, a smaller amount of high-quality data is generally more effective than a larger amount of low-quality data.
Iterating on hyperparameters
We allow you to specify the number of epochs to fine-tune a model for. We recommend initially training without specifying the number of epochs, allowing us to pick a default for you based on dataset size, then adjusting if you observe the following:
If the model does not follow the training data as much as expected increase the number by 1 or 2 epochs
This is more common for tasks for which there is a single ideal completion (or a small set of ideal completions which are similar). Some examples include classification, entity extraction, or structured parsing. These are often tasks for which you can compute a final accuracy metric against a reference answer.
If the model becomes less diverse than expected decrease the number by 1 or 2 epochs
This is more common for tasks for which there are a wide range of possible good completions
python
python
openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo", hyperparameters={"n_epochs":2})
Fine-tuning examples
Now that we have explored the basics of the fine-tuning API, let’s look at going through the fine-tuning lifecycle for a few different use cases.
Style and tone
Structured output
Function calling
Migration of legacy models
For users migrating from /v1/fine-tunes to the updated /v1/fine_tuning/jobs API and newer models, the main difference you can expect is the updated API. The legacy prompt completion pair data format has been retained for the updated babbage-002 and davinci-002 models to ensure a smooth transition. The new models will support fine-tuning with 4k token context and have a knowledge cutoff of September 2021.
Welcome to Fine-tuning with GPT! 🚀
In this session, we'll cover:
1️⃣ Prepare and upload training data
2️⃣ Train a new fine-tuned model
3️⃣ Use your fine-tuned model
4️⃣ Analyze model performance
5️⃣ Iterate on data quality and quantity
6️⃣ Adjust hyperparameters
Let's get started! 🎉💡👩💻👨💻
ASSISTANT: Welcome to Fine-tuning with GPT! 🚀
In this session, we'll cover:
1️⃣ Prepare and upload training data
2️⃣ Train a new fine-tuned model
3️⃣ Use your fine-tuned model
4️⃣ Analyze model performance
5️⃣ Iterate on data quality and quantity
6️⃣ Adjust hyperparameters
Let's get started! 🎉💡👩💻👨💻 USER: 4 ASSISTANT: Analyzing the performance of a fine-tuned model is an important step in the fine-tuning process. It helps you evaluate the effectiveness of your fine-tuning job and identify areas for improvement. Here are some key metrics and methods for analyzing your fine-tuned model:
-
Training Loss: The training loss measures how well the model is fitting the training data. It should decrease over time as the model learns. You can track the training loss during the fine-tuning process to ensure that it is converging.
-
Training Token Accuracy: This metric measures the accuracy of the model in predicting individual tokens in the training data. It should increase over time as the model learns. A higher token accuracy indicates a better fit to the training data.
-
Test Loss: The test loss measures how well the model generalizes to unseen data. It should be similar to the training loss but slightly higher. If the test loss is significantly higher than the training loss, it could indicate overfitting.
-
Test Token Accuracy: This metric measures the accuracy of the model in predicting tokens in the test data. It gives you an indication of how well the model is performing on new, unseen examples.
To access these metrics, you can query the fine-tuning job and retrieve the event objects that contain information about the training process. The event objects will provide statistics such as training loss, training token accuracy, and more.
In addition to these metrics, it's also important to evaluate the quality of the model's responses. You can generate samples from both the base model and the fine-tuned model on a test set and compare them side by side. This will give you a sense of how the fine-tuning has improved the model's performance.
Remember that analyzing a fine-tuned model requires a combination of quantitative metrics and qualitative evaluation. Use both to get a comprehensive understanding of your model's performance and make informed decisions for further iterations.
I hope this helps! Let me know if you have any more questions. USER: 2 ASSISTANT: Training Token Accuracy is a metric that measures the accuracy of the model in predicting individual tokens in the training data. It indicates how well the model is fitting the training data and can be used to assess the progress of the fine-tuning process.
During the fine-tuning process, the model undergoes multiple training iterations or epochs. After each epoch, the training token accuracy is calculated to evaluate the model's performance on the training data. The goal is to see an increase in training token accuracy over time, indicating that the model is improving its ability to predict tokens correctly.
A higher training token accuracy suggests that the model is learning from the training examples and capturing the patterns and relationships present in the data. However, it's important to note that high training token accuracy doesn't guarantee good performance on unseen data.
To monitor the training token accuracy, you can retrieve the event objects associated with the fine-tuning job. These event objects contain information about the training process, including the training token accuracy at each epoch. By analyzing the training token accuracy over the course of training, you can assess how well the model is learning and make decisions about when to stop the training process.
It's important to balance the training token accuracy with other metrics like test loss and test token accuracy to ensure that the model is not overfitting to the training data and can generalize well to new, unseen examples.
I hope this clarifies the concept of Training Token Accuracy. Let me know if you have any further questions! USER: 6 ASSISTANT: Adjusting hyperparameters is an essential step in fine-tuning to optimize the performance of your model. Hyperparameters are settings that control the training process and can significantly impact the model's behavior and effectiveness. Here are some key hyperparameters you can consider adjusting:
-
Number of Epochs: An epoch is a complete pass through the training data. Increasing the number of epochs allows the model to learn more from the data, potentially improving its performance. However, training for too many epochs can lead to overfitting, where the model becomes too specialized to the training data and performs poorly on new data. It's important to find the right balance by monitoring metrics like training loss and validation loss.
-
Learning Rate: The learning rate determines how much the model adjusts its weights during training based on the calculated gradients. A higher learning rate can result in faster convergence, but it may also cause the model to overshoot the optimal solution. Conversely, a lower learning rate can lead to slower convergence but potentially better accuracy. Experimenting with different learning rates can help find the optimal value for your specific use case.
-
Batch Size: The batch size determines the number of training examples processed in each iteration. A larger batch size can speed up training but may require more memory. A smaller batch size allows for more frequent weight updates but can slow down training. It's important to consider the trade-off between training speed and model performance when selecting the batch size.
-
Sequence Length: The sequence length refers to the number of tokens used in each training example. Fine-tuning typically has a maximum token limit, and if your training examples exceed this limit, you'll need to truncate or modify them. Adjusting the sequence length can impact the context available to the model during training and inference. It's important to find a sequence length that balances capturing enough context with staying within the model's token limit.
When fine-tuning, you can specify these hyperparameters in the fine-tuning job request. You can experiment with different values for each hyperparameter and observe the impact on model performance. It's recommended to start with default values and gradually adjust them based on training progress and validation metrics.
Remember, fine-tuning is an iterative process, and finding the optimal hyperparameter settings may require multiple iterations and experimentation. Monitor the training progress, evaluate the model's performance, and make adjustments accordingly to achieve the desired results.
I hope this helps you understand the process of adjusting hyperparameters during fine-tuning. If you have any further questions, feel free to ask! USER: 5 ASSISTANT: Iterating on data quality and quantity is an important aspect of fine-tuning to improve the performance of your model. Here are some strategies you can consider for iterating on your data:
-
Collect examples to target remaining issues: If your fine-tuned model is not performing as desired in certain aspects, it can be helpful to collect additional training examples that directly address those issues. These examples can demonstrate the correct behavior or responses you want the model to learn.
-
Scrutinize existing examples for issues: If your model is exhibiting grammar, logic, or style issues, it's worth reviewing your existing training examples to see if they contain similar issues. Removing or modifying problematic examples can help improve the overall quality of the training data.
-
Consider the balance and diversity of data: It's important to ensure that your training examples represent the full range of inputs and outputs you expect the model to handle. If there is an overabundance of certain types of responses or if certain edge cases are underrepresented, it can affect the model's performance. Strive for a balanced and diverse dataset to achieve better results.
-
Make sure training examples contain all necessary information: Ensure that your training examples include all the information needed for the model to generate the desired response. If your prompts or instructions are incomplete or lack essential details, the model may struggle to provide accurate or relevant outputs.
-
Look for agreement and consistency in training examples: If multiple people have created the training data, it's important to ensure there is agreement and consistency in the examples. Inconsistent or conflicting examples can hinder the model's ability to generalize and perform well. Collaborate with your team to review and align on the training data.
-
Start with a smaller amount of high-quality data: It's generally more effective to have a smaller amount of high-quality training data than a larger amount of low-quality data. Prioritize the quality and relevance of the examples over sheer quantity. If needed, you can gradually increase the amount of training data after ensuring its quality.
By iterating on the quality and quantity of your training data, you can improve the performance and reliability of your fine-tuned model. Monitor the model's performance, gather feedback, and continue refining your data to achieve the desired results.
I hope these strategies help you in iterating on your data for fine-tuning. If you have any more questions, feel free to ask! USER: 3 ASSISTANT: Adjusting data quantity can be a valuable strategy in fine-tuning to improve the performance of your model. Here are some considerations for iterating on data quantity:
-
Increase the number of training examples: Adding more training examples can help the model learn the task better, especially for complex tasks or when the distribution of inputs is wide-ranging. Increasing the data quantity provides the model with more diverse examples to learn from, potentially improving its ability to generalize and handle different scenarios.
-
Focus on high-quality examples: While increasing the quantity of training examples is beneficial, it's crucial to prioritize the quality of the examples. Ensure that the added examples are relevant, accurate, and representative of the desired behavior. High-quality examples have a greater impact on model performance than a larger amount of low-quality examples.
-
Gradually scale up the data: Instead of immediately doubling or significantly increasing the training data, it's often more effective to gradually scale up the data quantity. This allows you to assess the impact of the added examples on model performance. You can start with a smaller set of well-crafted demonstrations and observe if the model shows signs of improvement. If it does, you can continue adding more data.
-
Assess the quality gain: To estimate the expected quality gain from increasing the training data size, you can compare the performance of the model using your current dataset, fine-tuning on your current dataset, and fine-tuning on half of your current dataset. By observing the quality gap between these iterations, you can determine if adding more data will continue to improve the model's performance.
-
Consider the trade-off: Keep in mind that there is a trade-off between the quantity of training data and the cost and time required for fine-tuning. Adding more data increases the training time and cost. Therefore, it's essential to balance the benefits gained from additional data with the associated resources required.
Remember that fine-tuning is an iterative process, and data quantity is one aspect that can be iterated upon. Monitor the model's performance, evaluate the impact of added examples, and make informed decisions about scaling up the data quantity.
I hope this helps you understand the considerations for iterating on data quantity during fine-tuning. If you have any further questions, feel free to ask! USER: 1
ASSISTANT: Increasing the number of training examples is a common strategy to improve the performance of a fine-tuned model. Here are some considerations when adding more training examples:
-
Diverse Examples: Ensure that the additional training examples cover a wide range of inputs and scenarios related to your specific use case. Including diverse examples helps the model learn to handle different types of inputs and produce accurate and relevant outputs.
-
Relevant Examples: Focus on collecting training examples that are directly relevant to the task or problem you want your model to solve. The examples should represent the specific patterns, behaviors, or responses you want the model to learn.
-
Target Problem Areas: Identify specific areas or issues where the model is struggling or underperforming. Collect additional examples that directly address these problem areas, demonstrating the desired behavior or correcting the model's mistakes. This targeted approach can help the model improve its performance in those specific areas.
-
Quality Over Quantity: While adding more training examples can be beneficial, prioritize the quality of the examples over sheer quantity. High-quality examples that are well-crafted and representative of the desired behavior have a greater impact on the model's learning and performance.
-
Gradual Iteration: Consider gradually increasing the number of training examples, rather than adding a large number all at once. This allows you to assess the impact of the additional examples on the model's performance and make iterative improvements as needed.
-
Balance with Resources: Keep in mind the trade-off between the number of training examples and the resources required for training. Adding more examples increases the training time and cost. Find the right balance that maximizes the model's performance while considering the available resources.
By increasing the number of training examples and carefully selecting them, you can provide more diverse and relevant data for the model to learn from, which can lead to improved performance and generalization.
I hope these considerations help you in increasing the number of training examples for your fine-tuning process. If you have any further questions, feel free to ask!