What is the training data format of commitpack-ft and oasst when finetune codegeex2? #9
We format samples into a simple Q&A format for OctoCoder & OctoGeeX. For CommitPackFT:
For OASST:
So we do not rely on any special tokens. We only use those special tokens for pretraining / fine-tuning on StarCoder & SantaCoder in the appendix. Let me know if something is unclear!
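To make the Q&A formatting above concrete, here is a minimal sketch of how a CommitPackFT record and an OASST pair could be flattened into a plain "Question/Answer" training string with no special tokens. The field names (`old_contents`, `message`, `new_contents`, `prompt`, `response`) and the exact template are assumptions for illustration, not the maintainers' confirmed implementation:

```python
def format_commitpackft(sample: dict) -> str:
    """Flatten a CommitPackFT-style record into a plain Q&A string.

    Assumed schema: old code in 'old_contents', the commit message in
    'message', and the post-commit code in 'new_contents'.
    """
    question = f"{sample['old_contents']}\n{sample['message']}"
    return f"Question: {question}\n\nAnswer: {sample['new_contents']}"


def format_oasst(sample: dict) -> str:
    """Flatten an OASST-style prompt/response pair into the same Q&A string."""
    return f"Question: {sample['prompt']}\n\nAnswer: {sample['response']}"


example = {
    "old_contents": "def add(a, b):\n    return a - b",
    "message": "Fix add to return the sum",
    "new_contents": "def add(a, b):\n    return a + b",
}
print(format_commitpackft(example))
```

The point is that the model only ever sees ordinary text like `Question: ... Answer: ...`, so no `<commit_before>`-style tokens need to exist in the tokenizer vocabulary.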
Thank you!
I have two other questions:
Sorry to bother you again:
Any questions are very welcome!
In your paper, CommitPack is trained with the following format:
Question: <commit_before>xxx<commit_msg>
Answer: <commit_after>xxx
but in CodeGeeX2's vocabulary, no special tokens like <commit_before> or <commit_msg> were added. I downloaded the OctoGeeX checkpoint and ran prediction with this format, and the answers were wrong.
Could you explain in more detail how you convert CommitPackFT and OASST into the fine-tuning data format?
(What is the input and what is the output?)
Thanks