
Regarding training your own model #34

Open
Aniketto16 opened this issue Mar 6, 2024 · 1 comment

Comments

@Aniketto16

Hello!
Thank you for your great work. I have the following question:

I have my own Elyza7B checkpoint that I want to finetune on a VQA task. If I follow the LLaVA training scheme closely, I think we need to perform projection pretraining and then finetuning on the chat task.
From the documentation I don't understand which dataset I should use: should I use llava_ja directly, or first mi3t and then llava_ja?
Also, what is the difference between the instruct and normal datasets?
Could you clarify? That would be really helpful!

Thank you so much again, looking forward to your reply!!

@Ino-Ichan
Contributor

@Aniketto16
Hi!

Thank you very much for your interest and for the kind words about our work! Regarding your question about finetuning your Elyza7B checkpoint for the VQA task, here are some clarifications and recommendations.

Firstly, adopting a training scheme similar to LLaVA, where projection pretraining is followed by comprehensive LLM finetuning, could indeed be effective. However, we currently do not have a publicly available Japanese dataset for LLaVA pretraining. This means that directly mimicking the LLaVA training approach is not feasible at the moment (we are working on this, so please stay tuned for future updates).

From our experiments, we've found that finetuning the projection and the LLM together, without separate projection pretraining, can also yield satisfactory results. This approach updates the full parameters of both components during finetuning, and we recommend giving it a try.
See here.
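As a rough sketch of that setup (the class and function names below are hypothetical and not taken from the actual repository), joint finetuning just means the vision encoder stays frozen while gradients flow through both the projection and the LLM:

```python
import torch.nn as nn


class VLMSketch(nn.Module):
    """Hypothetical LLaVA-style model: vision encoder -> projection -> LLM."""

    def __init__(self, vision_encoder: nn.Module, projection: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projection = projection
        self.llm = llm


def set_trainable(model: VLMSketch) -> None:
    """Freeze the vision encoder; train projection and LLM together (full parameters)."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.projection.parameters():
        p.requires_grad = True
    for p in model.llm.parameters():
        p.requires_grad = True
```

The optimizer would then be built only from parameters with `requires_grad=True`, so the vision encoder contributes features but receives no updates.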

Of course, pretraining on a Japanese Vision-Language dataset before proceeding to finetune the LLM on a specific VQA dataset is another strategy that is likely to be effective.

Regarding the distinction between "normal" and "instruct" datasets, the key difference lies in how the loss is calculated. For "normal" datasets, the loss is calculated across all input texts, whereas for "instruct" datasets, the loss is specifically calculated for the model's answers only. Instruction tuning typically utilizes "instruct" datasets, aiming to refine the model's ability to answer questions. We provide implementations for both types as a reference. While training with "normal" datasets can also be successful, it might lead to a model that tends to replicate human-like conversational patterns.
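To make that loss difference concrete, here is a minimal sketch (the helper name is mine, not the repository's): for causal-LM training, masking a label position with `-100` makes `torch.nn.CrossEntropyLoss` ignore it, so an "instruct"-style dataset masks the prompt tokens and keeps loss only on the answer tokens.

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default


def build_labels(input_ids: torch.Tensor, answer_start: int, instruct: bool) -> torch.Tensor:
    """Build labels for causal-LM loss.

    "normal" dataset: loss on every token, so labels == input_ids.
    "instruct" dataset: loss only on the answer, so prompt positions
    (everything before `answer_start`) are masked with IGNORE_INDEX.
    """
    labels = input_ids.clone()
    if instruct:
        labels[:answer_start] = IGNORE_INDEX
    return labels
```

For example, with `input_ids = [q1, q2, q3, a1, a2]` and `answer_start=3`, the "instruct" labels become `[-100, -100, -100, a1, a2]`, while the "normal" labels are unchanged.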

We hope this clarifies your queries and aids in your finetuning endeavors. Please feel free to reach out if you have further questions. We're excited to see the advancements you'll make with your project!
