Current default example does not converge #8
Update: even with continual learning disabled while fine-tuning in bf16, the model still does not converge. This showcases two things: continual learning is not the cause, and a bug has been introduced either in the data_handler, in the training process, or in the forward method.
Even with a rolled-back version the model does not converge. I'm currently running a training run using the 2bf64a7 commit to check whether the issue still occurs.
a1df5f6 does converge. Now investigating which part of the code is faulty.
I've identified two issues with the current setup:
Regarding the first issue, a fix is forthcoming. As for the second one, I'll prioritize other tasks for now and defer addressing it.
Quantizing to 4-bit also prevents the model from converging. It might also be a good idea to create a short guide on what works and what doesn't, as I'm already experimenting quite a lot with different configs.
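For context, the 4-bit setup being tested is roughly the following; a minimal sketch using `transformers` + `bitsandbytes`, where the model identifier and loading call are placeholders rather than this repo's actual code:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# nf4 4-bit quantization config (illustrative values, not the repo's exact config)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# "decoder-model-id" is a placeholder, not an actual checkpoint used here
model = AutoModelForCausalLM.from_pretrained(
    "decoder-model-id",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Quantizing this way is what brings VRAM down to the ~18Go figure quoted in the issue description, but as noted above it currently comes at the cost of convergence.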
fix(training): freeze non-linear layers

This will be changed later in order to allow full fine-tuning, but this is enough for the moment. Fixes #8

Signed-off-by: Valentin De Matos <[email protected]>
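The fix above restricts training to the linear layers; a minimal PyTorch sketch of the idea (not the repo's actual implementation):

```python
import torch.nn as nn

def freeze_non_linear_layers(model: nn.Module) -> None:
    # Freeze every parameter first...
    for param in model.parameters():
        param.requires_grad = False
    # ...then re-enable gradients only for nn.Linear modules
    for module in model.modules():
        if isinstance(module, nn.Linear):
            for param in module.parameters():
                param.requires_grad = True

# Example on a toy model: only the Linear weights/biases stay trainable
toy = nn.Sequential(nn.Embedding(10, 8), nn.LayerNorm(8), nn.Linear(8, 2))
freeze_non_linear_layers(toy)
print([name for name, p in toy.named_parameters() if p.requires_grad])
# ['2.weight', '2.bias']
```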
Splitting that issue into two different issues.
* refactor(SLAM): processor initialized by default
* fix(training): freeze non-linear layers. This will be changed later in order to allow full fine-tuning, but this is enough for the moment. Fixes #8
* fix(Decoder): remove deprecated call to _init_processor
* refactor(inference): remove unused var
* refactor(SLAM): remove deprecated comment
* refactor(config): change training output
* refactor(config): reduces batch size in favor of gradient accumulation
* refactor(config): set default example CL ratio to 25%
* refactor(config): change pre-training output
* refactor(config): comment
* refactor(config): deactivate quantization & lora for the moment

Signed-off-by: Valentin De Matos <[email protected]>
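One of the config changes in this PR trades a smaller per-device batch size for gradient accumulation, which keeps the effective batch size while lowering peak VRAM. A hedged sketch with Hugging Face `TrainingArguments`, assuming the training goes through the HF `Trainer`; the repo's actual config keys and values may differ:

```python
from transformers import TrainingArguments

# 2 samples per device * 8 accumulation steps = effective batch size of 16
# (illustrative numbers, not the values used in this repo)
training_args = TrainingArguments(
    output_dir="results/default-example",  # placeholder output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
)
```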
The current default example aims to be able to run on an `A100 40Go`, thus it uses `nf4` quantization and `LoRA`. However, those changes seem to prevent the model from converging. The default example should be re-written to both fit on an `A100 40Go` AND converge (a sketch of the quantization + LoRA setup is included after the plots below).

Quantization + LoRA (VRAM: 18Go)

Fine-tuning (linear only, w/o Continual Learning) fp16 (VRAM: 52Go)

Fine-tuning (linear only, w/ Continual Learning) fp16 (VRAM: 52Go)

Fine-tuning all fp16 (VRAM: 79Go)
