Current default example does not converge #8


Closed
Thytu opened this issue Mar 22, 2024 · 7 comments · Fixed by #13
Assignees: Thytu
Labels: priority: critical (Must be resolved ASAP)

Comments

@Thytu (Owner) commented Mar 22, 2024

The current default example aims to be runnable on an A100 40 GB, so it uses nf4 quantization and LoRA. However, those changes seem to prevent the model from converging.

The default example should be rewritten to both fit on an A100 40 GB AND converge. The configurations tested so far are listed below, with their loss curves; a sketch of the quantized LoRA setup follows the list.

Quantization + LoRA (VRAM: 18 GB)
(screenshot: training-loss curve)

Fine-tuning (linear only, w/o Continual Learning) fp16 (VRAM: 52 GB)
(screenshot: training-loss curve)

Fine-tuning (linear only, w/ Continual Learning) fp16 (VRAM: 52 GB)
(screenshot: training-loss curve)

Fine-tuning all fp16 (VRAM: 79 GB)
(screenshot: training-loss curve)
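
For context, a minimal sketch (assuming a Hugging Face transformers + bitsandbytes + peft stack) of the kind of nf4 + LoRA setup the default example relies on. The checkpoint name, LoRA rank, and target modules below are illustrative assumptions, not SMIT's actual config:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# nf4 4-bit quantization so the base model fits on an A100 40 GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# "decoder-model-name" is a placeholder, not SMIT's actual checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "decoder-model-name",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters so only a small set of extra weights is trained.
lora_config = LoraConfig(
    r=16,                                 # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```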

@Thytu (Owner) commented Mar 23, 2024

Update: even after no longer applying continual learning while fine-tuning in bf16, the model still does not converge.

(screenshot: training-loss curve)

This means a bug has been introduced either in the data_handler, in the training process, or in the forward method.

This highlights two things:

  1. A test suite should be written and integrated into SMIT ("SMIT should integrate a test suite" #11); a minimal convergence check is sketched below this list.
  2. Issue "Split metrics by modality during evaluation" #6 should be resolved ASAP, as it would help diagnose this kind of issue.
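
As a starting point for such a suite, here is a minimal, self-contained sketch of an overfit-one-batch check. It uses a toy `nn.Linear` stand-in since I'm not assuming SMIT's model-construction API here; the idea is that a healthy data pipeline, forward pass, and training loop should drive the loss down sharply on a handful of samples:

```python
import torch
import torch.nn as nn

def test_overfits_single_batch():
    # Toy stand-in for the real model; in SMIT this would instead be
    # built from the default example config.
    torch.manual_seed(0)
    model = nn.Linear(16, 4)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

    first_loss = None
    for _ in range(50):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        if first_loss is None:
            first_loss = loss.item()
        loss.backward()
        optimizer.step()

    # A silent regression in data handling, forward, or training
    # typically shows up here as a flat loss.
    assert loss.item() < 0.5 * first_loss
```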

Thytu added the priority: critical (Must be resolved ASAP) label on Mar 23, 2024
@Thytu (Owner) commented Mar 23, 2024

Even after rolling back to an earlier version of data_handler, the model still doesn't converge, which suggests an issue in either the forward methods or the training algorithm.

(screenshot: training-loss curve)

I'm currently running a training run at commit 2bf64a7 to see if the issue still occurs.

Thytu self-assigned this on Mar 23, 2024
@Thytu (Owner) commented Mar 24, 2024

While 2bf64a7 does seem to converge, it still takes an abnormally long time.

(screenshot: training-loss curve)

Now testing a rollback to a1df5f6.

@Thytu (Owner) commented Mar 24, 2024

a1df5f6 does converge.

(screenshot: training-loss curve)

Now investigating which part of the code is faulty.

@Thytu (Owner) commented Mar 26, 2024

I've identified two issues with the current setup:

  1. It appears crucial to freeze the non-linear layers of the decoder (see the sketch after this list).
  2. There's an unresolved bug: previously, training the entire model, including the non-linear layers, was possible, but it's no longer feasible in the latest version.

A fix for the first issue is forthcoming. As for the second, I'll prioritize other tasks for now and defer addressing it.
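
For the first point, a minimal sketch of what freezing the non-linear layers could look like in PyTorch; `freeze_non_linear_layers` is a hypothetical helper, not SMIT's actual implementation:

```python
import torch.nn as nn

def freeze_non_linear_layers(decoder: nn.Module) -> None:
    """Keep only nn.Linear weights trainable; freeze everything else."""
    # Freeze every parameter first...
    for param in decoder.parameters():
        param.requires_grad = False
    # ...then re-enable gradients for the linear layers only.
    for module in decoder.modules():
        if isinstance(module, nn.Linear):
            for param in module.parameters():
                param.requires_grad = True

# Usage (hypothetical attribute name):
# freeze_non_linear_layers(model.decoder)
```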

@Thytu (Owner) commented Mar 26, 2024

Quantization to 4-bit also prevents the model from converging. It might also be a good idea to write a short guide on what works and what doesn't, as I'm already experimenting quite a lot with different configs.

Thytu added a commit that referenced this issue Mar 26, 2024
This will later be changed in order to allow full fine-tuning,
but this is enough for the moment. Fixes #8

Signed-off-by: Valentin De Matos <[email protected]>
@Thytu (Owner) commented Mar 26, 2024

Splitting this issue into two separate issues:

  1. Fixing the default example so that it converges (this issue)
  2. Making the default example GPU-poor friendly

Thytu added a commit that referenced this issue Mar 26, 2024
* refactor(SLAM): processor initialized by default

Signed-off-by: Valentin De Matos <[email protected]>

* fix(training): freeze non-linear layers

This will later be changed in order to allow full fine-tuning,
but this is enough for the moment. Fixes #8

Signed-off-by: Valentin De Matos <[email protected]>

* fix(Decoder): remove deprecated call to _init_processor

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(inference): remove unused var

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(SLAM): remove deprecated comment

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(config): change training output

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(config): reduces batch size in favor of gradient accumulation

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(config): set default example CL ratio to 25%

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(config): change pre-training output

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(config): comment

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(config): deactivate quantization & lora for the moment

Signed-off-by: Valentin De Matos <[email protected]>

---------

Signed-off-by: Valentin De Matos <[email protected]>
Thytu closed this as completed in #13 on Mar 26, 2024