Current default example does not converge #8


Closed
Thytu opened this issue Mar 22, 2024 · 7 comments · Fixed by #13
Assignees: Thytu
Labels: priority: critical (Must be resolved ASAP)

Comments

@Thytu (Owner) commented Mar 22, 2024

The current default example aims to be runnable on an A100 40 GB, so it uses nf4 quantization and LoRA. However, those changes seem to prevent the model from converging.

The default example should be rewritten to both fit on an A100 40 GB AND converge. The configurations tested so far are listed below, with their loss curves; a sketch of the quantized LoRA setup follows the list.

Quantization + LoRA (VRAM: 18 GB)
(screenshot: training-loss curve)

Fine-tuning (linear only, w/o Continual Learning) fp16 (VRAM: 52 GB)
(screenshot: training-loss curve)

Fine-tuning (linear only, w/ Continual Learning) fp16 (VRAM: 52 GB)
(screenshot: training-loss curve)

Fine-tuning all fp16 (VRAM: 79 GB)
(screenshot: training-loss curve)
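
For context, a minimal sketch (assuming a Hugging Face transformers + bitsandbytes + peft stack) of the kind of nf4 + LoRA setup the default example relies on. The checkpoint name, LoRA rank, and target modules below are illustrative assumptions, not SMIT's actual config:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# nf4 4-bit quantization so the base model fits on an A100 40 GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# "decoder-model-name" is a placeholder, not SMIT's actual checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "decoder-model-name",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters so only a small set of extra weights is trained.
lora_config = LoraConfig(
    r=16,                                 # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```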

@Thytu (Owner) commented Mar 23, 2024

Update: even after no longer applying continual learning while fine-tuning in bf16, the model still does not converge.

(screenshot: training-loss curve)

This means a bug has been introduced either in the data_handler, in the training process, or in the forward method.

This highlights two things:

  1. A test suite should be written and integrated into SMIT ("SMIT should integrate a test suite" #11); a minimal convergence check is sketched below this list.
  2. Issue "Split metrics by modality during evaluation" #6 should be resolved ASAP, as it would help diagnose this kind of issue.
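
As a starting point for such a suite, here is a minimal, self-contained sketch of an overfit-one-batch check. It uses a toy `nn.Linear` stand-in since I'm not assuming SMIT's model-construction API here; the idea is that a healthy data pipeline, forward pass, and training loop should drive the loss down sharply on a handful of samples:

```python
import torch
import torch.nn as nn

def test_overfits_single_batch():
    # Toy stand-in for the real model; in SMIT this would instead be
    # built from the default example config.
    torch.manual_seed(0)
    model = nn.Linear(16, 4)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

    first_loss = None
    for _ in range(50):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        if first_loss is None:
            first_loss = loss.item()
        loss.backward()
        optimizer.step()

    # A silent regression in data handling, forward, or training
    # typically shows up here as a flat loss.
    assert loss.item() < 0.5 * first_loss
```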

Thytu added the priority: critical (Must be resolved ASAP) label on Mar 23, 2024
@Thytu (Owner) commented Mar 23, 2024

Even after rolling back to an earlier version of data_handler, the model still doesn't converge, which suggests an issue in either the forward methods or the training algorithm.

(screenshot: training-loss curve)

I'm currently running a training run at commit 2bf64a7 to see if the issue still occurs.

Thytu self-assigned this on Mar 23, 2024
@Thytu (Owner) commented Mar 24, 2024

While 2bf64a7 does seem to converge, it still takes an abnormally long time.

(screenshot: training-loss curve)

Now testing a rollback to a1df5f6.

@Thytu (Owner) commented Mar 24, 2024

a1df5f6 does converge.

(screenshot: training-loss curve)

Now investigating which part of the code is faulty.

@Thytu (Owner) commented Mar 26, 2024

I've identified two issues with the current setup:

  1. It appears crucial to freeze the non-linear layers of the decoder (see the sketch after this list).
  2. There's an unresolved bug: previously, training the entire model, including the non-linear layers, was possible, but it's no longer feasible in the latest version.

A fix for the first issue is forthcoming. As for the second, I'll prioritize other tasks for now and defer addressing it.
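
For the first point, a minimal sketch of what freezing the non-linear layers could look like in PyTorch; `freeze_non_linear_layers` is a hypothetical helper, not SMIT's actual implementation:

```python
import torch.nn as nn

def freeze_non_linear_layers(decoder: nn.Module) -> None:
    """Keep only nn.Linear weights trainable; freeze everything else."""
    # Freeze every parameter first...
    for param in decoder.parameters():
        param.requires_grad = False
    # ...then re-enable gradients for the linear layers only.
    for module in decoder.modules():
        if isinstance(module, nn.Linear):
            for param in module.parameters():
                param.requires_grad = True

# Usage (hypothetical attribute name):
# freeze_non_linear_layers(model.decoder)
```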

@Thytu (Owner) commented Mar 26, 2024

Quantization to 4-bit also prevents the model from converging. It might also be a good idea to write a short guide on what works and what doesn't, as I'm already experimenting quite a lot with different configs.

Thytu added a commit that referenced this issue Mar 26, 2024
This will later be changed in order to allow full fine-tuning,
but this is enough for the moment. Fixes #8

Signed-off-by: Valentin De Matos <[email protected]>
@Thytu (Owner) commented Mar 26, 2024

Splitting this issue into two separate issues:

  1. Fixing the default example so that it converges (this issue)
  2. Making the default example GPU-poor friendly

Thytu added a commit that referenced this issue Mar 26, 2024
* refactor(SLAM): processor initialized by default

Signed-off-by: Valentin De Matos <[email protected]>

* fix(training): freeze non-linear layers

This will later be changed in order to allow full fine-tuning,
but this is enough for the moment. Fixes #8

Signed-off-by: Valentin De Matos <[email protected]>

* fix(Decoder): remove deprecated call to _init_processor

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(inference): remove unused var

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(SLAM): remove deprecated comment

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(config): change training output

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(config): reduces batch size in favor of gradient accumulation

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(config): set default example CL ratio to 25%

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(config): change pre-training output

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(config): comment

Signed-off-by: Valentin De Matos <[email protected]>

* refactor(config): deactivate quantization & lora for the moment

Signed-off-by: Valentin De Matos <[email protected]>

---------

Signed-off-by: Valentin De Matos <[email protected]>
Thytu closed this as completed in #13 on Mar 26, 2024