[WIP] Gemma3 support. #2485
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2485
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Was able to run a full finetune on 1B (without multimodal).
Hey Mark, great work! As a sanity check, do you think you could compare vs HF for correctness with a larger sentence? You can follow this as a start: https://gist.github.com/felipemello1/e3f1b1c358e145c7a4d610cf44cca374
Hey Felipe! Yep, sure. It's still WIP until I'm confident (we will do multimodal runs) and some configs are fixed by the Gemma team.
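For reference, a parity check along the lines of the gist above could look like the minimal sketch below. It assumes you already have the torchtune Gemma3 model with converted weights (`tune_model`) plus the matching HF model and tokenizer; the output shape of the torchtune forward is an assumption here, not something verified in this PR.

```python
import torch

@torch.no_grad()
def max_logit_diff(hf_model, tune_model, tokenizer, text: str) -> float:
    """Return the max absolute difference between HF and torchtune logits."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    hf_logits = hf_model(input_ids).logits  # (1, seq_len, vocab_size)
    tune_logits = tune_model(input_ids)     # assumed to return the same shape
    return (hf_logits - tune_logits).abs().max().item()

# Assumed usage:
# print(max_logit_diff(hf_model, tune_model, tokenizer, "The capital of France is Paris because"))
```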
Curious about this: any chance it refers to huggingface/transformers#36683?
Hey @bzz. Exactly, the issue is very similar. Unfortunately, we need more information from the config at the conversion stage, and it is missing from the 4B config.
Thanks for your patience on this one! A couple of high-level comments:

1. The whole text-only 1B vs multimodal 4B+ thing is a bit awkward. I feel like in this PR you wrote a bunch of the SigLIP components but didn't really hook them up to anything, and so the actual builders are all text-only. I think that's fine if there's still stuff up in the air (viz. (2)), but I wonder if we should do something similar to our other multimodal models: provide gemma_decoder and gemma_vision_encoder, then for 4B+ we hook into EarlyFusion and for 1B we just use the decoder directly.
2. Can you share more on some of the blockers around the HF config for 4B+ models you were alluding to? I want to understand how much we should try to hack around things vs just hold off here.
torchtune/modules/transformer.py
Outdated
@@ -120,7 +120,7 @@ def forward(
        # Norm applied before self-attention
        h = self.sa_norm(x)
        attn_out = self.attn(h, h, mask=mask, input_pos=input_pos)
nit: just lint this file to remove the whitespace changes
@@ -0,0 +1,114 @@
# Config for multi-device QLoRA finetuning in lora_finetune_single_device.py
Need to rename this file: 27_qlora_single_device.yaml -> 27B_qlora_single_device.yaml
# Tokenizer
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /tmp/gemma-3-4=12b-it/tokenizer.model
I think this path needs to be changed?
recipes/configs/gemma3/27B_lora.yaml
Outdated
checkpoint_files: [
  model-00001-of-00012.safetensors,
  model-00002-of-00012.safetensors,
  model-00003-of-00012.safetensors,
  model-00004-of-00012.safetensors,
  model-00005-of-00012.safetensors,
  model-00006-of-00012.safetensors,
  model-00007-of-00012.safetensors,
  model-00008-of-00012.safetensors,
  model-00009-of-00012.safetensors,
  model-00010-of-00012.safetensors,
  model-00011-of-00012.safetensors,
  model-00012-of-00012.safetensors,
]
For ones that are a bit longer, you can also do this. Personally I prefer it whenever there are >5 files, but no strong preference here
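The exact alternative being referred to isn't shown in this excerpt; the sketch below just illustrates the general idea of generating a long list of sharded checkpoint filenames from a format string instead of spelling out every entry. The helper name and signature are made up for illustration, not a torchtune API.

```python
def expand_checkpoint_files(filename_format: str, num_files: int, width: int = 5) -> list[str]:
    """Expand e.g. "model-{}-of-{}.safetensors" into the explicit shard list."""
    total = str(num_files).zfill(width)
    return [filename_format.format(str(i).zfill(width), total) for i in range(1, num_files + 1)]

# expand_checkpoint_files("model-{}-of-{}.safetensors", 12)
# -> ["model-00001-of-00012.safetensors", ..., "model-00012-of-00012.safetensors"]
```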
# Tokenizer
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /tmp/gemma-3-4=12b-it/tokenizer.model
need to update this one too
q_norm=GemmaRMSNorm(head_dim, eps=norm_eps),
attn_dropout=attn_dropout,
# use global attention only on every 6th layer, according to the tech report
sliding_window_size=sliding_window_size if (layer_idx % 6) != 0 or layer_idx == 0 else None,
Now that both this and Llama4 are doing this interleaving of local and global attention layers, we should think about whether there's a more general abstraction we can use to make this easier. (No need to worry about it for this PR though)
Let's do it in a follow up
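As a sketch of what such a shared abstraction might look like (not an existing torchtune API; the function name and signature are assumptions), the per-layer decision above could be pulled into one helper parameterized by the interleaving pattern:

```python
from typing import Optional

def sliding_window_for_layer(
    layer_idx: int,
    sliding_window_size: int,
    global_every_n: int = 6,
) -> Optional[int]:
    """Return the sliding window size for this layer, or None for global attention.

    Mirrors the condition in the snippet above: every ``global_every_n``-th layer
    (except layer 0) uses global attention; all other layers use local
    sliding-window attention.
    """
    is_global = layer_idx % global_every_n == 0 and layer_idx != 0
    return None if is_global else sliding_window_size
```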
local_rope = RotaryPositionalEmbeddings(dim=head_dim, max_seq_len=max_seq_len, base=local_rope_base)
global_rope = RotaryPositionalEmbeddings(dim=head_dim, max_seq_len=max_seq_len, base=global_rope_base)
I also see that they do linear RoPE scaling by a factor of 8, is that right? If so do we need to make any change here?
We have done this (according to the tech report):
"We increase RoPE base frequency from 10k to 1M on global self-attention layers, and keep the frequency of the local layers at 10k."
But yes, I'm not sure about this part:
"We find a scaling factor of 8 to work well in practice."
Let me investigate this a little bit further.
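For illustration only (not the final torchtune change): with linear RoPE scaling ("position interpolation"), position indices are divided by the scale factor before computing the rotary angles. Applied to the global layers with base 1M and factor 8, while local layers keep base 10k with no scaling. The dims and sequence length below are placeholders.

```python
import torch

def rope_angles(seq_len: int, dim: int, base: float, scale_factor: float = 1.0) -> torch.Tensor:
    """Rotary angles with optional linear position scaling."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(seq_len).float() / scale_factor  # linear scaling step
    return torch.outer(positions, inv_freq)  # (seq_len, dim // 2)

local_angles = rope_angles(seq_len=4096, dim=256, base=10_000)                      # local layers
global_angles = rope_angles(seq_len=4096, dim=256, base=1_000_000, scale_factor=8)  # global layers
```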
self.final_norm = nn.LayerNorm(embed_dim, layer_norm_eps)
self.avg_pool = SiglipAveragePooling()

@torch.inference_mode
I saw this in the reference implementation -- do you know why they do it?
They do not update the SigLIP model during training and post-training; the same pre-trained vision model is used for all model sizes.
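A minimal sketch of keeping the vision tower frozen, consistent with the point above that SigLIP is not updated; `vision_encoder` is a placeholder for the actual module, not a torchtune builder.

```python
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Disable gradient updates (and dropout/batch-norm updates) for a frozen encoder."""
    module.eval()
    for p in module.parameters():
        p.requires_grad_(False)
    return module

# vision_encoder = freeze(vision_encoder)  # assumed usage
```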
width = int(seq_len ** 0.5)
if width * width != seq_len:
    raise ValueError(
        f"Sequence length {seq_len} is not a perfect square. Cannot reshape to a square image."
I saw that Gemma3 expects square images of a fixed size. Where are we doing the image processing to make that happen?
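For context, the perfect-square check above guards a reshape like the one sketched here: a flat sequence of patch embeddings is folded back into a square spatial grid (e.g. before pooling). The batch size, sequence length, and embed dim below are placeholders, not Gemma3's actual values.

```python
import torch

tokens = torch.randn(2, 4096, 1152)                        # (batch, seq_len, embed_dim)
width = int(tokens.shape[1] ** 0.5)                        # 64, since 64 * 64 == 4096
grid = tokens.reshape(2, width, width, tokens.shape[-1])   # (batch, height, width, embed_dim)
```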
See discussion here: https://github.com/pytorch/torchtune/pull/1835#discussion_r1803410251
"""

_GEMMA3_FROM_HF = {
To check my understanding here: the 1B model is text-only, 4B+ are all multimodal. This means that the HF weights for the 1B model will look like what you've given here, but the weights for 4B+ models would be language_model.{keys you have here}. And this is why they provide Gemma3ForConditionalGeneration and Gemma3ForCausalLM as separate classes for the different model sizes. But on our side, I think there are two options:

a) just include the vision keys in the mapping (they should be ignored for the 1B model), then add an optional prefix to every key in the mapping
b) provide a different model type for 4B+ Gemma3 models (e.g. Gemma3VLM)

Personally I think (a) is preferable if it's feasible.

You also mentioned there were some difficulties around getting the information you need for the 4B+ models from the config. Is there a hard blocker here? Naively looking at what you added in _checkpointer.py it seems to me that it should work, but maybe I am missing something obvious. (Maybe we can move that logic into a utility in this file and import it, so as not to clutter up the checkpointer code -- something like _infer_gemma3_attn_data_from_config.)
Assuming that (a) here means the first option, not the second: the problem is that it looks like we can't do it like this, because of the different structure of the checkpoints :/ As for the config.json problem, check this out: https://huggingface.co/google/gemma-3-4b-it/discussions/14
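For concreteness, option (a) from the review comment above could look roughly like the sketch below (the reply notes that the checkpoint structure makes this harder in practice, so treat it purely as an illustration of the proposal). The mapping keys here are placeholders, not the real Gemma3 mapping.

```python
from typing import Optional

_GEMMA3_FROM_HF_SKETCH = {
    "model.embed_tokens.weight": "tok_embeddings.weight",
    "model.norm.weight": "norm.scale",
}

def remap_hf_key(hf_key: str, is_multimodal: bool) -> Optional[str]:
    """Map an HF key to a torchtune key; return None for unmapped (e.g. vision) keys."""
    prefix = "language_model." if is_multimodal else ""
    stripped = hf_key[len(prefix):] if hf_key.startswith(prefix) else hf_key
    return _GEMMA3_FROM_HF_SKETCH.get(stripped)
```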
Context
What is the purpose of this PR? Is it to add a new feature, fix a bug, or update tests and/or documentation?
Please link to any issues this PR addresses.
Changelog
What are the changes made in this PR?
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
run pre-commit hooks and linters (installed via pre-commit install)
run unit tests via pytest tests
run recipe tests via pytest tests -m integration_test
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.