
Refactor losses instantiation and chunked CE #2531


Open · wants to merge 20 commits into base: main

Conversation

@felipemello1 (Contributor) commented Mar 27, 2025

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

IMPORTANT: Recipes NO LONGER work with the older version of ChunkedCrossEntropy, because we no longer expect the transformer to chunk the outputs.

Problem:

  1. We have seen many chunked losses being added to torchtune. The current setup puts the chunking burden on the model.
  2. Users are interested in using losses that take model.output.weight as an input, e.g. Liger losses.

Solution:

  1. Enable the recipe to call loss(weight, input, targets)
  2. Reimplement ChunkedCE so that chunking and projection happen in the loss.
  3. Add a protocol so that new losses can follow the same pattern (see the sketch below).
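
A minimal sketch of the intended call pattern (hedged: the protocol and class names below appear in the PR, but the bodies are illustrative assumptions, not the PR's exact code):

from typing import Protocol

import torch
import torch.nn.functional as F
from torch import nn


class SFTLossWithProjection(Protocol):
    """Losses that receive the output projection weight and perform the projection themselves."""

    use_output_proj_in_loss: bool

    def forward(
        self, weight: torch.Tensor, outputs: torch.Tensor, targets: torch.Tensor
    ) -> torch.Tensor:
        ...


class ChunkedCrossEntropyLoss(nn.Module, SFTLossWithProjection):
    """Sketch: project hidden states to logits chunk by chunk inside the loss."""

    use_output_proj_in_loss: bool = True

    def __init__(self, num_output_chunks: int = 8, ignore_index: int = -100):
        super().__init__()
        self.num_output_chunks = num_output_chunks
        self.ignore_index = ignore_index

    def forward(self, weight, outputs, targets):
        # outputs: [b, s, d] hidden states, weight: [v, d], targets: [b, s]
        hidden = outputs.reshape(-1, outputs.size(-1))
        targets = targets.reshape(-1)
        total_loss = torch.zeros((), device=hidden.device)
        total_elements = (targets != self.ignore_index).sum()
        for h_chunk, t_chunk in zip(
            hidden.chunk(self.num_output_chunks),
            targets.chunk(self.num_output_chunks),
        ):
            # project only this chunk; its logits are freed after the iteration
            logits = F.linear(h_chunk, weight)
            total_loss = total_loss + F.cross_entropy(
                logits.float(), t_chunk, reduction="sum", ignore_index=self.ignore_index
            )
        return total_loss / total_elements


# In the recipe, roughly:
#   weight = self._model.get_output_weight()
#   loss = self._loss_fn(weight, hidden_states, labels)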

PROFILING: https://drive.google.com/drive/folders/1jHOCuOF74F9lmmJv7wxbcK-i_wtB2stf?usp=sharing

Changelog

  • Updated full_distributed and lora_distributed
  • Tested with lora llama 3.2 distributed (TiedLinear)
  • Implemented new ChunkedCE

TODO: once approved, roll this out to the other recipes and losses, and update the configs.

Test

ChunkedCrossEntropyLoss

tune run --nproc_per_node 2 lora_finetune_distributed --config /data/users/felipemello/torchtune/recipes/configs/llama3_2/1B_lora.yaml \
metric_logger=torchtune.training.metric_logging.WandBLogger \
dataset.packed=True \
dataset.split=train[:50%] \
tokenizer.max_seq_len=4096 \
gradient_accumulation_steps=1 \
batch_size=4 \
max_steps_per_epoch=20 \
compile=True \
use_output_weight_in_loss=True \
loss=torchtune.modules.loss.sft_losses.ChunkedCrossEntropyLoss


To reproduce

# Fork https://github.com/pytorch/torchtune first, then:
git clone https://github.com/<YOUR_GITHUB_USER>/torchtune.git

cd torchtune
conda create -n torchtune python=3.11
conda activate torchtune
pip install --pre --upgrade torch torchvision torchao --index-url https://download.pytorch.org/whl/nightly/cu124
pip install -e ".[dev]"
pre-commit install

git remote add felipemello1 https://github.com/felipemello1/torchtune.git
git checkout -b loss_refactor felipemello1/loss_refactor


pytorch-bot bot commented Mar 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2531

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Mar 27, 2025
@felipemello1 mentioned this pull request on Mar 27, 2025
@felipemello1 changed the title from "Refactor losses installation and chunked CE" to "Refactor losses instantiation and chunked CE" on Mar 27, 2025
@felipemello1 marked this pull request as draft on March 31, 2025 14:39
@felipemello1 marked this pull request as ready for review on March 31, 2025 22:16
return total_loss / total_elements


class ChunkedCrossEntropywithAutograd(torch.autograd.Function):
Contributor

Why did you want to add these Autograd versions? How does this help you test?

Contributor Author

This version is based on Horace's code from a few months back. In this implementation, the chunks are not held in memory; he wrote it to show that you don't need Triton.

I don't want to keep it in torchtune, because it would be hard to reuse for KD/RL losses. This is more of a reference for the compile folks: they are working on enabling the chunking under compile to match the autograd version's memory performance.
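
For reference, a minimal sketch of that idea (an assumed reconstruction of the general approach, not the PR's actual ChunkedCrossEntropywithAutograd): per-chunk gradients are computed inside forward so each chunk's logits can be freed immediately, and backward only rescales the saved gradients.

import torch
import torch.nn.functional as F


class ChunkedCEFunction(torch.autograd.Function):
    # Sketch: the full [num_tokens, vocab] logits tensor is never materialized.
    # Each chunk's gradient is computed inside forward, so its logits can be
    # freed immediately; backward only rescales the saved gradients.

    @staticmethod
    def forward(ctx, hidden, weight, targets, num_chunks=8, ignore_index=-100):
        # hidden: [N, d], weight: [vocab, d], targets: [N]
        grad_hidden = torch.zeros_like(hidden)
        grad_weight = torch.zeros_like(weight)
        total_loss = torch.zeros((), device=hidden.device)
        n_valid = (targets != ignore_index).sum().clamp(min=1)

        chunk_size = (hidden.size(0) + num_chunks - 1) // num_chunks
        for h, t, gh in zip(
            hidden.split(chunk_size),
            targets.split(chunk_size),
            grad_hidden.split(chunk_size),  # views, so writing gh fills grad_hidden
        ):
            with torch.enable_grad():
                h_ = h.detach().requires_grad_(True)
                w_ = weight.detach().requires_grad_(True)
                logits = h_ @ w_.T  # only this chunk's logits exist at a time
                loss = F.cross_entropy(
                    logits.float(), t, reduction="sum", ignore_index=ignore_index
                )
                dh, dw = torch.autograd.grad(loss, (h_, w_))
            gh.copy_(dh)
            grad_weight += dw
            total_loss += loss.detach()
            # the chunk's logits and graph go out of scope here and can be freed

        ctx.save_for_backward(grad_hidden, grad_weight)
        ctx.n_valid = n_valid
        return total_loss / n_valid

    @staticmethod
    def backward(ctx, grad_output):
        grad_hidden, grad_weight = ctx.saved_tensors
        scale = grad_output / ctx.n_valid
        # grads for (hidden, weight, targets, num_chunks, ignore_index)
        return grad_hidden * scale, grad_weight * scale, None, None, None


# Usage, roughly: loss = ChunkedCEFunction.apply(hidden.view(-1, d), weight, labels.view(-1))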

Contributor Author

Without autograd: [screenshot]

Contributor Author

With autograd: [screenshot]

Contributor

Can you put a comment in the code to that effect?

@SalmanMohammadi (Collaborator)

@felipemello1 this is awesome. Out of curiosity, did you happen to benchmark against the existing CEWithChunkedOutputLoss?

I wonder if we could simplify the configuration further by removing the need for the user to also specify use_output_weight_in_loss? Could we define a

class BaseLoss(Protocol):
    is_chunked: bool

and do

-                if self.use_output_weight_in_loss:
+                if self.loss_fn.is_chunked:
                    weight = self._model.get_output_weight()
                    current_loss = self._loss_fn(weight, outputs, labels)
                else:
                    labels = labels.reshape(-1)
                    logits = logits.reshape(-1, logits.size(-1))
                    outputs = outputs.reshape(-1, outputs.size(-1))
                    current_loss = self._loss_fn(outputs, labels)

It would require either 1) requiring that all losses use this protocol (which, tbh, I wouldn't be opposed to as we start to support more custom losses without needing to modify recipes), or 2) doing a hasattr check on self._loss_fn and relying on an identifying field on just the chunked losses.

wdyt?

@felipemello1 (Contributor Author) commented Apr 4, 2025

I wonder if we could simplify the configuration further by removing the need for the user to also specify use_output_weight_in_loss?

@SalmanMohammadi, I thought about it and even implemented it, but then realized that it would be hard to support 3rd-party libraries unless we create some sort of loss adapter, which we may need to do anyway, because not all libraries follow the pattern (weight, input, label). They may follow (label, weight, input), for example.

The loss adapter could be something like:

config.yaml

loss:
  _component_: torchtune.loss.lossadapter
  loss: path.to.loss
  requires_weight_input: True
  input_order: ["label", "weight", "input"]
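
A rough sketch of what such an adapter could look like (hypothetical: torchtune.loss.lossadapter and these parameter names are illustrative proposals, not something that exists in this PR):

from typing import Sequence

import torch
from torch import nn


class LossAdapter(nn.Module):
    # Hypothetical wrapper that adapts a third-party loss to the
    # (weight, input, label) calling convention used by the recipe.
    def __init__(
        self,
        loss: nn.Module,
        requires_weight_input: bool = True,
        input_order: Sequence[str] = ("weight", "input", "label"),
    ):
        super().__init__()
        self.loss = loss
        self.use_output_proj_in_loss = requires_weight_input
        self.input_order = list(input_order)

    def forward(
        self, weight: torch.Tensor, input: torch.Tensor, label: torch.Tensor
    ) -> torch.Tensor:
        # Reorder the recipe's (weight, input, label) into whatever order the
        # wrapped loss expects, e.g. ["label", "weight", "input"] for some libraries.
        named = {"weight": weight, "input": input, "label": label}
        return self.loss(*(named[name] for name in self.input_order))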

@SalmanMohammadi (Collaborator)

I wonder if we could simplify the configuration further by removing the need for the user to also specify use_output_weight_in_loss?

@SalmanMohammadi, I thought about it and even implemented it, but then realized that it would be hard to support 3rd-party libraries unless we create some sort of loss adapter, which we may need to do anyway, because not all libraries follow the pattern (weight, input, label). They may follow (label, weight, input), for example.

The loss adapter could be something like:

config.yaml

loss:
  _component_: torchtune.loss.lossadapter
  loss: path.to.loss
  requires_weight_input: True
  input_order: ["label", "weight", "input"]

I can't think of any 3rd party losses which we claim to support which would fall into this category - do you have any examples? I would say that having a stricter contract about which losses we do support would make interoperability more straightforward - i.e. a user would know exactly how to define a

class MyTorchtuneLoss(TorchtuneLossProtocol):
    def __init__(self):
        self.loss = ThirdPartyLossChunkedLoss(...)
        self.is_chunked = True

    def forward(self, weight, input, label):
        return self.loss(input, weight, label)

If we're getting too in the weeds here I'm happy with how you've implemented it in this PR and leaving this discussion as a follow up : )

@felipemello1 (Contributor Author) commented Apr 4, 2025

I can't think of any 3rd party losses which we claim to support which would fall into this category - do you have any examples?

Yes, Liger and Apple:
https://github.com/linkedin/Liger-Kernel/tree/main/src/liger_kernel/chunked_loss
https://github.com/apple/ml-cross-entropy/tree/main/cut_cross_entropy

If we're getting too in the weeds here I'm happy with how you've implemented it in this PR and leaving this discussion as a follow up : )

I think the time is now, so we don't have to refactor it again :P

@joecummings (Contributor) left a comment

Approving to unblock, but there are a few things related to consistency and documentation that should be cleaned up.

import torch


class SFTLossWithProjection(Protocol):
Contributor

Probably needs to be SFTLossWithOutputProj or something. Projection is too vague.

Contributor

I agree that this name is confusing. I think we should just standardize on "fused" or "linear", or "chunked". All the names have issues which we've discussed but if we're consistent at least people should be able to learn the term quickly.

from .loss_protocols import SFTLossWithProjection


class ChunkedCrossEntropyLoss(nn.Module, SFTLossWithProjection):
Contributor

Can you add this to the docs? Also maybe include a slightly longer description of why we might want to use this. And how we might use this in a generic training loop.

Collaborator

nit:

Suggested change
class ChunkedCrossEntropyLoss(nn.Module, SFTLossWithProjection):
class MegaProjChunkyLossinator(nn.Module, SFTLossWithProjection):

@@ -114,12 +114,6 @@ def trace_handler(
# Memory timeline sometimes fails to export
if prof.profile_memory and torch.cuda.is_available():
if rank == 0:
try:
Contributor

Why was this removed?

@@ -26,4 +26,6 @@
"get_torch_device_namespace",
"DeviceSupport",
"log_rank_zero",
"deprecated",
Contributor

Are these in the docs?

Contributor Author

good catch. I forgot to check

# set num_output_chunks for model
self._model.set_num_output_chunks(self._loss_fn.num_output_chunks)
# The loss may handle the output projection. If true, the model should skip it.
self.use_output_weight_in_loss = getattr(
Contributor

output weight & output proj? Should stay consistent, no?

Contributor Author

we should

@@ -0,0 +1,67 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Collaborator

Our protocols for tokenizers live in tokenizers/_utils.py. Do you think it's worth keeping things consistent and renaming this to _utils.py?

Contributor Author

I could move it, but it doesn't feel intuitive that protocols live in utils. Do you think it's a better choice, or is it just to keep things consistent?

Collaborator

Yeah I do like protocols better. Since we're already in the loss module maybe just protocols.py?

Contributor

^ this

def apply_compile_strategy(self, *args, **kwargs):
    """Torch compiles the loss function. Can be useful when greater control is needed,
    for example when only compiling a portion of the loss calculation."""
    self.forward = torch.compile(self.forward, *args, **kwargs)
Collaborator

Should we be doing this here? I'd vote to add this into the docstring as an example
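
Something along these lines could go in the docstring as an example (illustrative only: compute_cross_entropy stands in for whatever per-chunk helper a given loss exposes):

# Hypothetical override: compile only the per-chunk CE computation and keep
# the chunking loop eager, instead of compiling the whole forward.
def apply_compile_strategy(self, *args, **kwargs):
    self.compute_cross_entropy = torch.compile(self.compute_cross_entropy, *args, **kwargs)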


use_output_proj_in_loss: bool = False

def apply_compile_strategy(self, *args, **kwargs):
Collaborator

similar comment to above

Union[torch.Tensor, List[torch.Tensor]]: output tensor with shape ``[b x s x v]`` or a list of layer
output tensors defined by ``output_hidden_states`` with the
final output tensor appended to the list.
Union[torch.Tensor, List[torch.Tensor]]: output tensor with shape ``[b x s x v]`` if `self.skip_output_projection=False`
Collaborator

Suggested change
Union[torch.Tensor, List[torch.Tensor]]: output tensor with shape ``[b x s x v]`` if `self.skip_output_projection=False`
Union[torch.Tensor, List[torch.Tensor]]: output tensor with shape ``[b x s x v]`` if ``self.skip_output_projection=False``

my bad


use_output_proj_in_loss: bool = True

def apply_compile_strategy(self, *args, **kwargs):
Collaborator

Thoughts on naming this def compile? Is that too vague?

Contributor Author

I believe it would override the Module.compile method. We probably don't want that.

Comment on lines +60 to +61
outputs (torch.Tensor): Logits of the model. Shape [bsz, seq_len, vocab_size]
targets (torch.Tensor): Labels for the model. Shape [bsz, seq_len]
Collaborator

Suggested change
outputs (torch.Tensor): Logits of the model. Shape [bsz, seq_len, vocab_size]
targets (torch.Tensor): Labels for the model. Shape [bsz, seq_len]
outputs (torch.Tensor): Logits of the model. Shape ``[bsz, seq_len, vocab_size]``
targets (torch.Tensor): Labels for the model. Shape ``[bsz, seq_len]``

# Shift labels to compute loss
# equivalent to doing labels[..., 1:] and logits[..., :-1, :]
# But this way we don't need to slice the logits. We just add an ignore index to labels.
labels = torch.hstack(
    (labels[..., 1:], self.ignore_labels_cache[: labels.shape[0]])
)
if not isinstance(logits, list):

if self.use_output_weight_in_loss:
Collaborator

very nice

# set num_output_chunks for model
self._model.set_num_output_chunks(self._loss_fn.num_output_chunks)
# The loss may handle the output projection. If true, the model should skip it.
self.use_output_weight_in_loss = getattr(
Collaborator

tangential point: if the contract is that SFT losses follow the protocols defined in loss_protocols, do we need to make this check?

Contributor Author

Someone may try to use a loss that is not from torchtune, e.g. vanilla F.cross_entropy.



class SFTLoss(Protocol):
"""Protocol for loss functions in torchtune used in sft recipes."""
Collaborator

Suggested change
"""Protocol for loss functions in torchtune used in sft recipes."""
"""Protocol for loss functions in torchtune used in SFT recipes."""

Contributor Author

I don't know if I like "SFT" here, since it may not be obvious to a new reader what it means.



class SFTLossWithProjection(Protocol):
"""Protocol for loss functions in torchtune used in Supervised Finetune recipes and that require
Collaborator

Suggested change
"""Protocol for loss functions in torchtune used in Supervised Finetune recipes and that require
"""Protocol for loss functions in torchtune used in SFT recipes and that require

Contributor Author

I prefer "SFTI dont know if i like "SFT" here, since it may not be obvious for a new reader what it means

@SalmanMohammadi (Collaborator) left a comment

real nice

@pbontrager (Contributor) left a comment

Thanks for this big effort. This looks good and I'm happy to approve it now. Please finish going through and resolving the open comments before landing.

Comment on lines 347 to 349
# skip final projection, since the loss takes hidden input instead of logits
self.skip_unembedding = cfg.get("loss_takes_embeddings", False)
self._model.set_skip_unembedding(self.skip_unembedding)
Contributor

nit: skip_output_layer


target_chunks[idx],
)

return total_loss / total_elements
Contributor

nit: it'd be nice to offer the same 'reduction' option as most pytorch losses to control returning the mean, sum, or no reduction
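
A possible shape for that (illustrative only, mirroring torch.nn.CrossEntropyLoss semantics; not part of this PR):

# Hypothetical: accept reduction="mean" | "sum" in the constructor and branch on it here.
if self.reduction == "sum":
    return total_loss
elif self.reduction == "mean":
    return total_loss / total_elements
else:
    # "none" would require collecting per-token losses instead of a running sum
    raise NotImplementedError(f"Unsupported reduction: {self.reduction}")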

@@ -301,9 +301,12 @@ def setup(self, cfg: DictConfig) -> None:
if self._compile:
Contributor

What's the plan for rolling this out to the other sft recipes?

Contributor Author

  1. Recipes that are NOT updated should still work with configs that are NOT updated.
  2. Recipes that ARE updated will NOT work anymore with the old ce_with_chunked_outputs_loss.
  3. So any recipe that is changed also requires its configs to be updated with the new loss.

TODO: need to check if the deprecation warnings work fine. This can be checked by running a recipe/config that has not been updated.

@@ -396,6 +400,7 @@ def __init__(
self.head_dim = head_dim
self.causal_mask = None
self.num_output_chunks = 0
self._skip_output_projection = False

Contributor

You should enforce in init that the output module has the "weight" property
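
For example, a minimal guard along those lines (hypothetical; the exact attribute names depend on the model class):

# In __init__, fail fast if the output module can't expose its weight to the loss
if not hasattr(self.output, "weight"):
    raise AttributeError(
        "Skipping the output projection requires the output module to expose a "
        f"'weight' attribute, but got {type(self.output).__name__}"
    )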
