
Conversation

@josejg (Contributor) commented on Jul 28, 2025

Context

Currently, torch.nn.utils.clip_grad_norm_ and FSDP.clip_grad_norm_ apply gradient norm clipping in place but also return the pre-clip gradient norm; however, that value is not captured or logged anywhere.

We can't change the API for all gradient clipping methods since some don't produce a top-level scalar, but we can for gradient norm clipping, the one we use most frequently.

This PR propagates the value out of the helper and into the algorithm, where it can be logged. Since clipping is fairly bursty, we also compute a rolling clipping frequency over the last clipping_frequency_window samples to provide a more interpretable metric.
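A minimal sketch of the idea (the class, method, and metric names below are illustrative, not the exact ones in this PR): capture the norm returned by the in-place clipping call, log it, and keep a fixed-size window of recent steps to report a rolling clipping frequency.

```python
from collections import deque

import torch


class GradClipNormTracker:
    """Illustrative sketch: track pre-clip grad norms and a rolling clip frequency."""

    def __init__(self, clipping_threshold: float, clipping_frequency_window: int = 100):
        self.clipping_threshold = clipping_threshold
        # 1.0 if the step exceeded the threshold, else 0.0; not persisted across resumption.
        self._clipping_history: deque[float] = deque(maxlen=clipping_frequency_window)

    def clip_and_record(self, model: torch.nn.Module) -> dict[str, float]:
        # clip_grad_norm_ clips in place and returns the total norm *before* clipping.
        total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), self.clipping_threshold)
        grad_norm = float(total_norm)
        self._clipping_history.append(float(grad_norm > self.clipping_threshold))
        return {
            'grad_norm/pre_clip': grad_norm,
            'grad_norm/clipping_frequency': sum(self._clipping_history) / len(self._clipping_history),
        }
```

The returned metrics can then be emitted from the training loop at whatever interval the algorithm already uses for its other metrics.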

Known caveats

  • _clipping_history is not persisted, so the metric will change slightly upon resumption
  • If gradient clipping is not enabled, the gradient norm won't be logged. To log the norm without effectively clipping, the best workaround is to set the threshold very high, e.g. clipping_threshold: 100

Experiments

A couple of example experiments that showcase the functionality with SFT and GRPO are 2025-07-21-debug-gradient-clipping and 2025-07-25-math-rlvr-grpo, respectively.

@josejg requested a review from a team as a code owner on July 28, 2025 19:44
fsdp_enabled (bool): Whether the model is an FSDP model.
Returns:
Union[torch.Tensor, None]: The total gradient norm before clipping for 'norm' clipping type, otherwise None.
Contributor

Let's just always return the result, not just for norm.

It's a weird contract for a separate downstream function to have to know about this behavior.

Contributor

The downstream function always guards by checking whether the clipping_type is norm, so that should be good enough.
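For reference, the guard described above amounts to something like the following (the variable names and logger call here are illustrative, not the exact ones in the diff):

```python
# Only 'norm' clipping produces a scalar to log; guard on the clipping type.
if clipping_type == 'norm' and grad_norm is not None:
    logger.log_metrics({'grad_norm/pre_clip': float(grad_norm)})
```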

Contributor Author

I thought about it, but the other two options don't have top-level scalar values that can be returned.

That's why I stuck with returning None for those. I agree the contract is awkward, but we needed to propagate the norm to the logger.
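For context, this mirrors PyTorch's own API: clip_grad_norm_ returns the total pre-clip norm, while clip_grad_value_ only modifies gradients in place and returns nothing, so only the 'norm' path has a scalar to hand back. A small self-contained illustration:

```python
import torch

model = torch.nn.Linear(4, 2)
model(torch.randn(8, 4)).sum().backward()

# Norm clipping returns the total gradient norm computed before clipping.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Value clipping only modifies gradients in place; it returns None.
result = torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
assert result is None
```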

Contributor

oh lol, why do they modify in place for value and not for norm 😆 😓

Contributor

approved, not ideal, but makes sense that this is the best we can do

@josejg requested a review from irenedea on July 30, 2025 17:33