Enabling LR scaling for a specific layer (ex. down-projection...) during pretraining #1262

Open
wants to merge 3 commits into main
Conversation

@dhia680 commented Oct 28, 2024

This PR enables scaling the learning rate of a specific layer by passing its name via --scale-lr-layer and the multiplier via --lr-multiplier, reusing the existing internal scale_lr_cond and lr_mult logic.

  • Motivation:
    MuP and several follow-up papers (e.g. Depth-MuP) suggest, among other techniques such as layer-output scaling and tailored initializations, using width-dependent learning rates to enhance feature learning and prevent output layers from dominating training. Combined with proper initializations and layer-output scaling, this gives a stable setup, especially for sweeping and scaling hyperparameters during pretraining.
    MuP setting (figure)

  • Implementation:
    Generalizes the existing feature (previously used only for the LM head during finetuning) by allowing both the name of the target layer and the LR multiplier to be specified, and extends it to pretraining. When no layer is specified, the scale_lr_cond argument is None and no LR scaling is applied. A minimal usage sketch follows this list.

  • Why?:
    A GPT-like model typically has an FFN expansion factor > 1 (3.5 for Llama 3.1 70B), which suggests that the down-projection (linear_fc2 in Megatron) should use a lower LR, theoretically LR x 1/ffn_factor. With this change we don't have to add a new argument (e.g. downproj-lr-mult) each time we want to test LR scaling of a particular layer (e.g. linear_fc2).
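Below is a minimal, self-contained sketch (assumed helper names and a toy torch model, not the actual Megatron code path) of how a scale_lr_cond predicate built from --scale-lr-layer could be combined with lr_mult when constructing optimizer param groups:

import torch

def make_scale_lr_cond(scale_lr_layer):
    # Return a predicate matching parameters whose name contains the given
    # layer name, or None when no layer is specified (no LR scaling applied).
    if scale_lr_layer is None:
        return None
    return lambda name, param: scale_lr_layer in name

# Toy usage; in Megatron the layer name would be e.g. 'linear_fc2' and the
# predicate/multiplier would be passed to the optimizer's param-group builder.
model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.Linear(32, 8))
scale_lr_cond = make_scale_lr_cond('1')  # match the second Linear of the toy model
base_lr, lr_mult = 1e-3, 1.0 / 3.5       # e.g. LR x 1/ffn_factor

param_groups = []
for name, param in model.named_parameters():
    mult = lr_mult if (scale_lr_cond is not None and scale_lr_cond(name, param)) else 1.0
    param_groups.append({'params': [param], 'lr': base_lr * mult})

optimizer = torch.optim.AdamW(param_groups)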

P.S:
Layer-output scaling (before residual connections), as introduced in Depth-MuP to account for depth scaling, will be proposed in a separate PR, as will the corresponding initializations.

@dhia680 dhia680 changed the title Enabling LR scaling for certain layers (ex. down-projection) during pretraining Enabling LR scaling for a certain layer (ex. down-projection...) during pretraining Oct 28, 2024
@dhia680 dhia680 changed the title Enabling LR scaling for a certain layer (ex. down-projection...) during pretraining Enabling LR scaling for a specific layer (ex. down-projection...) during pretraining Oct 28, 2024
@dhia680 (Author) commented Nov 20, 2024

Any updates?
Here is the related issue.

@janEbert (Contributor) commented
I think this is a really nice and important addition to the code base, which is also emphasized by further research on other parametrizations (https://arxiv.org/abs/2407.05872). However, I think the implementation would ideally be even more flexible and allow specification of multiple layers and learning rates to support all current and future use cases (see, e.g., page 3, Table 1 in the linked paper).

Aside from that, I think it would be helpful to make the change backward-compatible. I.e., keep the old --head-lr-mult argument, but mark it as deprecated.

Then in the argument parsing function, do something like:

import warnings  # needed for the deprecation warning

if args.head_lr_mult != 1.0:
    warnings.warn(
        '--head-lr-mult is deprecated; please use the '
        '--scale-lr-layer and --lr-multiplier arguments instead.'
    )
    assert args.scale_lr_layer is None and args.lr_multiplier == 1.0, \
        'cannot set --scale-lr-layer or --lr-multiplier when --head-lr-mult is given.'
    args.scale_lr_layer = 'head'
    args.lr_multiplier = args.head_lr_mult

@dhia680 (Author) commented Nov 21, 2024

Thanks @janEbert.
I have 3 points to discuss before suggesting a better version.

  1. I thought of enabling multi-layer LR scaling and had a version that does so in a simple way.
    But without a deeper change in the codebase (in how lr_mult is used in megatron/core/optimizer/__init__.py to create the param_groups), that version would be limited to scaling all the mentioned layers by the same lr_mult.

  2. Another reason this multi-layer LR scaling suggestion could be questioned is that Megatron already has an equivalent decoupled_lr logic specific to the head and embeddings. So one could choose a different LR for those layers with --decoupled-lr and scale the linear_fc2 LR with --lr-multiplier, ending up with 3 different LRs in total, not just 2.

  3. I agree, it's better to keep the old --head-lr-mult argument and handle its use. This also makes me think we need an assertion for the case where both --decoupled-lr and --scale-lr-layer 'head' (or 'embedding') are used.

What do you think?

  • Enabling scaling of several layers' LRs by the same factor.
  • Or scaling several layers, each by its own factor, and in that case marking any use of --decoupled-lr as deprecated (a sketch of this option follows).
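For illustration, here is a minimal sketch of the second option (hypothetical multi-layer configuration, e.g. a mapping from parameter-name substrings to multipliers; none of these names exist in the current PR or in Megatron):

import torch

def build_param_groups(model, base_lr, layer_lr_mults):
    # layer_lr_mults maps a substring of the parameter name to an LR multiplier;
    # parameters that match no entry keep the base LR.
    groups = []
    for name, param in model.named_parameters():
        mult = 1.0
        for key, factor in layer_lr_mults.items():
            if key in name:
                mult = factor
                break
        groups.append({'params': [param], 'lr': base_lr * mult})
    return groups

# Toy usage with a plain torch module; in Megatron the mapping would come from
# the parsed arguments instead.
model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.Linear(32, 8))
optimizer = torch.optim.AdamW(
    build_param_groups(model, base_lr=1e-3,
                       layer_lr_mults={'0.weight': 0.5, '1.weight': 0.29})
)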

Let me tag @jaredcasper (a maintainer).
