Fix error of PEFT with disable adapters and FSDP #3001

Open

Isalia20 wants to merge 8 commits into huggingface:main from Isalia20:fix-fsdp-disable-adapters

Conversation

@Isalia20

@Isalia20 Isalia20 commented Jan 19, 2026

Fixes #2977

For tests, I'm not sure if the PEFT repo has multi-GPU nodes in CI for testing this. Searching through the tests, I couldn't find any distributed tests. I'd be glad to add one if you point me to where they live or to an example test.

@Isalia20
Author

@githubnemo Tagging for review

@varunlmxd

I tried this and got this error:
AssertionError: Invalid: ``_post_backward_hook_state``: (<AccumulateGrad object at 0x7ff3b925a230>, <torch.utils.hooks.RemovableHandle object at 0x7ff3b925a200>)

@Isalia20
Author

Isalia20 commented Jan 21, 2026

I think that is a different error, not related to this one. I had a similar error in my training, which no longer occurs after this change.

@githubnemo
Collaborator

For tests, I'm not sure if the PEFT repo has multi-GPU nodes in CI for testing this. Searching through the tests, I couldn't find any distributed tests. I'd be glad to add one if you point me to where they live or to an example test.

The merged PR #2856 includes deepspeed/fsdp for multi GPU, maybe that helps already?

@Isalia20
Author

@githubnemo I think that should have fixed the bugs. Can you run the workflow / review when you get a chance?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Isalia20
Author

Hmm, I think the error in the check_code_quality workflow is not related to this PR.

Collaborator

@githubnemo githubnemo left a comment

Thanks for the updates.

I've run the test on a multi-GPU machine and got the following error:

Testing disable_adapters()...
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/ubuntu/code/peft/githubnemo/tests/training/test_fsdp_adapters.py", line 256, in <module>
[rank1]:     main(test_name=args.test, model_id=args.model_id, quant=args.quant)
[rank1]:   File "/home/ubuntu/code/peft/githubnemo/tests/training/test_fsdp_adapters.py", line 237, in main
[rank1]:     test_disable_adapters(model_id, quant)
[rank1]:   File "/home/ubuntu/code/peft/githubnemo/tests/training/test_fsdp_adapters.py", line 115, in test_disable_adapters
[rank1]:     model.disable_adapters()
[rank1]:   File "/home/ubuntu/envs/peft-githubnemo/lib/python3.12/site-packages/transformers/integrations/peft.py", line 724, in disable_adapters
[rank1]:     raise ValueError("No adapter loaded. Please load an adapter first.")
[rank1]: ValueError: No adapter loaded. Please load an adapter first.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/code/peft/githubnemo/tests/training/test_fsdp_adapters.py", line 256, in <module>
[rank0]:     main(test_name=args.test, model_id=args.model_id, quant=args.quant)
[rank0]:   File "/home/ubuntu/code/peft/githubnemo/tests/training/test_fsdp_adapters.py", line 237, in main
[rank0]:     test_disable_adapters(model_id, quant)
[rank0]:   File "/home/ubuntu/code/peft/githubnemo/tests/training/test_fsdp_adapters.py", line 115, in test_disable_adapters
[rank0]:     model.disable_adapters()
[rank0]:   File "/home/ubuntu/envs/peft-githubnemo/lib/python3.12/site-packages/transformers/integrations/peft.py", line 724, in disable_adapters
[rank0]:     raise ValueError("No adapter loaded. Please load an adapter first.")
[rank0]: ValueError: No adapter loaded. Please load an adapter first.

I also left some comments.

Hmm, I think the error in the check_code_quality workflow is not related to this PR.

Yep, you can ignore that for now.

Comment on lines 213 to 221
# Test set_adapter - should not raise
print_if_process_zero("Testing set_adapter('adapter2')...")
model.set_adapter("adapter2")
print_if_process_zero("set_adapter('adapter2') succeeded!")

# Test switching back
print_if_process_zero("Testing set_adapter('adapter1')...")
model.set_adapter("adapter1")
print_if_process_zero("set_adapter('adapter1') succeeded!")
Collaborator

I wonder if the test should mimic a more realistic training scenario, i.e. switching to adapter 1, training a bit, switching to adapter 2, training that.
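
For illustration, such a two-adapter training phase could look roughly like this (a hedged sketch only; names like optimizer and batch are assumptions mirroring the test setup, not the PR's actual test code):

def run_two_adapter_training(model, optimizer, batch, num_steps=2):
    # Hypothetical sketch only: train the first adapter for a few steps.
    model.set_adapter("adapter1")
    for _ in range(num_steps):
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Then switch adapters and train the second one the same way.
    model.set_adapter("adapter2")
    for _ in range(num_steps):
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()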

@Isalia20
Author

Thanks for running this. I'll take a look at the comments in the upcoming days and fix it, and probably get a multi-GPU machine to test this out properly.

@Isalia20
Author

Updated the tests, resolved the comments, and tested on a multi-GPU device. Should be ready now.

@Isalia20
Author

Isalia20 commented Feb 3, 2026

Hmm, I think the test errors are not related to this PR 🤔 @githubnemo wdyt?

@githubnemo
Collaborator

@Isalia20 yep, they're not related. If you merge with main this issue will disappear.

Member

@BenjaminBossan BenjaminBossan left a comment

Hi @Isalia20, thanks for tackling this issue and providing a solution and a test. I ran the test on my machine and can confirm that it works. Besides what @githubnemo said, could you please check my comments?

super().__init__()
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
Member

Suggested change:
-    torch_dtype=torch.bfloat16,
+    dtype=torch.bfloat16,

Author

updated

Member

Let's fix the random seed, just to be sure.

Author

@Isalia20 Isalia20 Feb 15, 2026

I already set the seed with
torch.manual_seed(42)
on line 118; is there another way I should set it?

Member

I must have missed that, thanks.

final_base_weights = get_base_model_weights(model.peft_model)
final_second_adapter_weights = get_adapter_weights(model.peft_model, "second_adapter")

verify_weights_unchanged(initial_base_weights, final_base_weights, "Base model")
Member

Could you explain why we check if the base weights changed? Is this something we would ever expect?

Author

We wouldn't expect the base weights to change. The test is to make sure that with this FSDP setup the base weights remain unchanged. It was initially requested by @githubnemo, and I also think it's a good thing to check in this test.
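
For context, a minimal sketch of what such a check could look like (the helper names match the test snippet above, but these bodies are illustrative assumptions that ignore FSDP sharding details, not the PR's actual implementation):

import torch

def get_base_model_weights(peft_model):
    # Snapshot all non-adapter parameters so they can be compared after training.
    return {
        name: param.detach().clone()
        for name, param in peft_model.named_parameters()
        if "lora_" not in name
    }

def verify_weights_unchanged(before, after, label):
    # Sanity check: the snapshotted weights must be identical before and after.
    for name, old in before.items():
        assert torch.equal(old, after[name]), f"{label} weight '{name}' changed unexpectedly"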

Member

Ah I see. Let's add a comment that this is a sanity check.

Member

@BenjaminBossan BenjaminBossan left a comment

Thanks for the updates @Isalia20. Could you please run make style?

@Isalia20
Author

Oh, my bad, I thought pre-commit did that. Let me run it and update.

@Isalia20
Author

Updated

Member

@BenjaminBossan BenjaminBossan left a comment

Thanks for the latest update. The failing CI is unrelated and can thus be ignored.

I re-reviewed the PR, and I wonder if the two changes (in enable_adapters and set_adapter) could be factored out into a single helper function, since the code is pretty similar. Moreover, I think the same logic needs to be applied here:

def set_requires_grad(self, adapter_names: str | Sequence[str], requires_grad: bool = True) -> None:
    """
    Enable or disable gradients on the given adapter(s).

    Args:
        adapter_name (`str` or `Sequence[str]`):
            The name of the adapter(s) whose gradients should be enabled/disabled.
        requires_grad (`bool`, *optional*)
            Whether to enable (`True`, default) or disable (`False`).
    """
    if isinstance(adapter_names, str):
        adapter_names_set = {adapter_names}
    else:
        adapter_names_set = set(adapter_names)
    for layer_name in self.adapter_layer_names:
        # use attrgetter, as it resolves `.` in the attribute name
        module_dict = attrgetter(layer_name)(self)
        for key, layer in module_dict.items():
            if key in adapter_names_set:
                layer.requires_grad_(requires_grad)

The helper function could be called there as well. WDYT?
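
For illustration, such a shared helper might look roughly like this (the name and its call sites in enable_adapters/set_adapter are assumptions; only the loop body is taken from the set_requires_grad code quoted above):

from operator import attrgetter

def _set_adapter_requires_grad(self, adapter_names_set, requires_grad):
    # Hypothetical shared helper: set_requires_grad, enable_adapters and
    # set_adapter could all delegate the per-layer loop to this method.
    for layer_name in self.adapter_layer_names:
        # use attrgetter, as it resolves `.` in the attribute name
        module_dict = attrgetter(layer_name)(self)
        for key, layer in module_dict.items():
            if key in adapter_names_set:
                layer.requires_grad_(requires_grad)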

Successfully merging this pull request may close these issues.

Unable to use enable/disable_adapter_layers in FSDP training
