Title
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment
Published Date
2025-02-16
Source
arXiv
Head Name
Functional Head
Summary
Innovation: The paper introduces Soteria, a method for improving multilingual safety in large language models by steering the language-specific "functional heads" responsible for harmful content generation. It modifies only about 3% of model parameters, reducing policy violations and harmful outputs and achieving safety alignment without degrading overall model performance.
Tasks: The study identifies the attention heads most important to each language in multilingual models and selectively tunes them to mitigate harmful output. Experiments use datasets such as MultiJail and XThreatBench, whose multilingual prompts probe model safety across languages and resource levels.
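A minimal sketch of what such language-specific head selection and steering could look like, assuming per-head importance scores for a language are already available (e.g., from ablation deltas). The scoring inputs, the top-3%-of-heads budget as a stand-in for the ~3% parameter budget, and the harm-direction projection are illustrative assumptions, not the paper's exact procedure:

```python
# Hypothetical sketch of Soteria-style head selection and steering.
# The importance scores, the 3%-of-heads budget, and the "harm direction"
# below are illustrative assumptions, not the paper's exact method.
import torch


def select_functional_heads(importance: torch.Tensor, budget_frac: float = 0.03):
    """Rank (layer, head) pairs by a per-language importance score and keep
    the top-ranked heads until roughly `budget_frac` of all heads are
    selected (a stand-in for the ~3% parameter budget)."""
    n_layers, n_heads = importance.shape
    k = max(1, int(budget_frac * n_layers * n_heads))
    top = torch.topk(importance.flatten(), k).indices
    return [(int(i) // n_heads, int(i) % n_heads) for i in top]


def steer_head_outputs(o_proj: torch.Tensor, heads, head_dim: int,
                       harm_dir: torch.Tensor, alpha: float = 1.0):
    """Dampen a (hypothetical) harmful-content direction in the output
    projection columns belonging to the selected heads of one layer.

    o_proj:   [d_model, n_heads * head_dim] output-projection weight.
    harm_dir: [d_model] unit vector, e.g. estimated from harmful-vs-benign
              activation differences (an assumption for illustration)."""
    with torch.no_grad():
        for _, h in heads:
            cols = slice(h * head_dim, (h + 1) * head_dim)
            block = o_proj[:, cols]
            # Project the harmful direction out of this head's contribution.
            o_proj[:, cols] = block - alpha * torch.outer(harm_dir, harm_dir @ block)
    return o_proj


if __name__ == "__main__":
    torch.manual_seed(0)
    n_layers, n_heads, head_dim, d_model = 32, 32, 128, 4096
    # Toy importance scores for one language (e.g., from ablation deltas).
    importance = torch.rand(n_layers, n_heads)
    heads = select_functional_heads(importance)
    print(f"selected {len(heads)} of {n_layers * n_heads} heads")

    o_proj = torch.randn(d_model, n_heads * head_dim)
    harm_dir = torch.nn.functional.normalize(torch.randn(d_model), dim=0)
    layer0_heads = [(l, h) for (l, h) in heads if l == 0]
    steer_head_outputs(o_proj, layer0_heads, head_dim, harm_dir)
```

Since only the selected heads' projection columns are edited, the rest of the model is untouched, which is what keeps the intervention small relative to the full parameter count.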
Significant Result: Soteria significantly reduces the attack success rate (ASR) of harmful content generation across high-, mid-, and low-resource languages, achieving lower ASR than existing safety mechanisms while maintaining model utility in diverse linguistic contexts.
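For reference, ASR is simply the fraction of adversarial prompts whose responses a judge flags as harmful, reported here per language resource tier. A toy computation (the verdicts and tier labels are made up for demonstration):

```python
# Toy attack-success-rate (ASR) computation per language resource tier.
# Verdicts are hypothetical judge labels, not real evaluation data.
from collections import defaultdict


def attack_success_rate(verdicts):
    """verdicts: iterable of (language_tier, is_harmful) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for tier, harmful in verdicts:
        totals[tier] += 1
        hits[tier] += int(harmful)
    return {tier: hits[tier] / totals[tier] for tier in totals}


print(attack_success_rate([
    ("high", False), ("high", True),  # e.g., English, Chinese
    ("mid", True), ("mid", True),     # e.g., Thai
    ("low", False), ("low", True),    # e.g., Swahili
]))
# -> {'high': 0.5, 'mid': 1.0, 'low': 0.5}
```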