
Add new paper: #52

Open
wyzh0912 opened this issue Feb 23, 2025 · 0 comments

Comments

@wyzh0912 (Contributor)

Title

Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment

Published Date

2025-02-16

Source

arXiv

Head Name

Functional Head

Summary

  • Innovation: The paper introduces Soteria, a method for improving multilingual safety in large language models by adjusting language-specific "functional heads" responsible for harmful content generation. It requires modifying only about 3% of model parameters to reduce policy violations and harmful outputs, achieving safety alignment without degrading overall model performance.

  • Tasks: The study identifies the attention heads most important to each language in multilingual models and selectively tunes them to mitigate harmful outputs. It runs experiments on datasets such as MultiJail and XThreatBench, which contain multilingual prompts designed to test model safety across different languages and resource levels.

  • Significant Result: Soteria significantly reduces the attack success rate (ASR) of harmful content generation across high-, mid-, and low-resource languages. It is shown to be more effective than existing safety mechanisms, achieving lower ASR while maintaining model utility in diverse linguistic contexts.
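The selective head-tuning idea summarized above can be sketched roughly as: score each attention head's importance for a given language, take the small top fraction (~3% of parameters in the paper) as the "functional heads", and steer only those. Everything in this sketch is an illustrative assumption — the gradient-times-activation importance heuristic, the function names, the top-k cutoff, and the dampening factor are not Soteria's actual procedure, which is specified in the paper.

```python
import numpy as np

def head_importance(grads, acts):
    # Hypothetical attribution heuristic: mean |gradient * activation| per head.
    # Shapes: (layers, heads, dim) -> importance of shape (layers, heads).
    return np.abs(grads * acts).mean(axis=-1)

def select_language_heads(importance, frac=0.03):
    # Pick the top `frac` fraction of heads as language-specific "functional heads".
    # The 3% default mirrors the parameter fraction cited in the summary (assumption:
    # treating it as a head-count fraction here for simplicity).
    k = max(1, int(round(frac * importance.size)))
    idx = np.argpartition(importance.ravel(), -k)[-k:]
    return np.unravel_index(idx, importance.shape)

def steer(params, head_idx, alpha=0.5):
    # Dampen the selected heads' parameters by `alpha` (illustrative steering;
    # the paper tunes these heads rather than simply scaling them).
    steered = params.copy()
    steered[head_idx] *= alpha
    return steered

# Usage: 32 layers x 16 heads, random stand-in gradients/activations.
rng = np.random.default_rng(0)
imp = head_importance(rng.normal(size=(32, 16, 64)), rng.normal(size=(32, 16, 64)))
heads = select_language_heads(imp, frac=0.03)   # ~3% of 512 heads -> 15 heads
steered = steer(np.ones((32, 16)), heads)       # only those 15 entries are scaled
```

The point of the sketch is the locality of the intervention: the vast majority of parameters are untouched, which is why the summary reports safety gains without degrading overall model performance.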
