Title
Understanding and Mitigating Gender Bias in LLMs via Interpretable Neuron Editing
Published Date
2025-01-24
Source
arXiv
Head Name
Gender Head
Summary
Innovation: The paper introduces an interpretable neuron editing method to mitigate gender bias in LLMs by targeting specific biased neurons while preserving the model’s original capabilities. This approach combines logit-based and causal-based strategies to selectively edit neurons responsible for gender bias.
Tasks: The study introduces the CommonWords dataset to evaluate gender bias across five LLMs. Through analysis of neuron circuits, it identifies "gender neurons" and "general neurons" that contribute to biased outputs, then runs experiments to validate the proposed neuron editing method.
Significant Result: The proposed method effectively reduces gender bias in LLMs while maintaining their original capabilities, outperforming existing fine-tuning and neuron editing approaches across several benchmarks, as demonstrated by improvements in ICAT scores and reductions in entropy differences.
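The core editing operation described above can be sketched as follows. This is a minimal, hypothetical illustration only: the paper selects biased neurons with a combination of logit-based and causal-based scores, whereas here the index set `biased_idx` and the function name `edit_neurons` are assumed for demonstration, and the edit is modeled as simple scaling of the targeted activations.

```python
import numpy as np

def edit_neurons(activations: np.ndarray, biased_idx: list[int],
                 scale: float = 0.0) -> np.ndarray:
    """Suppress selected neurons by scaling their activations.

    Hypothetical sketch: `biased_idx` stands in for the "gender neurons"
    the paper identifies via logit-based and causal-based analysis.
    Setting scale=0.0 zeroes them out; other neurons are untouched,
    which is how the method preserves the model's general capabilities.
    """
    edited = activations.copy()
    edited[..., biased_idx] *= scale
    return edited

# Toy hidden-layer activations for one token (4 neurons).
acts = np.array([[0.5, -1.2, 2.0, 0.3]])
out = edit_neurons(acts, biased_idx=[1, 2], scale=0.0)
# Neurons 1 and 2 are zeroed; neurons 0 and 3 keep their values.
```

Because only the identified neurons are edited, the intervention is interpretable and localized, in contrast to fine-tuning, which updates weights globally.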