This repository contains the code and experiments for the paper Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework by Kopf et al., 2025.
Unlike prior approaches that assign a single description per feature, PRISM (Polysemantic FeatuRe Identification and Scoring Method) provides more nuanced descriptions for both polysemantic and monosemantic features. PRISM samples sentences from the top percentile of the activation distribution, clusters them in embedding space, and uses an LLM to generate a label for each concept cluster. We benchmark PRISM across various layers and architectures, showing how polysemanticity and interpretability shift through the model. In exploring the concept space, we use PRISM to characterize more complex components, finding and interpreting patterns that specific attention heads or groups of neurons respond to. Our findings show that the PRISM framework not only provides multiple human-interpretable descriptions per neuron but also aligns with the human interpretation of polysemanticity.
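The sample–cluster–label pipeline described above can be sketched as follows. This is a toy illustration, not the repository's implementation: the embedding model and the LLM labelling call are stubbed out, and `top_fraction`, `kmeans`, and `label_cluster` are hypothetical names.

```python
import random

def top_fraction(sentences, activations, frac=0.01):
    """Keep the sentences whose activations fall in the top fraction
    (a stand-in for top-percentile sampling)."""
    k = max(1, int(len(sentences) * frac))
    ranked = sorted(zip(activations, sentences), reverse=True)
    return [s for _, s in ranked[:k]]

def _dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def _mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means over embedding vectors; PRISM clusters sentence
    embeddings, for which any off-the-shelf clusterer would do."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: _dist2(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [_mean(cl) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return clusters

def label_cluster(cluster):
    """Placeholder for the LLM labelling step: PRISM prompts an LLM
    with each cluster's sentences to name the shared concept."""
    return f"concept ({len(cluster)} example sentences)"
```

In the real pipeline each point is a sentence embedding and each cluster is passed to an LLM for labelling; here the label is a placeholder string.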
The repository is organized as follows:
- assets/explanations/ – Pre-computed feature descriptions from various feature description methods.
- descriptions/ – Feature descriptions generated with PRISM.
- generated_text/ – Concept text samples generated for evaluation purposes.
- notebooks/ – Contains a Jupyter notebook for reproducing the benchmark table and plots shown in the paper.
- src/ – Core source code, including all necessary functions for running feature description and evaluation.
Install the necessary packages using the provided requirements.txt:
```shell
pip install -r requirements.txt
```

First, set parameters in src/utils/config.py or use the default parameters.
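For orientation, a configuration module of this kind typically collects a handful of module-level constants. The names and values below are illustrative guesses (only EXPLAIN_FILE is mentioned in this README); check src/utils/config.py for the actual parameters:

```python
# Hypothetical sketch of what src/utils/config.py might expose; the
# actual parameter names and defaults in the repository may differ.
MODEL_NAME = "gpt2"            # model whose features are described (guess)
LAYER = 6                      # layer to probe (guess)
PERCENTILE = 99                # activation percentile for sampling (guess)
N_CLUSTERS = 5                 # concept clusters per feature (guess)
EXPLAIN_FILE = "features.csv"  # features to describe (name from this README; path is a guess)
```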
The following script outputs multiple feature descriptions for a single feature, based on percentile sampling and clustering:
```shell
python src/feature_description.py
```

To generate descriptions for multiple features, define EXPLAIN_FILE in config.py and run:
```shell
python src/run_feature_description.py
```

Generated feature descriptions are written to the descriptions/ folder.
Evaluate feature descriptions with CoSy scores:
```shell
python src/evaluation.py
```

Generated concept samples are written to the generated_text/ folder.
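CoSy evaluates a description by generating text for it and checking whether the feature actually activates more strongly on that text than on control text. A minimal sketch of the two comparison statistics, assuming toy activation values and that MAD is the control-normalised mean difference; the exact formulas used in the repository may differ:

```python
def auc(concept_acts, control_acts):
    """Probability that a random concept activation exceeds a random
    control activation (ties count half); 1.0 means perfect separation."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in concept_acts for n in control_acts)
    return wins / (len(concept_acts) * len(control_acts))

def mad(concept_acts, control_acts):
    """Mean activation difference between concept and control text,
    scaled by the standard deviation of the control activations."""
    mu = sum(control_acts) / len(control_acts)
    sd = (sum((n - mu) ** 2 for n in control_acts) / len(control_acts)) ** 0.5
    return (sum(concept_acts) / len(concept_acts) - mu) / sd
```

A faithful description should give both a high AUC and a high MAD for the feature it describes.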
Evaluate all feature descriptions per feature with the polysemanticity score (cosine similarity), max AUC, and max MAD:
```shell
python src/meta_evaluation.py
```

All evaluation scores are written to the results/ folder.
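The polysemanticity score above is based on cosine similarity between the multiple descriptions assigned to one feature. A minimal sketch, assuming each description is embedded as a vector and pairwise similarities are averaged (low average similarity suggests the descriptions cover distinct concepts, i.e. a polysemantic feature); `mean_pairwise_similarity` is a hypothetical name and the paper's exact scoring may differ:

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all pairs of description embeddings."""
    sims = [cosine(vectors[i], vectors[j])
            for i in range(len(vectors))
            for j in range(i + 1, len(vectors))]
    return sum(sims) / len(sims)
```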
To generate meta-labels for the concepts found in feature descriptions, run:
```shell
python src/run_concept_summary.py
```

All meta-label results are written to the metalabels/ folder.
If you find this work interesting or useful in your research, please use the following BibTeX entry to cite us:
```bibtex
@misc{kopf2025prism,
  title={Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework},
  author={Laura Kopf and Nils Feldhus and Kirill Bykov and Philine Lou Bommer and Anna Hedström and Marina M.-C. Höhne and Oliver Eberle},
  year={2025},
  eprint={2506.15538},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.15538},
}
```

This work is currently under review.
We hope our repository is beneficial to your work and research. If you have any feedback, questions, or ideas, please feel free to raise an issue in this repository. Alternatively, you can reach out to us directly via email for more in-depth discussions or suggestions.
📧 Contact us:
- Laura Kopf: kopf[at]tu-berlin.de
Thank you for your interest and support!
