PRISM is a multi-concept feature description framework which can identify and score polysemantic features.

Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework

This repository contains the code and experiments for the paper "Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework" by Kopf et al. (2025).

About

Unlike prior approaches that assign a single description per feature, PRISM (Polysemantic FeatuRe Identification and Scoring Method) provides more nuanced descriptions for both polysemantic and monosemantic features. PRISM samples sentences from the top percentile of the activation distribution, clusters them in embedding space, and uses an LLM to generate a label for each concept cluster. We benchmark PRISM across various layers and architectures, showing how polysemanticity and interpretability shift through the model. In exploring the concept space, we use PRISM to characterize more complex components, finding and interpreting patterns that specific attention heads or groups of neurons respond to. Our findings show that the PRISM framework not only provides multiple human-interpretable descriptions for neurons but also aligns with the human interpretation of polysemanticity.
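
The sampling and clustering steps described above can be illustrated with a minimal sketch. This is not the code in src/; it assumes a nearest-rank percentile threshold and plain k-means, and the function names, toy sentences, activations, and two-dimensional embeddings are all made up for illustration. The final step, prompting an LLM to label each cluster, is omitted.

```python
import math
import random


def top_percentile_indices(activations, percentile=99.0):
    """Indices of samples whose activation reaches the given percentile (nearest-rank)."""
    ranked = sorted(activations)
    rank = max(1, math.ceil(percentile / 100.0 * len(ranked)))
    threshold = ranked[rank - 1]
    return [i for i, a in enumerate(activations) if a >= threshold]


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means (an assumed clustering choice); returns one cluster id per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy data: one activation and one embedding per sentence.
sentences = ["river bank", "bank loan", "steep bank", "bank account", "the cat sat"]
activations = [0.9, 0.8, 0.85, 0.95, 0.1]
embeddings = [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.9, 0.1], [0.5, 0.5]]

# A low percentile is used only because the toy dataset is tiny.
selected = top_percentile_indices(activations, percentile=25.0)
labels = kmeans([embeddings[i] for i in selected], k=2)

clusters = {}
for i, label in zip(selected, labels):
    clusters.setdefault(label, []).append(sentences[i])

# Each group of sentences would then be passed to an LLM to obtain one description.
for label, members in sorted(clusters.items()):
    print(label, members)
```

In practice the embeddings would come from a sentence encoder and the number of clusters would be a configuration choice; both are left open here.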

Repository Overview

The repository is organized as follows:

  • assets/explanations/ – Pre-computed feature descriptions from various feature description methods.
  • descriptions/ – Feature descriptions generated with PRISM.
  • generated_text/ – Concept text samples generated for evaluation purposes.
  • notebooks/ – Jupyter notebook for reproducing the benchmark table and plots shown in the paper.
  • src/ – Core source code, including all necessary functions for running feature description and evaluation.

Installation

Install the necessary packages using the provided requirements.txt:

pip install -r requirements.txt

Running Experiments

First, set the parameters in src/utils/config.py or use the default parameters.

1. Feature Descriptions

The following script generates multiple feature descriptions for a single feature, based on percentile sampling and clustering:

python src/feature_description.py

To generate descriptions for multiple features, define EXPLAIN_FILE in config.py and run:

python src/run_feature_description.py

Generated feature descriptions can be found in the descriptions/ folder.

2. Evaluation

Evaluate feature descriptions with CoSy scores:

python src/evaluation.py

Generated concept samples can be found in the generated_text/ folder.
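
As a rough illustration of what the CoSy scores measure, the sketch below compares a feature's activations on generated concept samples against its activations on control samples. It is not the code in src/evaluation.py; it assumes that AUC is the probability that a concept activation exceeds a control activation and that MAD is the difference in mean activation divided by the standard deviation of the control activations (the sample standard deviation is an assumption). The function names and numbers are hypothetical.

```python
import statistics


def auc(control, concept):
    """Fraction of (concept, control) pairs where the concept activation is higher; ties count 0.5."""
    wins = 0.0
    for c in concept:
        for b in control:
            if c > b:
                wins += 1.0
            elif c == b:
                wins += 0.5
    return wins / (len(concept) * len(control))


def mad(control, concept):
    """Mean activation difference, scaled by the control standard deviation (assumed definition)."""
    return (statistics.mean(concept) - statistics.mean(control)) / statistics.stdev(control)


# Hypothetical activations of one feature.
control_activations = [0.0, 1.0, 2.0]
concept_activations = [3.0, 4.0, 5.0]

print("AUC:", auc(control_activations, concept_activations))
print("MAD:", mad(control_activations, concept_activations))
```

Under these assumptions, a description that matches the feature well yields an AUC close to 1 and a large positive MAD, while an unrelated description yields an AUC near 0.5 and a MAD near 0.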

Evaluate all feature descriptions per feature with the polysemanticity score (cosine similarity), max AUC, and max MAD:

python src/meta_evaluation.py

All evaluation scores can be found in the results/ folder.
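
The per-feature aggregation can be pictured with a short sketch. It is an illustration under assumptions, not the logic of src/meta_evaluation.py: the polysemanticity score is taken here to be the mean pairwise cosine similarity between the embeddings of a feature's descriptions, and max AUC and max MAD are taken to be the best score over those descriptions. All names and values are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all pairs of description embeddings (assumed score)."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


def best_scores(per_description):
    """Highest AUC and highest MAD over all descriptions of one feature."""
    return (
        max(d["auc"] for d in per_description),
        max(d["mad"] for d in per_description),
    )


# Hypothetical embeddings and scores for three descriptions of one feature.
description_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
description_scores = [
    {"auc": 0.7, "mad": 1.2},
    {"auc": 0.9, "mad": 0.4},
    {"auc": 0.6, "mad": 0.8},
]

print("similarity:", mean_pairwise_cosine(description_embeddings))
print("max AUC, max MAD:", best_scores(description_scores))
```

With this reading, a low mean similarity indicates that the descriptions cover distinct concepts, which is what one would expect for a polysemantic feature.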

3. Meta-labels

To generate meta-labels for concepts found in feature descriptions, run:

python src/run_concept_summary.py

All meta-label results can be found in the metalabels/ folder.

Citation

If you find this work interesting or useful in your research, please cite it using the following BibTeX entry:

@misc{kopf2025prism,
      title={Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework}, 
      author={Laura Kopf and Nils Feldhus and Kirill Bykov and Philine Lou Bommer and Anna Hedström and Marina M. -C. Höhne and Oliver Eberle},
      year={2025},
      eprint={2506.15538},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.15538}, 
}

This work is in review.

Thank you

We hope our repository is beneficial to your work and research. If you have any feedback, questions, or ideas, please feel free to raise an issue in this repository. Alternatively, you can reach out to us directly via email for more in-depth discussions or suggestions.

📧 Contact us:

  • Laura Kopf: kopf[at]tu-berlin.de

Thank you for your interest and support!
