GitHub - lkopf/prism: PRISM is a multi-concept feature description framework which can identify and score polysemantic features.

Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework

This repository contains the code and experiments for the paper Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework by Kopf et al., 2025.

About

Unlike prior approaches that assign a single description per feature, PRISM (Polysemantic FeatuRe Identification and Scoring Method) provides more nuanced descriptions for both polysemantic and monosemantic features. PRISM samples sentences from the top percentile of the activation distribution, clusters them in embedding space, and uses an LLM to generate labels for each concept cluster. We benchmark PRISM across various layers and architectures, showing how polysemanticity and interpretability shift through the model. In exploring the concept space, we use PRISM to characterize more complex components, finding and interpreting patterns that specific attention heads or groups of neurons respond to. Our findings show that the PRISM framework not only provides multiple human-interpretable descriptions for neurons but also aligns with the human interpretation of polysemanticity.
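
The sampling and clustering steps described above can be illustrated with a minimal, dependency-free sketch. This is not the code in src/. The function names, the nearest-rank percentile rule, and the use of plain k-means are assumptions made for illustration, and the embeddings are expected to come from whatever sentence encoder you use.

```python
import math
import random


def top_percentile_indices(activations, percentile=99.0):
    """Indices of samples whose activation is at or above the given percentile."""
    ranked = sorted(activations)
    # Nearest-rank percentile: the value at position ceil(p/100 * n), 1-indexed.
    rank = max(1, math.ceil(percentile / 100.0 * len(ranked)))
    threshold = ranked[rank - 1]
    return [i for i, a in enumerate(activations) if a >= threshold]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans_labels(points, k, iters=20, seed=0):
    """Plain k-means (an illustrative stand-in); returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels
```

In this sketch, the sentences selected by top_percentile_indices would be embedded, grouped with kmeans_labels, and each resulting group passed to an LLM for labelling. The percentile, the number of clusters, and the clustering algorithm used in the paper may differ.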

Repository Overview

The repository is organized for ease of use:

assets/explanations/ – Pre-computed feature descriptions from various feature description methods.
descriptions/ – Feature descriptions generated with PRISM.
generated_text/ – Concept text samples generated for evaluation purposes.
notebooks/ – Contains a Jupyter notebook for reproducing the benchmark table and plots shown in the paper.
src/ – Core source code, including all necessary functions for running feature description and evaluation.

Installation

Install the necessary packages using the provided requirements.txt:

pip install -r requirements.txt

Running Experiments

First, set the parameters in src/utils/config.py, or keep the defaults.

1. Feature Descriptions

The following script generates multiple descriptions for a single feature, based on percentile sampling and clustering:

python src/feature_description.py

To generate descriptions for multiple features, define EXPLAIN_FILE in config.py and run:

python src/run_feature_description.py

The generated feature descriptions are saved in the descriptions/ folder.

2. Evaluation

Evaluate feature descriptions with CoSy scores:

python src/evaluation.py

The generated concept samples are saved in the generated_text/ folder.
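
For orientation, the two CoSy-style scores referred to in this README (AUC and MAD) can be sketched as follows. This is a simplified reading of the metrics, not the implementation in src/evaluation.py. It assumes you already have a feature's activations on control text and on text generated for a description, and the choice of the population standard deviation is an assumption.

```python
import statistics


def auc_score(control, concept):
    """Probability that a concept activation exceeds a control activation.

    Ties count as 0.5, which matches the rank-based definition of AUC.
    """
    wins = 0.0
    for x in concept:
        for y in control:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(concept) * len(control))


def mad_score(control, concept):
    """Mean activation difference, in units of the control standard deviation."""
    spread = statistics.pstdev(control)  # assumption: population std of control
    return (statistics.mean(concept) - statistics.mean(control)) / spread
```

An AUC near 0.5 would mean the feature does not separate the described concept from control text, while values near 1.0 indicate that it activates more strongly on the concept samples.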

Evaluate all descriptions of each feature with the polysemanticity score (cosine similarity), max AUC, and max MAD:

python src/meta_evaluation.py

All evaluation scores are saved in the results/ folder.
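
As a rough illustration of these per-feature scores, the sketch below averages the pairwise cosine similarity between description embeddings and takes the best AUC and MAD over a feature's descriptions. The aggregation by mean over pairs, the function names, and the dictionary keys are assumptions made for this example. src/meta_evaluation.py may compute the scores differently.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all pairs of description embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


def best_scores(per_description_scores):
    """Best AUC and best MAD across all descriptions of one feature."""
    return {
        "max_auc": max(s["auc"] for s in per_description_scores),
        "max_mad": max(s["mad"] for s in per_description_scores),
    }
```

Under this reading, a low mean similarity would suggest that a feature's descriptions cover clearly distinct concepts, while a value close to 1 would suggest near-duplicate descriptions.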

3. Meta-labels

To generate meta-labels for the concepts found in the feature descriptions, run:

python src/run_concept_summary.py

All meta-label results are saved in the metalabels/ folder.

Citation

If you find this work interesting or useful in your research, please cite us using the following BibTeX entry:

@misc{kopf2025prism,
      title={Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework}, 
      author={Laura Kopf and Nils Feldhus and Kirill Bykov and Philine Lou Bommer and Anna Hedström and Marina M.-C. Höhne and Oliver Eberle},
      year={2025},
      eprint={2506.15538},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.15538}, 
}

This work is in review.

Thank you

We hope our repository is beneficial to your work and research. If you have any feedback, questions, or ideas, please feel free to raise an issue in this repository. Alternatively, you can reach out to us directly via email for more in-depth discussions or suggestions.

📧 Contact us:

Laura Kopf: kopf[at]tu-berlin.de

Thank you for your interest and support!
