Skip to content

EleutherAI/attribute

 
 

Repository files navigation

attribute

Reproduction of Anthropic's Attribution Graphs.

Setup

# install uv if you don't have it
# curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --seed
uv sync

Cache activations

This step is not strictly necessary. Given the weights of a transcoder, we need to compute activations on a large dataset to be able to select maximum-activating examples.

# scripts/cache_activations
MODEL=HuggingFaceTB/SmolLM2-135M
TRANSCODER="nev/SmolLM2-CLT-135M-73k-k32"
DATASET="--dataset_repo EleutherAI/fineweb-edu-dedup-10b --dataset_split train --n_tokens 10_000_000"
NAME="smollm-v1"

uv run python cache.py $MODEL $TRANSCODER --num_gpus 1 $DATASET --hookpoints layers.0.mlp layers.1.mlp layers.2.mlp layers.3.mlp layers.4.mlp layers.5.mlp layers.6.mlp layers.7.mlp layers.8.mlp layers.9.mlp layers.10.mlp layers.11.mlp layers.12.mlp layers.13.mlp layers.14.mlp layers.15.mlp layers.16.mlp layers.17.mlp layers.18.mlp layers.19.mlp layers.20.mlp layers.21.mlp layers.22.mlp layers.23.mlp layers.24.mlp layers.25.mlp layers.26.mlp layers.27.mlp layers.28.mlp layers.29.mlp --name $NAME --batch_size 16

Alternatively, you can download the cache from Huggingface:

cd results && git clone https://huggingface.co/nev/SmolLM2-CLT-135M-73k-k32-cache --depth=1

Web UI

We developed a gradio visualization for running attribution:

uv run gradio serve.py

Colab (WIP)

The web UI can be run from a Colab notebook.

Run in CLI

SCAN="smollm-v1"
PROMPT_NAME="smollm-v1"
PROMPT_TEXT="Michael Jordan plays the sport of"
uv run python -m attribute \
--cache_path=results/$SCAN/latents \
--name $PROMPT_NAME \
--scan "$SCAN" \
"$PROMPT_TEXT" \
--transcoder_path="nev/SmolLM2-CLT-135M-73k-k32" --model_name="HuggingFaceTB/SmolLM2-135M" \
uv run python -m attribute
cd attribution-graphs-frontend
uv run python serve.py 9999
ngrok http 9999

TODOs

  • Cross-verify with Anthropic implementation

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 44.1%
  • JavaScript 41.0%
  • CSS 8.2%
  • Shell 3.3%
  • HTML 2.4%
  • Jupyter Notebook 1.0%