This repository contains the code and experiment configurations for reproducing the results in Tracking Universal Features Through Fine-Tuning and Model Merging.
In the paper, we study the following toy models:
- BabyPython: a single-layer Mistral model trained on a mixture of BabyLM and the Python subset of The Stack.
- TinyStories: BabyPython fine-tuned on TinyStories.
- Lua: BabyPython fine-tuned on the Lua subset of The Stack.
- LuaStories-merge: spherical linear interpolation (slerp) of the Lua and TinyStories models at t = 0.58 (see the sketch after this list).
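For reference, spherical linear interpolation of weight tensors can be sketched in PyTorch as below. This is a generic illustration, not the repository's interpolation code; the `slerp` and `slerp_state_dicts` helpers and the per-tensor flattening are assumptions made for the example.

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors (illustrative)."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    # Angle between the two (normalized) weight vectors.
    cos_omega = torch.clamp(torch.dot(a / (a.norm() + eps), b / (b.norm() + eps)), -1.0, 1.0)
    omega = torch.acos(cos_omega)
    if omega.abs() < eps:
        # Nearly parallel weights: fall back to linear interpolation.
        return (1 - t) * w_a + t * w_b
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return out.reshape(w_a.shape).to(w_a.dtype)

def slerp_state_dicts(sd_a: dict, sd_b: dict, t: float = 0.58) -> dict:
    """Interpolate every matching parameter of two state dicts at the same t."""
    return {name: slerp(sd_a[name], sd_b[name], t) for name in sd_a}
```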
Conda 🐍
conda env create -f conda.yaml
conda activate feature-dynamics
Dependencies 📦
pip install pipx
pipx install poetry
poetry install
This repository contains all of the components built while iterating on the research and the paper.
Train decoder-only models.
poetry run python training/transformer/train.py <experiment.toml>
Train sparse autoencoder.
poetry run python training/autoencoder/train.py <experiment.toml>
Sparse autoencoders are trained on activations from TransformerLens's HookedTransformer (specifically via this fork, which adds Mistral hooks).
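For orientation, a sparse autoencoder of the usual form (linear encoder, ReLU, linear decoder, reconstruction loss plus an L1 sparsity penalty on the feature activations) can be sketched as follows. This is a minimal illustration and does not reproduce the repository's architecture, sizes, or hyperparameters.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations (illustrative only)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def autoencoder_loss(reconstruction, features, activations, l1_coefficient: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    mse = nn.functional.mse_loss(reconstruction, activations)
    sparsity = l1_coefficient * features.abs().mean()
    return mse + sparsity
```

Training then amounts to streaming activations from a chosen hook point of the HookedTransformer and minimizing this loss.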
Evaluation of pretrained autoencoders.
This module provides functionality for running target models with autoencoder reconstructions substituted for the original activations, using a forward-pass hook.
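Concretely, this kind of substitution can be sketched with TransformerLens hooks: a forward hook on the chosen activation returns the autoencoder's reconstruction, and the rest of the forward pass then runs on the reconstructed activations. The model name and hook point below are placeholders, and the autoencoder reuses the illustrative class from the sketch above; this is not the module's actual code.

```python
import torch
from transformer_lens import HookedTransformer

# Placeholder model and hook point; the repository's checkpoints and hook names differ.
model = HookedTransformer.from_pretrained("gpt2")
hook_name = "blocks.0.hook_resid_post"

# Reuses the illustrative (untrained) SparseAutoencoder sketched above.
autoencoder = SparseAutoencoder(model.cfg.d_model, 8 * model.cfg.d_model)

def substitute_reconstruction(activation: torch.Tensor, hook) -> torch.Tensor:
    # Returning a tensor from a forward hook replaces the original activation.
    reconstruction, _features = autoencoder(activation)
    return reconstruction

tokens = model.to_tokens("def greet():\n    print('hello')")
logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(hook_name, substitute_reconstruction)],
)
```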
Interpolate model weights using Mergekit.
poetry run python interpolation/interpolate.py <experiment.toml>
Merge models using Mergekit.
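Mergekit can also be driven from Python. The sketch below assumes the `MergeConfiguration` / `run_merge` interface shown in mergekit's documentation (which may vary between versions) and uses placeholder model paths; beyond the slerp method and t = 0.58, it is not the configuration used in the paper.

```python
# Hedged sketch of a programmatic mergekit slerp merge; paths are placeholders.
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

config = MergeConfiguration.model_validate(
    {
        "merge_method": "slerp",
        "base_model": "models/lua",  # placeholder path
        "slices": [
            {
                "sources": [
                    {"model": "models/lua", "layer_range": [0, 1]},
                    {"model": "models/tinystories", "layer_range": [0, 1]},
                ]
            }
        ],
        "parameters": {"t": 0.58},
        "dtype": "float32",
    }
)

run_merge(config, "models/luastories-merge", options=MergeOptions())
```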