RQAE

RQAE (Residual Quantization AutoEncoder) is a model architecture for interpreting LLMs. You can find more details here.

This repository consists of three main parts:

rqae/: The base model code, which is portable and can be used in your own project.
  a. model.py: The RQAE model definition.
  b. feature.py: A single feature definition (which model it came from, its top activations and the intensities of those activations, etc.).
  c. llm.py: Adapts a transformers LLM for use with interpretability methods, e.g. stopping early at a given layer and adding forward hooks to save activations. Currently it is only defined for Gemma2 models, but it is straightforward to extend: you only need to change the norm and denorm functions.
  d. gemmascope.py: Gemmascope model definition.
server/: Code for the server and frontend of the demo. The server is hosted on Modal, so much of the code is written specifically for Modal.
scripts/: Scripts to prepare your own dataset for use in the demo code. Also includes scripts to run evals on features.
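
To illustrate the idea behind llm.py (saving activations with a forward hook and stopping early at a chosen layer), here is a minimal sketch in plain PyTorch. It is not the code from this repository: the names `StopForward` and `capture_activations` are hypothetical, and a small `nn.Sequential` stands in for a real transformers model. The norm and denorm functions mentioned above are not covered.

```python
import torch
from torch import nn


class StopForward(Exception):
    """Hypothetical signal raised in a hook to skip all later layers."""


def capture_activations(model: nn.Module, layer: nn.Module, inputs: torch.Tensor) -> torch.Tensor:
    """Run `model` on `inputs`, return the output of `layer`, and stop there."""
    saved = {}

    def hook(module, args, output):
        # Save a detached copy of this layer's output, then abort the forward pass.
        saved["acts"] = output.detach()
        raise StopForward

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(inputs)
    except StopForward:
        pass  # expected: the layers after `layer` were never executed
    finally:
        handle.remove()  # always leave the model without the hook attached
    return saved["acts"]


# Toy stand-in for an LLM (assumption): a small stack of linear layers.
torch.manual_seed(0)
toy_model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4)
)
x = torch.randn(3, 8)
acts = capture_activations(toy_model, toy_model[2], x)
print(acts.shape)
```

With a real model, `layer` would be one of the transformer blocks; which attribute holds those blocks depends on the model class, so check it before adapting this sketch.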
