
This fork contains the solution attempt by Marius and me during the Apart Alignment Jam; see here for our report.

Write-up 1

Write-up 2

Mechanistic Interpretability Challenge

Challenge 1, MNIST CNN:

Use mechanistic interpretability tools to reverse engineer an MNIST CNN and send me a program for the labeling function it was trained on.

Hint 1: The labels are binary.

Hint 2: The network gets 95.58% accuracy on the test set.

Hint 3: The labeling function can be described in words in one sentence.

Hint 4: This image may be helpful.

[Image: MNIST example]

MNIST CNN challenge: Colab notebook
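For reference, a submitted "program for the labeling function" can be very small. Below is a minimal hypothetical sketch: the top-half-ink rule is a placeholder I made up, not the answer, and recovering the real one-sentence rule from the trained CNN is exactly what the challenge asks for.

```python
import numpy as np

def label(image: np.ndarray) -> int:
    """Hypothetical labeling function for a 28x28 MNIST image.

    The rule below is a placeholder used only to illustrate the
    expected shape of a submission; the actual rule must be
    reverse engineered from the CNN.
    """
    # Placeholder rule: label 1 if more ink is in the top half.
    top_ink = image[:14, :].sum()
    bottom_ink = image[14:, :].sum()
    return int(top_ink > bottom_ink)
```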

Challenge 2, Transformer:

Use mechanistic interpretability tools to reverse engineer a transformer and send me a program for the labeling function it was trained on.

Hint 1: The labels are binary.

Hint 2: The network is trained on 50% of the examples and gets 97.27% accuracy on the test half.

Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary...

[Image: ground truth vs. learned labels]

Transformer challenge: Colab notebook
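A Challenge 2 submission could take the same shape. The sketch below assumes the transformer's inputs are 2D coordinates, which the decision-boundary plot above suggests; the circular boundary is a made-up placeholder, not the real rule.

```python
def label(x: float, y: float) -> int:
    """Hypothetical labeling function for an input point (x, y).

    The curvy boundary below is a placeholder; the true rule is
    what must be reverse engineered from the transformer.
    """
    # Placeholder rule: label 1 inside a disc of radius 50.
    return int(x ** 2 + y ** 2 < 50.0 ** 2)
```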

Rewards:

If you send me code for one of the two labeling functions along with a justified mechanistic interpretability explanation for it (e.g. in the form of a Colab notebook), the prize is a $750 donation to a high-impact charity of your choice. So the total prize pool is $1,500 for both challenges. Thanks to Neel Nanda for contributing $500!