
This fork contains the solution attempt by Marius and me during the Apart Alignment Jam; see here for our report.

Write-up 1

Write-up 2

Mechanistic Interpretability Challenge

Challenge 1, MNIST CNN:

Use mechanistic interpretability tools to reverse engineer an MNIST CNN and send me a program for the labeling function it was trained on.

Hint 1: The labels are binary.

Hint 2: The network gets 95.58% accuracy on the test set.

Hint 3: The labeling function can be described in words in one sentence.

Hint 4: This image may be helpful.

[Image: MNIST example]

MNIST CNN challenge: Colab notebook
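For reference, a submitted "program for the labeling function" can be very small. Below is a minimal hypothetical sketch: the top-half-ink rule is a placeholder I made up, not the answer, and recovering the real one-sentence rule from the trained CNN is exactly what the challenge asks for.

```python
import numpy as np

def label(image: np.ndarray) -> int:
    """Hypothetical labeling function for a 28x28 MNIST image.

    The rule below is a placeholder used only to illustrate the
    expected shape of a submission; the actual rule must be
    reverse engineered from the CNN.
    """
    # Placeholder rule: label 1 if more ink is in the top half.
    top_ink = image[:14, :].sum()
    bottom_ink = image[14:, :].sum()
    return int(top_ink > bottom_ink)
```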

Challenge 2, Transformer:

Use mechanistic interpretability tools to reverse engineer a transformer and send me a program for the labeling function it was trained on.

Hint 1: The labels are binary.

Hint 2: The network is trained on 50% of the examples and gets 97.27% accuracy on the test half.

Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary...

[Image: ground truth vs. learned labels]

Transformer challenge: Colab notebook
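A Challenge 2 submission could take the same shape. The sketch below assumes the transformer's inputs are 2D coordinates, which the decision-boundary plot above suggests; the circular boundary is a made-up placeholder, not the real rule.

```python
def label(x: float, y: float) -> int:
    """Hypothetical labeling function for an input point (x, y).

    The curvy boundary below is a placeholder; the true rule is
    what must be reverse engineered from the transformer.
    """
    # Placeholder rule: label 1 inside a disc of radius 50.
    return int(x ** 2 + y ** 2 < 50.0 ** 2)
```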

Rewards:

If you send me code for one of the two labeling functions along with a justified mechanistic interpretability explanation for it (e.g. in the form of a Colab notebook), the prize is a $750 donation to a high-impact charity of your choice. So the total prize pool is $1,500 for both challenges. Thanks to Neel Nanda for contributing $500!