Intervention-Jailbreak

For the full article, see: https://logits.substack.com/p/jailbreak-and-intervention-chronicles

This project investigates the detection and mitigation of "jailbroken modes" in Large Language Models (LLMs). By analyzing internal activations in response to malicious prompts, the aim is to identify patterns indicative of jailbreak attempts and develop strategies to prevent models from complying with such prompts.

Project Structure

collecting_jailbreaking_prompts.py: Script for gathering prompts that may induce jailbreak behavior in LLMs.
extracting_activations.py: Extracts and records the model's activations across all layers when processing various prompts.
display_data.py: Utilities for visualizing and analyzing the collected activation data.
tsne_vis.py: Performs t-SNE visualization to identify clusters within the activation data.
with_intervention.py: Implements intervention strategies aimed at reducing the likelihood of the model responding to malicious prompts.
prompts/: Directory containing various prompts used during testing and analysis.
pca_layer_images/ & tsne_test_layer_images/: Directories storing visualizations of the model's activations, including PCA and t-SNE plots.

Getting Started

Follow these steps in order to replicate the experiments:

Clone the Repository:

git clone https://github.com/Luisibear98/intervention-jailbreak.git
cd intervention-jailbreak

Collect Jailbreaking Prompts:

Use the collecting_jailbreaking_prompts.py script to gather prompts that may induce jailbreak behavior.

Extract Model Activations:

Run extracting_activations.py to record the model's activations when processing the collected prompts.
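As an illustration of how per-layer activations can be recorded, here is a minimal sketch using PyTorch forward hooks. It is not the code in extracting_activations.py: the helper name record_layer_activations, the toy model, and the layer names are all assumptions made for the example, and a small nn.Sequential stands in for the LLM so the snippet runs without downloading weights.

```python
import torch
import torch.nn as nn


def record_layer_activations(model, layers, inputs):
    """Run one forward pass and return {layer_name: output tensor}.

    `layers` maps a (hypothetical) name to the submodule whose output
    should be captured.
    """
    captured = {}
    handles = []
    for name, layer in layers.items():
        # Bind `name` through a default argument so each hook writes to its own key.
        def hook(module, args, output, name=name):
            captured[name] = output.detach().clone()

        handles.append(layer.register_forward_hook(hook))
    try:
        with torch.no_grad():
            model(inputs)
    finally:
        # Always remove the hooks so later forward passes are unaffected.
        for handle in handles:
            handle.remove()
    return captured


# Toy stand-in for an LLM (assumption: the real script targets transformer blocks).
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
toy_layers = {"layer_0": toy_model[0], "layer_2": toy_model[2]}

batch = torch.randn(3, 8)  # 3 fake "prompts", 8 features each
activations = record_layer_activations(toy_model, toy_layers, batch)
for name, tensor in activations.items():
    print(name, tuple(tensor.shape))
```

With a real model, the same pattern would be applied to each transformer block, and the captured tensors would typically be reduced (for example, to the last-token hidden state) before being saved.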

Visualize Data:

Utilize display_data.py and tsne_vis.py to analyze and visualize the activation data.
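To show what the PCA plots involve, the sketch below projects a matrix of activations onto its first two principal components using NumPy only. The data is synthetic (two Gaussian clouds with an artificial offset), and the function name pca_project is an assumption for this example; it is not taken from the repository's scripts.

```python
import numpy as np


def pca_project(activations, n_components=2):
    """Project rows of `activations` onto their top principal components."""
    centered = activations - activations.mean(axis=0, keepdims=True)
    # Rows of `vt` are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Synthetic stand-ins for one layer's activations (not real model data).
rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=(50, 32))
jailbreak = rng.normal(0.0, 1.0, size=(50, 32))
jailbreak[:, 0] += 6.0  # artificial offset so the two groups separate

points = pca_project(np.vstack([benign, jailbreak]))
labels = np.array([0] * 50 + [1] * 50)  # 0 = benign, 1 = jailbreak
print(points.shape)
```

The resulting 2D points can be passed to any plotting library, colored by labels, to produce a scatter plot per layer.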

Implement Intervention:

Apply the strategies in with_intervention.py to mitigate the model's compliance with malicious prompts.
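One simple way to picture such an intervention is a mean-difference shift: move each jailbreak activation by the vector that separates the jailbreak centroid from the non-jailbreak centroid. The sketch below applies this to synthetic data. It is an assumption about the general idea, not a description of what with_intervention.py actually does, and the names mean_shift_intervention and strength are hypothetical.

```python
import numpy as np


def mean_shift_intervention(acts, jailbreak_mean, benign_mean, strength=1.0):
    """Shift activations along the direction from the jailbreak to the benign centroid."""
    return acts + strength * (benign_mean - jailbreak_mean)


# Synthetic activations for a single layer (not real model data).
rng = np.random.default_rng(1)
benign = rng.normal(0.0, 1.0, size=(40, 16))
jailbreak = rng.normal(0.0, 1.0, size=(40, 16)) + 4.0

benign_mean = benign.mean(axis=0)
jailbreak_mean = jailbreak.mean(axis=0)

corrected = mean_shift_intervention(jailbreak, jailbreak_mean, benign_mean)

# Distance between group centroids before and after the shift.
gap_before = np.linalg.norm(jailbreak_mean - benign_mean)
gap_after = np.linalg.norm(corrected.mean(axis=0) - benign_mean)
print(f"centroid gap before: {gap_before:.3f}, after: {gap_after:.3f}")
```

In a live model, a shift like this would typically be applied inside a forward hook at the chosen layer, with strength tuned so that benign prompts are not degraded.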

Example Plots

The following plot shows activations in earlier layers before intervention. There are no clear differences between the activations for jailbreak and non-jailbreak prompts.

However, deeper layers show clearer differences between the activations for jailbreak and non-jailbreak prompts.
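The visual impression of "more separation in deeper layers" can also be checked numerically. The sketch below computes a simple separation score per layer: the distance between the two group centroids divided by the average within-group spread. The score definition, the function name separation_score, and the per-layer offsets are assumptions chosen for illustration, and the data is synthetic rather than taken from the experiments.

```python
import numpy as np


def separation_score(benign, jailbreak):
    """Centroid distance divided by the mean per-feature standard deviation."""
    centroid_gap = np.linalg.norm(benign.mean(axis=0) - jailbreak.mean(axis=0))
    spread = 0.5 * (benign.std(axis=0).mean() + jailbreak.std(axis=0).mean())
    return centroid_gap / spread


# Synthetic "layers": a larger offset mimics a deeper layer (assumption).
rng = np.random.default_rng(2)
offsets = {0: 0.0, 1: 0.5, 2: 3.0}

scores = {}
for layer, offset in offsets.items():
    benign = rng.normal(size=(60, 24))
    jailbreak = rng.normal(size=(60, 24)) + offset
    scores[layer] = separation_score(benign, jailbreak)

for layer, score in scores.items():
    print(f"layer {layer}: separation score {score:.2f}")
```

A score like this could help choose which layer to intervene on, since layers where the groups barely separate offer little signal to act on.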

After intervention, the activations have been "corrected": the jailbroken prompts are forced to shift into the non-jailbroken region.
