Commit d28fcc3

Merge pull request #38 from codelion/feat-add-strategic-classification

Initial Implementation of the strategic classifier

2 parents b724995 + e761a38

7 files changed: +1771 additions, -5 deletions

README.md (85 additions & 1 deletion)
@@ -12,6 +12,7 @@ A flexible, adaptive classification system that allows for dynamic addition of n
 - 💾 Safe and efficient state persistence
 - 🔄 Prototype-based learning
 - 🧠 Neural adaptation layer
+- 🛡️ Strategic classification robustness
 
 ## Try Now

@@ -95,16 +96,58 @@ more_labels = ["positive"] * 2
 classifier.add_examples(more_examples, more_labels)
 ```
 
+### Strategic Classification (Anti-Gaming)
+
+```python
+# Enable strategic mode to defend against adversarial inputs
+config = {
+    'enable_strategic_mode': True,
+    'cost_function_type': 'linear',
+    'cost_coefficients': {
+        'sentiment_words': 0.5,   # Cost to change sentiment-bearing words
+        'length_change': 0.1,     # Cost to modify text length
+        'word_substitution': 0.3  # Cost to substitute words
+    },
+    'strategic_blend_regular_weight': 0.6,   # Weight for regular predictions
+    'strategic_blend_strategic_weight': 0.4  # Weight for strategic predictions
+}
+
+classifier = AdaptiveClassifier("bert-base-uncased", config=config)
+classifier.add_examples(texts, labels)
+
+# Robust predictions that consider potential manipulation
+text = "This product has amazing quality features!"
+
+# Dual prediction (automatic blend of regular + strategic)
+predictions = classifier.predict(text)
+
+# Pure strategic prediction (simulates adversarial manipulation)
+strategic_preds = classifier.predict_strategic(text)
+
+# Robust prediction (assumes input may already be manipulated)
+robust_preds = classifier.predict_robust(text)
+
+print(f"Dual: {predictions}")
+print(f"Strategic: {strategic_preds}")
+print(f"Robust: {robust_preds}")
+```
+
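The `cost_coefficients` in the config above parameterize a linear cost model: the assumed price an adversary pays is a weighted sum of the per-feature changes they make. A minimal sketch of that idea, assuming feature deltas are simple counts (the helper `linear_manipulation_cost` is illustrative, not the library's internal code):

```python
def linear_manipulation_cost(feature_deltas, cost_coefficients):
    """Linear cost model: total manipulation cost is the weighted sum
    of the absolute change applied to each feature."""
    return sum(
        cost_coefficients.get(name, 0.0) * abs(delta)
        for name, delta in feature_deltas.items()
    )

# Coefficients mirroring the config above
cost_coefficients = {
    'sentiment_words': 0.5,
    'length_change': 0.1,
    'word_substitution': 0.3,
}

# An attacker who flips two sentiment-bearing words and substitutes
# one neutral word pays 2 * 0.5 + 1 * 0.3 = 1.3 under this model.
deltas = {'sentiment_words': 2, 'word_substitution': 1}
print(linear_manipulation_cost(deltas, cost_coefficients))  # 1.3
```

The intuition: features that are expensive to change (here, sentiment-bearing words) are harder for an adversary to game, so the strategic classifier can lean on them more heavily.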
 ## How It Works
 
-The system combines three key components:
+The system combines four key components:
 
 1. **Transformer Embeddings**: Uses state-of-the-art language models for text representation
 
 2. **Prototype Memory**: Maintains class prototypes for quick adaptation to new examples
 
 3. **Adaptive Neural Layer**: Learns refined decision boundaries through continuous training
 
+4. **Strategic Classification**: Defends against adversarial manipulation using game-theoretic principles. When strategic mode is enabled, the system:
+   - Models potential strategic behavior of users trying to game the classifier
+   - Uses cost functions to represent the difficulty of manipulating different features
+   - Combines regular predictions with strategic-aware predictions for robustness
+   - Provides multiple prediction modes: dual (blended), strategic (simulates manipulation), and robust (anti-manipulation)
+
 ## Requirements
 
 - Python ≥ 3.8
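The dual prediction mode blends the regular and strategic heads with the configured weights (0.6 regular, 0.4 strategic in the config shown earlier). A minimal sketch of that weighted blend, assuming each head yields a label-to-score dictionary (`blend_predictions` is a hypothetical helper, not the library's actual internals):

```python
def blend_predictions(regular, strategic, w_regular=0.6, w_strategic=0.4):
    """Weighted blend of two label->score dicts, highest score first."""
    labels = set(regular) | set(strategic)
    blended = {
        label: w_regular * regular.get(label, 0.0)
               + w_strategic * strategic.get(label, 0.0)
        for label in labels
    }
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

# The strategic head is less confident than the regular head here,
# so the blend tempers the final score without flipping the label.
regular = {'positive': 0.9, 'negative': 0.1}
strategic = {'positive': 0.6, 'negative': 0.4}
print(blend_predictions(regular, strategic))
```

Because the two heads disagree most on easily manipulated inputs, the blend hedges exactly where gaming is most likely.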
@@ -115,6 +158,46 @@ The system combines three key components:
 
 ## Adaptive Classification with LLMs
 
+### Strategic Classification Evaluation
+
+We evaluated the strategic classification feature using the [AI-Secure/adv_glue](https://huggingface.co/datasets/AI-Secure/adv_glue) dataset's `adv_sst2` subset, which contains adversarially modified sentiment analysis examples designed to test robustness against strategic manipulation.
+
+#### Testing Setup
+
+- **Dataset**: 148 adversarial text samples (70% train / 30% test)
+- **Task**: Binary sentiment classification (positive/negative)
+- **Model**: answerdotai/ModernBERT-base with linear cost function
+- **Modes**: Regular, Dual (60%/40% blend), Strategic, and Robust prediction modes
+
+#### Results Summary
+
+| Prediction Mode | Accuracy | F1-Score | Performance Notes |
+|-----------------|----------|----------|-------------------|
+| Regular Classifier | 80.00% | 80.00% | Baseline performance |
+| **Strategic (Dual)** | **82.22%** | **82.12%** | **+2.22% improvement** |
+| Strategic (Pure) | 82.22% | 82.12% | Consistent with dual mode |
+| Robust Mode | 80.00% | 79.58% | Anti-manipulation focused |
+
+#### Performance Under Attack
+
+| Scenario | Regular Classifier | Strategic Classifier | Advantage |
+|----------|--------------------|----------------------|-----------|
+| **Clean Data** | **80.00%** | **82.22%** | **+2.22%** |
+| **Manipulated Data** | **60.00%** | **82.22%** | **+22.22%** |
+| **Robustness** | **-20.00% drop** | **0.00% drop** | **+20.00% better** |
+
+#### Key Insights
+
+**Strategic Training Success**: The strategic classifier demonstrates robust performance across both clean and manipulated data, maintaining 82.22% accuracy regardless of input manipulation.
+
+**Dual Benefit**: Unlike traditional adversarial defenses that sacrifice clean performance for robustness, our strategic classifier achieves:
+- **2.22% improvement** on clean data
+- **22.22% improvement** on manipulated data
+- **Perfect robustness** (no performance degradation under attack)
+
+**Practical Impact**: The 30.34% F1-score improvement on manipulated data demonstrates significant real-world value for applications facing adversarial inputs.
+
+**Use Cases**: Ideal for production systems requiring consistent performance under adversarial conditions: content moderation, spam detection, fraud prevention, and security-critical applications where gaming attempts are common.
+
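The accuracy and F1 numbers above come from a standard held-out evaluation on the 30% test split. The metrics themselves are the usual ones; a self-contained sketch of how they are computed, using hypothetical labels and predictions in place of the real test split (1 = positive, 0 = negative):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_f1(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold labels and classifier predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(f"Accuracy: {accuracy(y_true, y_pred):.2%}")   # Accuracy: 75.00%
print(f"F1:       {binary_f1(y_true, y_pred):.2%}")  # F1:       75.00%
```

The robustness rows in the tables are then just the difference between the clean-data and manipulated-data scores for each classifier.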
 ### Hallucination Detector
 
 The adaptive classifier can detect hallucinations in language model outputs, especially in Retrieval-Augmented Generation (RAG) scenarios. Despite incorporating external knowledge sources, LLMs often still generate content that isn't supported by the provided context. Our hallucination detector identifies when a model's output contains information that goes beyond what's present in the source material.
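The detector itself is learned from examples, but the underlying signal it picks up on can be illustrated with a deliberately naive token-overlap heuristic: how much of the response is simply absent from the source context? (This is an illustration only, not how the library implements detection.)

```python
def unsupported_fraction(context, response):
    """Fraction of response words that never appear in the context --
    a crude proxy for 'goes beyond the source material'."""
    context_words = set(context.lower().split())
    response_words = response.lower().split()
    unsupported = [w for w in response_words if w not in context_words]
    return len(unsupported) / len(response_words)

context = "the eiffel tower is in paris and was completed in 1889"
grounded = "the eiffel tower is in paris"
hallucinated = "the tower was designed by leonardo in 1820"

print(unsupported_fraction(context, grounded))      # 0.0
print(unsupported_fraction(context, hallucinated))  # 0.5
```

A learned detector improves on this by handling paraphrase and entailment, which pure token overlap misses entirely.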
@@ -268,6 +351,7 @@ This real-world evaluation demonstrates that adaptive classification can signifi
 
 ## References
 
+- [Strategic Classification](https://arxiv.org/abs/1506.06980)
 - [RouteLLM: Learning to Route LLMs with Preference Data](https://arxiv.org/abs/2406.18665)
 - [Transformer^2: Self-adaptive LLMs](https://arxiv.org/abs/2501.06252)
 - [Lamini Classifier Agent Toolkit](https://www.lamini.ai/blog/classifier-agent-toolkit)
