Title
Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference
Published Date
2025-01-22
Source
arXiv
Head Name
Evaluator Heads
Summary
Innovation: The paper introduces EHPC, a training-free prompt-compression method that uses specific attention heads, termed evaluator heads, to retain only the most significant tokens in long-context transformer inference, reducing computational cost while improving performance (a rough code sketch of the idea follows the summary).
Tasks: The study involves identifying evaluator heads in transformer-based LLMs through pilot experiments with synthetic data, applying these heads for prompt compression across benchmarks like LongBench and ZeroSCROLLS, and evaluating the method's efficiency in reducing API costs and accelerating long-context inference.
Significant Result: EHPC achieves state-of-the-art results on prompt-compression benchmarks, reducing API costs and memory usage while remaining competitive with key-value-cache-based methods, and improves direct inference performance by up to 40% on question-answering datasets.
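Below is a minimal sketch of the evaluator-head idea, not the authors' released implementation: score prompt tokens with the attention weights of one designated head, keep the highest-scoring tokens in their original order, and decode the shorter prompt. The model name, layer/head indices, and keep ratio are placeholders for illustration, not the evaluator heads identified in the paper.

```python
# Hedged sketch of evaluator-head-style prompt compression (assumptions noted below).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"           # placeholder model; the paper targets long-context LLMs
EVAL_LAYER, EVAL_HEAD = 8, 5  # hypothetical "evaluator head" location, not from the paper
KEEP_RATIO = 0.25             # fraction of prompt tokens to retain

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, attn_implementation="eager"  # eager attention so weights are returned
)
model.eval()

def compress_prompt(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    # out.attentions: one (batch, n_heads, seq_len, seq_len) tensor per layer
    attn = out.attentions[EVAL_LAYER][0, EVAL_HEAD]    # (seq_len, seq_len)
    # Score each token by the total attention it receives from all query positions.
    token_scores = attn.sum(dim=0)                     # (seq_len,)
    k = max(1, int(token_scores.shape[0] * KEEP_RATIO))
    keep = token_scores.topk(k).indices.sort().values  # restore original token order
    kept_ids = inputs["input_ids"][0, keep]
    return tokenizer.decode(kept_ids, skip_special_tokens=True)

if __name__ == "__main__":
    long_prompt = "(a long context document followed by a question)"
    print(compress_prompt(long_prompt))
```

The compressed prompt can then be sent to a downstream model or API in place of the full context; the paper's training-free claim corresponds to the fact that no weights are updated here, only attention scores from a frozen model are read out.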