Title
Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference
Published Date
2025-01-22
Source
arXiv
Head Name
Evaluator Heads
Summary
Innovation: The paper introduces EHPC, a training-free prompt-compression method that uses specific attention heads, termed evaluator heads, to retain only the most significant tokens in long-context transformer inference, reducing computational cost while improving performance (a rough code sketch of the idea follows the summary).
Tasks: The study involves identifying evaluator heads in transformer-based LLMs through pilot experiments with synthetic data, applying these heads for prompt compression across benchmarks like LongBench and ZeroSCROLLS, and evaluating the method's efficiency in reducing API costs and accelerating long-context inference.
Significant Result: EHPC achieves state-of-the-art results on prompt-compression benchmarks, reducing API costs and memory usage while remaining competitive with key-value-cache-based methods, and improves direct inference performance by up to 40% on question-answering datasets.
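Below is a minimal sketch of the evaluator-head idea, not the authors' released implementation: score prompt tokens with the attention weights of one designated head, keep the highest-scoring tokens in their original order, and decode the shorter prompt. The model name, layer/head indices, and keep ratio are placeholders for illustration, not the evaluator heads identified in the paper.

```python
# Hedged sketch of evaluator-head-style prompt compression (assumptions noted below).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"           # placeholder model; the paper targets long-context LLMs
EVAL_LAYER, EVAL_HEAD = 8, 5  # hypothetical "evaluator head" location, not from the paper
KEEP_RATIO = 0.25             # fraction of prompt tokens to retain

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, attn_implementation="eager"  # eager attention so weights are returned
)
model.eval()

def compress_prompt(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    # out.attentions: one (batch, n_heads, seq_len, seq_len) tensor per layer
    attn = out.attentions[EVAL_LAYER][0, EVAL_HEAD]    # (seq_len, seq_len)
    # Score each token by the total attention it receives from all query positions.
    token_scores = attn.sum(dim=0)                     # (seq_len,)
    k = max(1, int(token_scores.shape[0] * KEEP_RATIO))
    keep = token_scores.topk(k).indices.sort().values  # restore original token order
    kept_ids = inputs["input_ids"][0, keep]
    return tokenizer.decode(kept_ids, skip_special_tokens=True)

if __name__ == "__main__":
    long_prompt = "(a long context document followed by a question)"
    print(compress_prompt(long_prompt))
```

The compressed prompt can then be sent to a downstream model or API in place of the full context; the paper's training-free claim corresponds to the fact that no weights are updated here, only attention scores from a frozen model are read out.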