🚀 | A custom RunPod Serverless endpoint template that uses LLMLingua with Microsoft Phi-2 to compress prompts in seconds.
- Navigate to RunPod Serverless RUNPOD.
- Create an endpoint.
- Enter `jlonge4/runpod-llmlingua:v3` as your image.
- Check out the test notebook for an example of sending a request to compress context (a Python sketch of the same request follows the example payload below).
- Send your compressed query to your LLM!
- Alternatively, deploy using my template here: TEMPLATE.
An example request payload:

```json
{
  "input": {
    "context": "[context]",
    "instruction": "You are a q/a bot who uses the provided context to answer a question",
    "question": "What's the purpose of the tutorial?",
    "target_tokens": 350
  }
}
```
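For reference, here is a minimal sketch of sending that payload from Python. It assumes RunPod's standard serverless `runsync` route; `ENDPOINT_ID` and `RUNPOD_API_KEY` are placeholders for your own values, and the shape of the handler's output is taken from the example response below.

```python
import requests

ENDPOINT_ID = "your-endpoint-id"        # placeholder: your serverless endpoint ID
RUNPOD_API_KEY = "your-runpod-api-key"  # placeholder: your RunPod API key

# RunPod serverless endpoints accept synchronous jobs at /runsync.
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
headers = {
    "Authorization": f"Bearer {RUNPOD_API_KEY}",
    "Content-Type": "application/json",
}

payload = {
    "input": {
        "context": "[context]",  # the long context you want compressed
        "instruction": "You are a q/a bot who uses the provided context to answer a question",
        "question": "What's the purpose of the tutorial?",
        "target_tokens": 350,
    }
}

response = requests.post(url, headers=headers, json=payload, timeout=120)
response.raise_for_status()
result = response.json()

# For a completed job, the handler's return value is under "output".
print(result["output"]["compressed_prompt"])
```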
An example response (wall time: 1.25 s):
```json
{
  "compressed_prompt": "You are a question answering bot who uses the provided\ncontext to answer a question\nIn this short will explore how Face be deployed in a\nDocker Container and a service...\nWhat's the purpose of the tutorial?",
  "compressed_tokens": 788,
  "origin_tokens": 2171,
  "ratio": "2.8x",
  "saving": "Saving $0.1 in GPT-4."
}
```
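To close the loop on the last step above ("Send your compressed query to your LLM!"), here is one possible way to forward the compressed prompt, sketched with the OpenAI Python client. The client and model name are just examples; any chat-completion API works, since `compressed_prompt` already bundles the instruction, compressed context, and question into a single string.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Paste the compressed_prompt from the endpoint's response here, or reuse
# result["output"]["compressed_prompt"] from the request sketch above.
compressed_prompt = (
    "You are a question answering bot who uses the provided\n"
    "context to answer a question\n"
    "...\n"
    "What's the purpose of the tutorial?"
)

completion = client.chat.completions.create(
    model="gpt-4",  # example model; use whichever LLM you compressed for
    messages=[{"role": "user", "content": compressed_prompt}],
)
print(completion.choices[0].message.content)
```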