
Commit 68d56a3

Update README.md (#116)
* Update README.md: updated the pre-processing README.
* Add files via upload

Signed-off-by: guygir <[email protected]>
1 parent 2bb0a20 commit 68d56a3

File tree

2 files changed: +39 -0 lines changed

pkg/preprocessing/chat_completions/README.md

Lines changed: 39 additions & 0 deletions
@@ -91,3 +91,42 @@ The templating process (steps 1.1-1.4) handles the conversion from structured re
└── prompt := resp.RenderedChats[0]
└── Continue with existing pipeline: Tokenize → KV Block Keys → Pod Scoring
```

### Optimized Preprocessing Architecture

#### **Performance Optimizations**

##### **Single Python Interpreter**
- **Process-Level Initialization**: A single Python interpreter per process, initialized at EPP startup; this scales well, keeps per-request overhead low, and reduces the memory footprint
- **Thread-Safe Initialization**: Global locks prevent multiple initializations (see the sketch below)
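
A minimal sketch of this one-time, thread-safe setup using Go's `sync.Once`. The names here, including `initPython`, are illustrative stand-ins for the actual cgo bootstrap, not this repository's API:

```go
package chatcompletions

import "sync"

var (
	interpOnce sync.Once
	interpErr  error
)

// EnsureInterpreter boots the embedded Python interpreter exactly once per
// process, no matter how many goroutines race to call it; later callers get
// the result of the first attempt.
func EnsureInterpreter() error {
	interpOnce.Do(func() {
		interpErr = initPython()
	})
	return interpErr
}

// initPython stands in for the cgo bootstrap (e.g. Py_InitializeEx);
// stubbed here so the sketch compiles.
func initPython() error { return nil }
```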

##### **Function Caching**
- **Cached Python Functions**: `render_jinja_template` and `get_model_chat_template` cached globally (see the sketch below)
- **Module-Level Caching**: Python modules imported once and reused
- **Thread Safety**: GIL management for concurrent access
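
A sketch of the function-caching pattern under assumed names: `pyFunc` stands in for a handle to a Python callable (a `*C.PyObject` in real cgo code), and a plain mutex models GIL acquire/release:

```go
package chatcompletions

import "sync"

// pyFunc stands in for a cached handle to a Python callable
// (a *C.PyObject in the real cgo code).
type pyFunc struct{ name string }

var (
	loadOnce          sync.Once
	renderJinjaFn     *pyFunc
	getChatTemplateFn *pyFunc
	gil               sync.Mutex // models GIL acquire/release around each call
)

// loadFuncs resolves both callables exactly once; later requests reuse the
// cached handles and skip module import and attribute lookup entirely.
func loadFuncs() {
	loadOnce.Do(func() {
		renderJinjaFn = &pyFunc{name: "render_jinja_template"}
		getChatTemplateFn = &pyFunc{name: "get_model_chat_template"}
	})
}

// call invokes a cached callable while holding the lock, mirroring how the
// real code brackets every Python call with GIL state management.
func call(fn *pyFunc, args ...any) (any, error) {
	gil.Lock()
	defer gil.Unlock()
	_ = args // marshal args, invoke fn, unmarshal the result here
	return fn.name, nil
}
```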

##### **Template Caching**
- **Model-Specific Templates**: Templates cached per model to avoid repeated fetching (see the sketch below)
- **Hugging Face Integration**: Efficient template retrieval using AutoTokenizer, matching vLLM's behavior
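
A sketch of a per-model template cache; `templateCache` and `fetch` are assumed names, with `fetch` standing in for the AutoTokenizer-backed lookup that `get_model_chat_template` performs:

```go
package chatcompletions

import "sync"

// templateCache memoizes chat templates keyed by model name so each model's
// template is fetched from Hugging Face at most once per process.
type templateCache struct {
	mu      sync.RWMutex
	byModel map[string]string
}

func newTemplateCache() *templateCache {
	return &templateCache{byModel: make(map[string]string)}
}

// Get returns the cached template for model, fetching and storing it on a
// miss so repeated requests for the same model skip the remote lookup.
func (c *templateCache) Get(model string, fetch func(string) (string, error)) (string, error) {
	c.mu.RLock()
	tpl, ok := c.byModel[model]
	c.mu.RUnlock()
	if ok {
		return tpl, nil
	}

	tpl, err := fetch(model)
	if err != nil {
		return "", err
	}

	c.mu.Lock()
	c.byModel[model] = tpl // last writer wins if two goroutines raced on a miss
	c.mu.Unlock()
	return tpl, nil
}
```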
## Experiment Overview & Results

### Benchmark Configuration
- **Dataset**: ShareGPT conversations with variable length
- **Model**: 2 pods of Qwen/Qwen2.5-0.5B-Instruct
- **Load Pattern**: Progressive ramp from 3→4→5→6→8→10→12→15→20 QPS
- **Duration**: ~18 minutes total with progressive load increases
- **Input Distribution**: 600-800 tokens per request
- **Output Distribution**: 1-100 tokens per request
- **API Comparison**: Chat Completions vs Completions (head-to-head)
- **Success Rate**: 100% for both APIs across all load levels

### Performance Analysis

![Performance Analysis](TTFT_TPOT_THROUGHPUT_TRIPANEL.png)
#### **Overhead Analysis**
- **TTFT (Time to First Token)**: +10% increase (0.122s vs 0.111s) - **Negligible**
- **ITL (Inter-Token Latency)**: +14% increase (0.0032s vs 0.0028s) - **Negligible**
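
For reference, these percentages follow directly from the raw numbers: TTFT overhead = (0.122 − 0.111) / 0.111 ≈ 9.9%, rounded to +10%; ITL overhead = (0.0032 − 0.0028) / 0.0028 ≈ 14.3%, rounded to +14%.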
pkg/preprocessing/chat_completions/TTFT_TPOT_THROUGHPUT_TRIPANEL.png

129 KB (binary image file; not rendered)
