└── prompt := resp.RenderedChats[0]
└── Continue with existing pipeline: Tokenize → KV Block Keys → Pod Scoring
```

### Optimized Preprocessing Architecture

#### **Performance Optimizations**

##### **Single Python Interpreter**
- **Process-Level Initialization**: A single Python interpreter per process, initialized once at EPP startup; this scales well, keeps per-request overhead low, and reduces the memory footprint
- **Thread-Safe Initialization**: Global locks prevent multiple initializations (see the sketch after this list)
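The one-time, thread-safe initialization described above maps naturally onto Go's `sync.Once`. The following is a minimal sketch of that pattern only; `pyembed`, `EnsureInterpreter`, and `startInterpreter` are hypothetical names, not this repository's actual API.

```go
// Hypothetical sketch: one sync.Once per process guarantees the embedded
// interpreter is initialized exactly once, even when many goroutines race
// at startup.
package pyembed

import "sync"

var (
	interpOnce sync.Once
	interpErr  error
)

// EnsureInterpreter starts the embedded Python interpreter on the first call
// and returns the memoized result to every later caller; concurrent callers
// block until the first initialization finishes.
func EnsureInterpreter() error {
	interpOnce.Do(func() {
		interpErr = startInterpreter()
	})
	return interpErr
}

// startInterpreter stands in for the real cgo bootstrap of CPython
// (e.g. Py_InitializeEx); stubbed here so the sketch is self-contained.
func startInterpreter() error { return nil }
```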
##### **Function Caching**

- **Cached Python Functions**: `render_jinja_template` and `get_model_chat_template` are cached globally (see the sketch after this list)
- **Module-Level Caching**: Python modules are imported once and reused
- **Thread Safety**: GIL management guards concurrent access to the interpreter
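One way to realize this is to resolve the two Python callables once, under the GIL, and reuse the handles for the life of the process. In this sketch, `pyCallable`, `withGIL`, and `lookupCallable` are illustrative stand-ins for cgo-backed helpers, not real APIs from this repository.

```go
// Hypothetical sketch of the function-cache pattern: import the templating
// module on first use, cache its two entry points, and reuse them afterwards.
package pyembed

import "sync"

// pyCallable is a stand-in for an opaque CPython function pointer.
type pyCallable uintptr

var (
	fnOnce            sync.Once
	renderJinjaFn     pyCallable // cached handle to render_jinja_template
	getChatTemplateFn pyCallable // cached handle to get_model_chat_template
)

// cachedFunctions resolves both callables exactly once and returns the same
// handles to every later caller.
func cachedFunctions() (render, getTemplate pyCallable) {
	fnOnce.Do(func() {
		// Hold the GIL while touching interpreter state.
		withGIL(func() {
			renderJinjaFn = lookupCallable("render_jinja_template")
			getChatTemplateFn = lookupCallable("get_model_chat_template")
		})
	})
	return renderJinjaFn, getChatTemplateFn
}

// withGIL runs f while holding the Python GIL (stubbed for the sketch).
func withGIL(f func()) { f() }

// lookupCallable resolves a function from the imported module (stubbed).
func lookupCallable(name string) pyCallable { _ = name; return 0 }
```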
##### **Template Caching**

- **Model-Specific Templates**: Templates are cached per model to avoid repeated fetching (see the sketch after this list)
- **Hugging Face Integration**: Efficient template retrieval using AutoTokenizer, matching vLLM's template resolution
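A per-model cache can be as simple as a process-wide concurrent map, so the template fetch (via `get_model_chat_template`, which uses Hugging Face's AutoTokenizer on the Python side) runs at most about once per model. This is a minimal sketch; `fetchChatTemplate` is an illustrative stand-in, not a real API from this repository.

```go
// Hypothetical sketch: cache each model's chat template after the first fetch.
package pyembed

import "sync"

// templateCache maps a model name to its chat template source.
var templateCache sync.Map

// ChatTemplate returns the cached template for model, fetching it on a miss.
func ChatTemplate(model string) (string, error) {
	if tmpl, ok := templateCache.Load(model); ok {
		return tmpl.(string), nil
	}
	tmpl, err := fetchChatTemplate(model)
	if err != nil {
		return "", err
	}
	// If two goroutines raced on the same miss, keep the first stored value.
	actual, _ := templateCache.LoadOrStore(model, tmpl)
	return actual.(string), nil
}

// fetchChatTemplate stands in for the call into the embedded interpreter.
func fetchChatTemplate(model string) (string, error) {
	return "template for " + model, nil
}
```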
## Experiment Overview & Results

### Benchmark Configuration
- **Dataset**: ShareGPT conversations of variable length
- **Model**: 2 pods of Qwen/Qwen2.5-0.5B-Instruct
- **Load Pattern**: Progressive load from 3 to 20 QPS (3→4→5→6→8→10→12→15→20)
- **Duration**: ~18 minutes total, with progressive load increases
- **Input Distribution**: 600-800 tokens per request
- **Output Distribution**: 1-100 tokens per request
- **API Comparison**: Chat Completions vs. Completions (head-to-head)
- **Success Rate**: 100% for both APIs across all load levels