Skip to content

Conversation

@harshaljanjani
Copy link
Contributor

Re-raised with custom tokenizer and chat-template support after repo mishaps over the past few days.

This PR adds support for Microsoft’s Phi-3 model ++ improves the robustness of the test suite for models that do not support embeddings. Phi-3-Mini-4K-Instruct is a 3.8B parameter model and comes in 4K and 128K context-length variants. Turns out FP16 works well for Phi-3, but performance degrades sharply with INT8; results are shared below.

  1. Phi-3 Integration -- Support for Phi-3's architecture. Updated convert_hf.py to handle these weight configurations during conversion.
  2. Test Suite -- Fixed the embeddings test, stopping segfaults when testing incompatible larger model types.

Testing Environment (For Repro):

  • GCP VM: t2a-standard-16 (16 vCPU ARM64, 64GB RAM), Debian 12 ARM, Ampere Altra.

Results

Metric INT8 FP16 INT8
TTFT 0.47s 4.70s 10x faster
Prefill 59.1 tok/s 6.0 tok/s 10x faster
Decode 16.4 tok/s 1.7 tok/s 10x faster
RAM 3.6 GB 7.3 GB 50% less
Quality Inconsistent Reliable -

Phi-3-mini-4k-instruct (INT8)

╔════════════════════════════════╗
║    Running Engine Tests        ║
╚════════════════════════════════╝
╔════════════════════════════════╗
║   STREAMING & FOLLOW-UP TEST   ║
╚════════════════════════════════╝
[Turn 1]
User: My name is Henry Ndubuaku, how are you?
Assistant: Hello Henry, it's lovely to assist you. May I know how can I help?
[Results - Turn 1]
├─ TTFT: 0.47 sec
├─ Prefill: 59.1 toks/sec
├─ Decode: 16.4 toks/sec
└─ RAM: 3628.1 MB
[Turn 2]
User: What is my name?
Assistant: Your name is Ndubuaku, but some people might call me Microsoft Assistant for assistance purposes. How can I assist you further?
[Results - Turn 2]
├─ TTFT: 0.72 sec
├─ Prefill: 78.8 toks/sec
├─ Decode: 13.8 toks/sec
└─ RAM: 3710.9 MB
✓ PASS │ streaming
╔════════════════════════════════╗
║       100 CONTEXT TEST         ║
╚════════════════════════════════╝
Response: This table appears to be presenting a series of numerical values which appear random without any discernible pattern or formula that relates one value with another directly through mathematical operations (such as addition/multiplication). Thus:   | x      | y                              ||-------|-------------------------||0   43       217           1986        25    3      4          ------! In order for these numbers to make sense, they should follow some relationship between 'x [→ stopped]
[Results]
├─ TTFT: 2.25 sec
├─ Prefill: 105.0 toks/sec
├─ Decode: 10.6 toks/sec
└─ RAM: 3975.8 MB
└─ Status: PASSED ✓
✓ PASS │ 100_context
╔════════════════════════════════╗
║      1K CONTEXT TEST           ║
╚════════════════════════════════╝
Response: From examining Dataset C's daily total precipitation measurements in inches over a month period starting from the first day labeled "Data 0.5," it is evident an increasing trend where each subsequent increment results not only by whole multiples of ten but also shows fluctcy within individual increases and overall growth rates that are slightly outpacing what one would expect purely on linearity progression due solely to multiplicative escalation. Initial increments suggest relatively stable clim [→ stopped]
[Results]
├─ TTFT: 16.11 sec
├─ Prefill: 75.3 toks/sec
├─ Decode: 7.9 toks/sec
└─ RAM: 5596.1 MB
└─ Status: PASSED ✓
✓ PASS │ 1k_context

Phi-3-mini-4k-instruct (FP16)

╔════════════════════════════════╗
║    Running Engine Tests        ║
╚════════════════════════════════╝
╔════════════════════════════════╗
║   STREAMING & FOLLOW-UP TEST   ║
╚════════════════════════════════╝
[Turn 1]
User: My name is Henry Ndubuaku, how are you?
Assistant: I'm doing well. Thank you for asking, Henry Ndubuaku! How are the circumstances surrounding your question?
[Results - Turn 1]
├─ TTFT: 4.70 sec
├─ Prefill: 6.0 toks/sec
├─ Decode: 1.7 toks/sec
└─ RAM: 7267.1 MB
[Turn 2]
User: What is my name?
Assistant: Your name is Henry Ndubuaku. Well-identified once again!
[Results - Turn 2]
├─ TTFT: 9.70 sec
├─ Prefill: 6.4 toks/sec
├─ Decode: 1.7 toks/sec
└─ RAM: 7283.9 MB
✓ PASS │ streaming
╔════════════════════════════════╗
║       100 CONTEXT TEST         ║
╚════════════════════════════════╝
Response: The given data represents an increasing sequence of floating-point values that correspond to 1/4π intervals from Data 0 (wherein each value corresponds to a quarter interval along a unit circle), starting at zero and approaching one without ever actually reaching it due its representation as limited decimal places up until four digits:<0x0A><0x0A>Data 0 = 'zero'. This suggests no angle measurement has occurred yet - think about initial stage or reference point in trigonometry terms within our circular context i.e [→ stopped]
[Results]
├─ TTFT: 35.78 sec
├─ Prefill: 6.6 toks/sec
├─ Decode: 1.6 toks/sec
└─ RAM: 7584.1 MB
└─ Status: PASSED ✓
✓ PASS │ 100_context

cc: @HenryNdubuaku

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant