bench: add tail-latency benchmark for encode_ordinary_batch #548
Open
alobroke wants to merge 3 commits into
Adds scripts/benchmark_tail_latency.py to measure median and worst-of-N wall-clock time for encode_ordinary_batch across multiple synthetic corpora and thread counts. Motivated by issue openai#530, which reported 1.1x-7.6x tail spikes on a 32-core box. The existing benchmark.py only measures throughput; this script surfaces the worst-of-N latency that throughput numbers hide.

Features:
- Four synthetic corpora (english prose, python source, multilingual+emoji, random ascii)
- Configurable runs, batch size, encoding, and thread counts via CLI flags
- Outputs a table with median ms, worst ms, and worst/median ratio
- No external data files required; corpora are generated at runtime

Usage:
    python scripts/benchmark_tail_latency.py
    python scripts/benchmark_tail_latency.py --runs 10 --batch-size 256 --encoding o200k_base --threads 1,4,8
Motivated by #530, which reported worst-of-10 tail spikes of 1.1×–7.6×
over median on encode_ordinary_batch. The issue author offered to PR a
benchmark harness — this is that harness.
Problem
The existing scripts/benchmark.py measures throughput only (bytes/sec).
Throughput numbers hide tail latency completely: a single run that takes
7× longer than the median barely moves an average taken over many runs.
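To make the point concrete, here is a small worked example. The timings and the batch size in bytes are hypothetical values chosen for illustration; they are not measurements from this PR.

```python
import statistics

# Hypothetical timings in ms: 99 ordinary runs plus one slow outlier.
timings_ms = [240.0] * 99 + [980.0]
batch_bytes = 1_000_000  # assumed bytes per batch, illustrative only


def aggregate_throughput(timings, nbytes):
    """Bytes per second across all runs combined."""
    return nbytes * len(timings) / (sum(timings) / 1000.0)


baseline = aggregate_throughput([240.0] * 100, batch_bytes)
with_spike = aggregate_throughput(timings_ms, batch_bytes)

relative_throughput = with_spike / baseline
worst_over_median = max(timings_ms) / statistics.median(timings_ms)

print(f"throughput retained: {relative_throughput:.1%}")  # 97.0%
print(f"worst/median: {worst_over_median:.1f}x")          # 4.1x
```

Under these assumed numbers, aggregate throughput drops by about 3% while the slowest run is roughly 4x the median, which is the gap a throughput-only benchmark cannot show.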
What this adds
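scripts/benchmark_tail_latency.py: a self-contained tail-latency harness that:
- generates four synthetic corpora at runtime (english prose, python source, multilingual+emoji, random ascii), so no external data files are needed
- is configurable via --runs, --batch-size, --encoding, --threads
- prints a table of median ms, worst ms, and worst/median ratio for each corpus and thread count
Example output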
encoding: o200k_base | batch_size: 64 | runs: 10
── num_threads=8 ──────────────────────────────────────
corpus              tokens/batch  median ms  worst ms  worst/med
english prose          2,560,000        240       980       4.1x
python source          4,480,000        580       950       1.6x
multilingual+emoji     5,120,000       1020      2100       2.1x
random ascii           7,680,000        680       780       1.1x
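The median, worst, and worst/med columns could be computed along the following lines. This is a minimal sketch, not the code in scripts/benchmark_tail_latency.py: the helper name measure_tail and its return shape are hypothetical, and it times any zero-argument callable so that it runs without tiktoken installed.

```python
import statistics
import time
from typing import Callable


def measure_tail(fn: Callable[[], object], runs: int = 10) -> dict:
    """Call fn() `runs` times and summarise wall-clock time in ms.

    Hypothetical helper: returns the median, the worst (slowest) run,
    and the worst/median ratio.
    """
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    median = statistics.median(samples_ms)
    worst = max(samples_ms)
    ratio = worst / median if median > 0 else float("inf")
    return {"median_ms": median, "worst_ms": worst, "ratio": ratio}


# Stand-in workload so the sketch runs anywhere.
result = measure_tail(lambda: sum(range(100_000)), runs=10)
print(f"median {result['median_ms']:.2f} ms, "
      f"worst {result['worst_ms']:.2f} ms, "
      f"worst/med {result['ratio']:.1f}x")

# With tiktoken installed, the workload would presumably look like:
#   enc = tiktoken.get_encoding("o200k_base")
#   measure_tail(lambda: enc.encode_ordinary_batch(batch, num_threads=8))
# where `batch` is a list of strings (an assumed variable).
```

time.perf_counter is used because it is a monotonic, high-resolution clock suited to interval timing; the worst run is kept as a raw maximum rather than a percentile because the PR reports worst-of-N.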
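Usage
    python scripts/benchmark_tail_latency.py
    python scripts/benchmark_tail_latency.py --runs 10 --batch-size 256 --encoding o200k_base --threads 1,4,8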
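For reviewers who want to see how the four flags might be wired up, here is a rough argparse sketch. The flag names come from this PR; the defaults, the helper names parse_threads and build_parser, and the validation rules are assumptions and may differ from the actual script.

```python
import argparse


def parse_threads(value: str) -> list[int]:
    """Parse a comma-separated list such as '1,4,8' into [1, 4, 8]."""
    threads = [int(part) for part in value.split(",") if part.strip()]
    if not threads or any(t < 1 for t in threads):
        raise argparse.ArgumentTypeError(f"invalid thread list: {value!r}")
    return threads


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Tail-latency benchmark (sketch)")
    # Defaults below are assumptions, not taken from the PR.
    parser.add_argument("--runs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--encoding", default="o200k_base")
    parser.add_argument("--threads", type=parse_threads, default=[8])
    return parser


# Parse the second invocation shown in the PR description.
args = build_parser().parse_args(
    ["--runs", "10", "--batch-size", "256",
     "--encoding", "o200k_base", "--threads", "1,4,8"]
)
print(args.runs, args.batch_size, args.encoding, args.threads)
```

Parsing --threads with a custom type function keeps the comma-separated form from the usage line while handing the rest of the script a plain list of integers.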
Fixes #530 (benchmark harness portion).