
bench: add tail-latency benchmark for encode_ordinary_batch#548

Open
alobroke wants to merge 3 commits into
openai:mainfrom
alobroke:bench/tail-latency-benchmark

@alobroke

Motivated by #530, which reported worst-of-10 tail spikes of 1.1×–7.6×
over median on encode_ordinary_batch. The issue author offered to PR a
benchmark harness — this is that harness.

Problem

The existing scripts/benchmark.py measures throughput only (bytes/sec).
An aggregate throughput figure hides tail latency: a run that takes 7×
longer than the median is folded into the average, so the number cannot
tell an occasional stall apart from uniformly slower runs.
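To make the averaging effect concrete, here is a small worked example. The timings are made up for illustration (they are not measurements from #530): two sets of ten runs that spend the same total time, and so report the same aggregate throughput, while having very different tails.

```python
import statistics

# Hypothetical per-run wall-clock times in ms (illustrative, not measured).
steady_ms = [160.0] * 10            # every run equally slow
spiky_ms = [100.0] * 9 + [700.0]    # nine fast runs plus one 7x stall

BYTES_PER_RUN = 1_000_000  # assumed batch size in bytes, for illustration only


def throughput_mb_s(times_ms):
    """Aggregate throughput over all runs, as a throughput-only benchmark reports it."""
    total_seconds = sum(times_ms) / 1000
    return BYTES_PER_RUN * len(times_ms) / total_seconds / 1e6


def worst_over_median(times_ms):
    """Worst-of-N time divided by the median time."""
    return max(times_ms) / statistics.median(times_ms)


for name, times in (("steady", steady_ms), ("spiky", spiky_ms)):
    print(
        f"{name}: {throughput_mb_s(times):.2f} MB/s, "
        f"worst/median {worst_over_median(times):.1f}x"
    )
```

Both sets total 1600 ms, so the throughput figure is identical; only the worst/median ratio separates them.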

What this adds

scripts/benchmark_tail_latency.py — a self-contained tail-latency harness that:

  • Measures median and worst-of-N wall-clock time per corpus
  • Reports the worst/median ratio so tail spikes are immediately visible
  • Tests four synthetic corpora generated at runtime (no data files needed):
    • english prose
    • python source
    • multilingual + emoji
    • random ascii
  • Accepts CLI flags for --runs, --batch-size, --encoding, --threads
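The measurement described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `tail_stats` and the warmup pass are assumptions made for the example.

```python
import statistics
import time
from typing import Callable, Dict


def tail_stats(fn: Callable[[], object], runs: int = 10, warmup: int = 1) -> Dict[str, float]:
    """Call fn() `runs` times and summarize wall-clock time in milliseconds.

    Hypothetical helper: the warmup pass is an assumption, added so one-time
    setup costs do not count as a tail spike.
    """
    for _ in range(warmup):
        fn()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000)

    median_ms = statistics.median(samples_ms)
    worst_ms = max(samples_ms)
    # Guard against a zero median on extremely fast callables.
    ratio = worst_ms / median_ms if median_ms > 0 else float("inf")
    return {"median_ms": median_ms, "worst_ms": worst_ms, "ratio": ratio}
```

With tiktoken installed, the callable would presumably be something like `lambda: enc.encode_ordinary_batch(batch, num_threads=8)` where `enc = tiktoken.get_encoding("o200k_base")` and `batch` is a list of strings.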

Example output

encoding: o200k_base | batch_size: 64 | runs: 10
── num_threads=8 ──────────────────────────────────────
corpus               tokens/batch   median ms   worst ms   worst/med
english prose           2,560,000         240        980        4.1x
python source           4,480,000         580        950        1.6x
multilingual+emoji      5,120,000        1020       2100        2.1x
random ascii            7,680,000         680        780        1.1x

Usage

python scripts/benchmark_tail_latency.py
python scripts/benchmark_tail_latency.py --runs 10 --batch-size 256 --encoding o200k_base --threads 1,4,8
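For reviewers who want to see how these flags might be parsed, here is one possible sketch using argparse. The flag names come from this PR; the default values and the helper names `parse_threads` and `build_parser` are assumptions for illustration and may not match the script.

```python
import argparse
from typing import List


def parse_threads(value: str) -> List[int]:
    """Parse a comma-separated list such as "1,4,8" into [1, 4, 8]."""
    threads = [int(part) for part in value.split(",") if part.strip()]
    if not threads or any(t < 1 for t in threads):
        raise argparse.ArgumentTypeError("--threads needs positive integers, e.g. 1,4,8")
    return threads


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Tail-latency benchmark (sketch)")
    # Defaults below are assumed for this example, not taken from the PR.
    parser.add_argument("--runs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--encoding", default="o200k_base")
    parser.add_argument("--threads", type=parse_threads, default=[8])
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```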

Fixes #530 (benchmark harness portion).

alobroke added 2 commits May 21, 2026 03:21
Adds scripts/benchmark_tail_latency.py to measure median and worst-of-N
wall-clock time for encode_ordinary_batch across multiple synthetic corpora
and thread counts.

Motivated by issue openai#530, which reported 1.1x-7.6x tail spikes on a 32-core
box. The existing benchmark.py only measures throughput; this script surfaces
the worst-of-N latency that throughput numbers hide.

Features:
- Four synthetic corpora (english prose, python source, multilingual+emoji, random ascii)
- Configurable runs, batch size, encoding, and thread counts via CLI flags
- Outputs a table with median ms, worst ms, and worst/median ratio
- No external data files required — corpora are generated at runtime
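As an illustration of runtime corpus generation, the sketch below builds four corpora of the kinds listed above from a fixed seed. The function name `make_corpora`, the sample strings, and the document length are all hypothetical; the generators in the script itself may differ.

```python
import random
import string
from typing import Dict, List


def make_corpora(batch_size: int, doc_chars: int = 2000, seed: int = 0) -> Dict[str, List[str]]:
    """Build four synthetic corpora, each a list of `batch_size` documents.

    Hypothetical sketch: the seed text and sizes are illustrative only.
    """
    rng = random.Random(seed)  # fixed seed keeps runs comparable

    def repeat(unit: str) -> str:
        # Tile a short sample and cut it to exactly doc_chars characters.
        return (unit * (doc_chars // len(unit) + 1))[:doc_chars]

    def random_ascii() -> str:
        alphabet = string.ascii_letters + string.digits + string.punctuation + " "
        return "".join(rng.choices(alphabet, k=doc_chars))

    prose = "The quick brown fox jumps over the lazy dog. "
    code = "def add(a, b):\n    return a + b\n\n"
    multilingual = "Hello, 世界! Привет, мир! مرحبا 🌍🚀 "

    return {
        "english prose": [repeat(prose)] * batch_size,
        "python source": [repeat(code)] * batch_size,
        "multilingual+emoji": [repeat(multilingual)] * batch_size,
        "random ascii": [random_ascii() for _ in range(batch_size)],
    }
```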

Usage:
    python scripts/benchmark_tail_latency.py
    python scripts/benchmark_tail_latency.py --runs 10 --batch-size 256 --encoding o200k_base --threads 1,4,8

Development

Successfully merging this pull request may close these issues.

encode_ordinary_batch — reproducible multi-second tail stalls on 32-core box (o200k_base, num_threads=8)
