perf research: fetch #1
Draft
crowlbot wants to merge 4 commits into
Cross-runtime HTTP servers, fetch client driver, and Headers/Request/Response microbenches for comparing Deno against Node 22 LTS and Bun 1.3.
…ess, results analyzer
- results.jsonl: wrk runs of hello/headers/echo/bigbody on Deno/Node/Bun
- micro_results.jsonl: Headers + Request/Response per-op timings
- profiles/*.prof.txt + *.v8.log.gz: V8 --prof attribution for the headers and request/response microbenches, plus a server-mode prof sampled at ~40k rps on GET /hello.
- run_servers.sh: clean lingering listeners, fix POST body in wrk lua, parse Latency Distribution p99 correctly.
- versions.txt: deno 2.7.14, node v22.22.2, bun 1.3.14.
This was referenced May 18, 2026
… measurements
Adds a quantitative comparison of TypedArrayPrototypeSlice (current extractBody defensive memcpy) against ArrayBuffer.prototype.transfer (proposed architectural replacement) on this host's Deno 2.8.0 release build, and a landing-path analysis for upstream.
Key numbers (12 runs × 10k iters, median):
- 1 MB: slice 559 us vs transfer 208 us (2.69x speedup)
- 4 MB: slice 2.37 ms vs transfer 826 us (2.87x speedup)
The ~350 us saved per 1 MB response is enough to close the parent report's /bigbody Deno-vs-Node rps gap (1246 -> ~2200 theoretical, within ~10 percent of Node 22).
Recommended landing: opt-in ResponseInit.transfer flag, ~100-200 LoC, spec-compatible (additive), no behaviour change for code that does not opt in. Three other landing options were ranked but rejected (Deno.serve auto-detection needs escape analysis; copy-in-Rust is the same memcpy count; silent transfer is a breaking change).
perf research: fetch
Macro performance research on Deno's implementation of fetch/Request/Response/Headers (client + server, body consumption, streaming bodies).
This PR contains only benchmark scripts and committed profile artifacts, with no production code changes. Bench scripts live under tools/perf_research/fetch/; flamegraph-equivalent V8 prof attribution is committed under tools/perf_research/fetch/profiles/. The report below is the deliverable.
Methodology
unreliable, so the headline is same-host ratios vs Node 22 LTS and Bun.
--prof in-process. perf/samply need kernel.perf_event_paranoid<=1, but the container is locked at 3 and sysctl is denied even via sudo, so all attribution below is JS-side. Native-side attribution (hyper, op-dispatch, libc) is left unranked.
wrk -c64 -t4 -d10s on four routes (hello / headers-count / echo-JSON / 1 MB body), plus microbenches over Headers / Request / Response ops on portable JS.
Pinned: deno 2.7.14 (v8 14.7.173.20-rusty), node v22.22.2, bun 1.3.14.

Headline ratios: HTTP server (rps; higher is better)
- GET /hello (≈25 B JSON)
- GET /headers (count headers, tiny resp)
- POST /echo (≈25 B JSON in, echoed)
- GET /bigbody (1 MB fresh Uint8Array)
Reads:
ahead on the JSON echo and the hello route; Bun is ~65–75 % faster on the
same hardware.
/bigbody is the outlier: Node is ~2× faster than Deno even though Deno beats Node on the tiny-body routes. The split shows up cleanly in the microbenches below.

Headline ratios: microbench (ns/op; lower is better)
Headers
- headers_construct_obj
- headers_construct_arr
- headers_get_hit
- headers_get_miss
- headers_set_fresh
- headers_append_fresh
- headers_iter (for…of over 12 entries)
- headers_has_hit
- headers_has_miss
Request / Response
- response_construct_string
- response_construct_u8
- response_construct_with_headers ({headers:{...}})
- response_construct_reused_headers ({headers: hdr})
- request_construct
- response_text_smallish
- response_json_smallish
Reads:
Response is undici's pure-JS implementation; the comparison is asymmetric).
response_construct_with_headers is the only path where it is roughly on par with Bun (Deno 569 ns/op vs Bun 645 ns/op).
response_construct_reused_headers (3.1×) is surprising: passing an existing Headers instance is slower in Deno than passing an object literal. See hypothesis #2 below.

Flamegraph attribution (V8 --prof)
Full profiles committed at:
headers_micro.v8.log.gz)
request_response_micro.v8.log.gz)
server_hello.v8.log.gz), sampled while serving 64-connection GET /hello at ~40 k rps.

Headers microbench top JS frames (34 551 ticks)
Bottom-up shows ~27 % of nonlib ticks land inside the ArrayPrototypeSort call that lives in Headers[_iterableHeaders]'s getter. The next ~12 % lands in StringToLowerCaseIntl, attributable to byteLowerCase calls in both the iter rebuild and the per-op linear scans of getHeader/has/set.

Server GET /hello top JS frames (2 619 ticks; Deno binary = 52 %, JS = 43 %)
The leading JS hotspot under live HTTP load is the record<ByteString, ByteString> webidl converter: it runs on every new Response(bytes, { headers: { "content-type": "…" } }).

Where the cost lives
- ext/fetch/20_headers.js:227-284: the _iterableHeaders getter only caches when guard === "immutable". Server Request.headers is "immutable" (ext/http/00_serve.ts:561), but most user-built Headers and Response.headers are not.
- .get/.has/.set/.delete/.append linear-scan + byteLowerCase per entry (ext/fetch/20_headers.js:166-176, :381, :405, :321, :122): [[name, value], …] with case preserved. Every op does byteLowerCase(query) + byteLowerCase(entry[0]) per entry.
- ByteString converter runs isByteString (char-by-char loop) on every arg (ext/webidl/00_webidl.js:426-447): headers.get("name") etc.; also runs for every kv pair in new Response(_, {headers: obj}).
- extractBody slices the ArrayBufferView/ArrayBuffer source on every new Response(bytes) (ext/fetch/22_body.js:437-459): TypedArrayPrototypeSlice(object) makes a defensive copy even when the caller is going to discard the original. ~1 MB copy on /bigbody.
- newInnerResponse allocates a 9-property object literal with closures on every Response (ext/fetch/23_response.js:136-150): CreateShallowObjectLiteral shows up as 3.6 % of JS time under server load.
- initializeAResponse does a linear scan to check for an existing Content-Type entry (ext/fetch/23_response.js:218-230).
- InnerRequest.headerList allocates N 2-tuples by ArrayPrototypePush per request (ext/http/00_serve.ts:405-415): [k,v,k,v,…]; JS re-pairs into nested arrays before exposing as Headers.
- op_fetch re-parses Vec<(ByteString, ByteString)> headers to HeaderName/HeaderValue per send (ext/fetch/lib.rs:513-520).

Ranked architectural hypotheses
Each entry: impact estimate × confidence based on profile attribution, the
ratio gap, and how commonly the code path appears in real-world fetch use.
"Single-digit-percent" wins are deliberately omitted per the prompt's scope.

H1: Headers storage as [[name, value], …] forces every op into an O(n) scan with two byteLowerCase calls per entry (HIGH × HIGH)
.get_hit 3.2× Node / 8.2× Bun. .has_miss 2.6× Node / 5.2× Bun. Bottom-up: 12.4 % nonlib in StringToLowerCaseIntl, attributable almost entirely to the byteLowerCase calls in getHeader, has, set, delete, appendHeader.
un-indexed (ext/fetch/20_headers.js:223). Every op normalizes both the query and each entry's name on every call. Bun stores Headers in a hashmap with lowercase keys; undici stores entries with their lowercase index pre-computed.
byteLowerCase calls; a hashmap or pre-lowercased index keyed by interned lowercase names would cut .get/.has to constant time and drop the StringToLowerCaseIntl ticks to near zero. For the server hot path (Deno.serve handlers reading req.headers.get("x")), this is the leading per-request JS cost after webidl conversion. Conservative: 2-3× on Headers-heavy handlers.
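
To make the H1 contrast concrete, here is a minimal sketch in portable JS of the two storage shapes: a case-preserving pair list that is scanned on every lookup, and a map keyed by names lowercased once at insertion. The names getLinear and IndexedHeaderList are hypothetical and exist only for illustration; this is not Deno's, Bun's, or undici's actual code, and it ignores case preservation of the original names.

```javascript
// Hypothetical illustration only; not the code in ext/fetch/20_headers.js.

// Linear-scan storage: case-preserved [[name, value], ...] pairs.
function getLinear(list, name) {
  const query = name.toLowerCase();
  const values = [];
  for (const [entryName, value] of list) {
    // Every lookup lowercases every stored name again.
    if (entryName.toLowerCase() === query) values.push(value);
  }
  return values.length === 0 ? null : values.join(", ");
}

// Indexed storage: names are lowercased once, at insertion time.
class IndexedHeaderList {
  #index = new Map(); // lowercase name -> array of values

  append(name, value) {
    const key = name.toLowerCase();
    const existing = this.#index.get(key);
    if (existing) existing.push(value);
    else this.#index.set(key, [value]);
  }

  get(name) {
    const values = this.#index.get(name.toLowerCase());
    return values ? values.join(", ") : null;
  }

  has(name) {
    return this.#index.has(name.toLowerCase());
  }
}

const list = [
  ["Content-Type", "text/plain"],
  ["X-Trace", "a"],
  ["x-trace", "b"],
];
const indexed = new IndexedHeaderList();
for (const [n, v] of list) indexed.append(n, v);

console.log(getLinear(list, "X-TRACE")); // "a, b"
console.log(indexed.get("X-TRACE")); // "a, b"
```

In the indexed form a lookup lowercases only the query, so the per-entry lowercasing disappears from reads. A real implementation would still need to keep insertion order and original casing where the spec or callers depend on them.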

H2: _iterableHeaders rebuilds and sorts the iteration view on every for…of for non-immutable Headers (HIGH × HIGH)
headers_iter 73× Node / 15× Bun. Bottom-up: 8.8 % in ArrayTimSort, 5.5 % in the sort callback at 20_headers.js:272, 5.2 % in StringLessThan, 4.6 % in KeyedStoreIC building the entries array.
ext/fetch/20_headers.js:227-284 only triggers when guard === "immutable". Response.headers is "response", user-built new Headers(...) is "none", and any mutable Request created in JS is "request", all of which fall through and rebuild + sort on every for…of and on every internal headersEntries(...) call (which fans out into JSON.stringify(req.headers) in user code, async-iterator protocol of Response headers, OpenTelemetry propagator extraction at ext/http/00_serve.ts:667-682, etc.).
(append/set/delete) and reusing it across reads would land the per-iteration cost in the sub-microsecond range Node and Bun achieve. Removes the ArrayTimSort + StringLessThan + KeyedStoreIC entries from the prof entirely. Plausible 10–50× on the iter microbench; impact on end-to-end server rps is smaller because Deno's HTTP path uses internal flat-array access (inner.headerList), but every userland iteration on a response-builder pattern (for (const [k, v] of someHeaders)) is on this path.

H3: extractBody always copies the caller's Uint8Array/ArrayBuffer (HIGH × MEDIUM)
/bigbody (1 MB GET): Node 2.4 k rps, Deno 1.2 k rps, a 1.95× gap even though Deno wins on every smaller route. Server prof shows 3.0 % nonlib in TypedArrayPrototypeSlice and 4.0 % in CreateTypedArray.
ext/fetch/22_body.js:437-459 ends each ArrayBufferView/ArrayBuffer branch with source = TypedArrayPrototypeSlice(object). This is a defensive copy: the bytes are written back through op_http_set_response_body_bytes, but the slice happens unconditionally even when the caller is a Deno.serve handler that allocated the buffer inline and immediately drops it.
copy doubles the byte-throughput ceiling, which matches the ~2× gap observed against Node. For typical (<8 KB) responses the cost is in the noise. Closes the /bigbody gap, but not the /hello gap.

H4: record<ByteString, ByteString> webidl converter is the leading JS-side cost of every new Response(_, { headers: {...} }) and new Headers({...}) (MEDIUM × HIGH)
GET /hello prof: 8.6 % nonlib in ext/webidl/00_webidl.js:1101 (createRecordConverter). This converter runs every time a handler returns new Response("…", { headers: { "content-type": "…" } }).
webidl.converters.ByteString(key) → DOMString round-trip + isByteString char loop, then the same for the value, then appendHeader, which itself goes through the H1 path. The {headers: hdr} form (passing an existing Headers instance) is 3.14× slower than Bun in the microbench (response_construct_reused_headers 1 385 ns/op vs Bun 441 ns/op), because it serializes-then-deserializes through the same path instead of using a shared internal pointer.
{headers: {"content-type": "<known-token>"}} (the literal-string init pattern produced by virtually every handler) and a true zero-copy reuse for {headers: existingHeadersInstance} would knock 5-8 % of total CPU off the server hot path. Smaller per call than H1/H2, but pervasive.

H5: InnerRequest.headerList re-pairs the op's flat [k,v,k,v,…] into N nested [name, value] tuples per request (MEDIUM × MEDIUM)
individually huge, but ext/http/00_serve.ts:405-415 runs once per request that calls req.headers. With N=10 typical request headers that's 10 fresh 2-element arrays + the Headers wrapper + the case-preserving entries machinery downstream.
(ext/http/http_next.rs:466-533) returns a flat string array specifically to be cheap on the V8 side; the JS layer then pairs them up for _headerList storage, defeating the flat-array efficiency. A headerList that internally remained a flat array (or that was lazily indexed) would avoid the per-request N allocations.
allocations + GC pressure. Compounds with H1 because every subsequent headers.get() then walks the nested representation.

H6: newInnerResponse allocates a 9-property object literal with two closures on every response, even for the fast-path bytes route (LOW × MEDIUM)
CreateShallowObjectLiteral. Smaller than H1-H4 but recurs at every response.
ext/fetch/23_response.js:136-150. The closures (url(), etc.) are needed for spec-level lazy URL list handling, but the fast path (server returning a fresh new Response(bytes)) never touches them.
few percent of the JS-side cost. Below the prompt's "single-digit-percent" threshold individually, but worth mentioning because it's a clear missing fast path. Listed as the lowest-ranked finding.

Other observations (not ranked; context only)
response_construct_string is 127 ns/op in Deno vs 5 ns/op in Bun. Bun has an intrinsic Response-from-string path; matching that would require changes at the V8/op boundary, not in JS. Worth investigating as part of the broader cross-cutting "small new Uint8Array(N) allocation" finding that the prior research tick traced to rusty-v8-147.4.0/.gn disabling on-heap TypedArrays for embedder pointer stability. That finding will live in its own cross-cutting writeup, but it is the underlying cause of several of the smaller gaps observed here (e.g. the bigbody ratio is partially driven by libc malloc/calloc time).
fetch client URL is parsed twice (ext/fetch/lib.rs:436 after the JS-side new URL() parse). Not a fetch-PR optimization: it belongs in perf-research/url and is mentioned here only to note that the architectural pattern of "JS parses, op re-parses" recurs across multiple fetch surfaces (headers in H5, URL here).
Reproduction
See tools/perf_research/fetch/README.md.

    cargo build --release --bin deno
    cd tools/perf_research/fetch
    bash run_micro.sh
    bash run_servers.sh 10 64
    bash run_v8_prof.sh
    node analyze.js

The microbench script reports ns/op; run_servers.sh writes results.jsonl; run_v8_prof.sh writes profiles/*.prof.txt. Pinned versions live in versions.txt.
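
As a rough idea of what an ns/op measurement involves, the following is a minimal harness in portable JS. It is not the code in run_micro.sh: the function name nsPerOp, the iteration counts, and the JSON field names are assumptions made for illustration.

```javascript
// Hypothetical ns/op harness; runs on any runtime with performance.now().
function nsPerOp(fn, iters = 200_000, warmup = 20_000) {
  let sink = 0; // consume results so the call is not trivially dead code

  // Warm-up pass so the JIT has a chance to optimize fn before timing.
  for (let i = 0; i < warmup; i++) sink += fn() ? 1 : 0;

  const start = performance.now();
  for (let i = 0; i < iters; i++) sink += fn() ? 1 : 0;
  const elapsedMs = performance.now() - start;

  // Milliseconds to nanoseconds, divided by the number of timed operations.
  return { nsPerOp: (elapsedMs * 1e6) / iters, sink };
}

const headers = new Headers({ "content-type": "text/plain", "x-a": "1" });
const result = nsPerOp(() => headers.has("content-type"));

// One JSON object per line, in the spirit of a .jsonl results file.
console.log(
  JSON.stringify({ bench: "headers_has_hit", ns_per_op: result.nsPerOp }),
);
```

A single run like this is noisy; taking the median over several runs, as the transfer measurements in this PR do, gives more stable numbers.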