perf research: fetch by crowlbot · Pull Request #1 · crowlbot/deno

crowlbot · 2026-05-14T14:47:08Z

perf research: fetch

Macro performance research on Deno's implementation of fetch / Request /
Response / Headers (client + server, body consumption, streaming bodies).

This PR contains only benchmark scripts and committed profile artifacts — no
production code changes. Bench scripts live under
tools/perf_research/fetch/; flamegraph-equivalent V8 prof attribution is
committed under tools/perf_research/fetch/profiles/. The report below is the
deliverable.

Methodology

Ratios over absolute numbers. Host is Docker on Proxmox; absolute rps is
unreliable, so the headline is same-host ratios vs Node 22 LTS and Bun.
Flamegraph attribution. V8 --prof in-process. perf / samply need
kernel.perf_event_paranoid<=1 but the container is locked at 3 and
sysctl is denied even via sudo — so all attribution below is JS-side.
Native-side attribution (hyper, op-dispatch, libc) is left unranked.
Realistic workloads. Concurrent HTTP server under wrk -c64 -t4 -d10s
on four routes (hello / headers-count / echo-JSON / 1 MB body), plus
microbenches over Headers / Request / Response ops on portable JS.
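
The committed microbench scripts are not reproduced in this report. As a rough illustration of what a portable ns/op loop can look like, here is a minimal sketch; `benchNsPerOp`, the warmup count, and the iteration count are hypothetical and not taken from the committed harness.

```javascript
// Hypothetical helper: times `fn` and reports nanoseconds per call.
// Uses only globals that exist in Deno, Node, and Bun.
function benchNsPerOp(fn, { warmup = 1_000, iters = 100_000 } = {}) {
  let sink = 0; // keep results observable so the loop is not optimized away
  for (let i = 0; i < warmup; i++) sink += fn() ? 1 : 0;
  const start = performance.now(); // fractional milliseconds
  for (let i = 0; i < iters; i++) sink += fn() ? 1 : 0;
  const elapsedMs = performance.now() - start;
  return { nsPerOp: (elapsedMs * 1e6) / iters, sink };
}

const benchHeaders = new Headers({
  "content-type": "application/json",
  "x-a": "1",
});
const hit = benchNsPerOp(() => benchHeaders.get("content-type"));
const miss = benchNsPerOp(() => benchHeaders.get("x-missing") === null);
console.log(`headers_get_hit  ${hit.nsPerOp.toFixed(0)} ns/op`);
console.log(`headers_get_miss ${miss.nsPerOp.toFixed(0)} ns/op`);
```

A single timed loop like this is noisy; taking the median over several runs, as the later transfer measurements do, is the safer way to report it.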

Pinned: deno 2.7.14 (v8 14.7.173.20-rusty), node v22.22.2, bun 1.3.14.

Headline ratios — HTTP server (rps; higher is better)

route	Deno	Node	Bun	Node/Deno	Bun/Deno
`GET /hello` (≈25 B JSON)	39 775	32 226	69 190	0.81	1.74
`GET /headers` (count headers, tiny resp)	33 803	37 095	55 933	1.10	1.65
`POST /echo` (≈25 B JSON in, echoed)	30 408	18 456	51 144	0.61	1.68
`GET /bigbody` (1 MB fresh `Uint8Array`)	1 246	2 435	1 082	1.95	0.87

Reads:

On small request/response paths, Deno is ahead of Node on /hello (~1.23×)
and /echo (~1.65×) and about 10 % behind on /headers; Bun is ~65–75 %
faster than Deno on all three on the same hardware.
/bigbody is the outlier: Node is ~2× faster than Deno even though Deno
beats Node on two of the three tiny-body routes, and Bun drops slightly
behind Deno (0.87×). The split shows up cleanly in the microbenches below.
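
For readers who want to recheck the ratio columns, the following sketch recomputes them from the rps figures in the table above. The object layout and the function name `ratiosVsDeno` are illustrative only.

```javascript
// rps figures copied from the HTTP server table above.
const rps = {
  "/hello": { deno: 39775, node: 32226, bun: 69190 },
  "/headers": { deno: 33803, node: 37095, bun: 55933 },
  "/echo": { deno: 30408, node: 18456, bun: 51144 },
  "/bigbody": { deno: 1246, node: 2435, bun: 1082 },
};

// A ratio above 1 means the other runtime served more requests per second
// than Deno on that route.
function ratiosVsDeno(table) {
  const out = {};
  for (const [route, r] of Object.entries(table)) {
    out[route] = {
      nodeOverDeno: Number((r.node / r.deno).toFixed(2)),
      bunOverDeno: Number((r.bun / r.deno).toFixed(2)),
    };
  }
  return out;
}

const ratios = ratiosVsDeno(rps);
console.log(ratios);
```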

Headline ratios — microbench (ns/op; lower is better)

Headers

op	Deno	Node	Bun	Deno/Node	Deno/Bun
`headers_construct_obj`	4 245	3 691	1 282	1.15	3.31
`headers_construct_arr`	4 128	4 946	1 638	0.84	2.52
`headers_get_hit`	303	95	37	3.20	8.24
`headers_get_miss`	257	96	49	2.69	5.25
`headers_set_fresh`	179	271	337	0.66	0.53
`headers_append_fresh`	231	455	409	0.51	0.56
`headers_iter` (for…of over 12 entries)	28 996	398	1 991	72.9	14.6
`headers_has_hit`	141	98	30	1.45	4.67
`headers_has_miss`	245	93	47	2.63	5.24

Request / Response

op	Deno	Node	Bun
`response_construct_string`	127	7 632	5
`response_construct_u8`	1 069	10 480	256
`response_construct_with_headers` (`{headers:{...}}`)	569	8 519	645
`response_construct_reused_headers` (`{headers: hdr}`)	1 385	8 539	441
`request_construct`	3 640	11 859	1 333
`response_text_smallish`	1 626	10 930	735
`response_json_smallish`	3 938	12 592	1 744

Reads:

Deno is 6–60× faster than Node on Response construction and
response_text_smallish, and about 3× faster on request_construct and
response_json_smallish (Node's Response is undici's pure-JS
implementation; the comparison is asymmetric).
Against Bun, Deno is 3–25× slower on Response construction, except for
response_construct_with_headers, the only path where it is roughly on
par with Bun (Deno 569 ns/op vs Bun 645 ns/op).
The Bun ↔ Deno gap on response_construct_reused_headers (3.1×) is
surprising: passing an existing Headers instance is slower in Deno
than passing an object literal (1 385 vs 569 ns/op). See H2 and H4 below.

Flamegraph attribution (V8 `--prof`)

Full profiles committed at:

profiles/headers_micro.prof.txt (raw log: headers_micro.v8.log.gz)
profiles/request_response_micro.prof.txt (raw log: request_response_micro.v8.log.gz)
profiles/server_hello.prof.txt (raw log: server_hello.v8.log.gz) — sampled while serving 64-connection GET /hello at ~40 k rps.

Headers microbench top JS frames (34 551 ticks)

   ticks  total  nonlib   name
   3068    8.9%   12.4%   Builtin: StringToLowerCaseIntl          // byteLowerCase
   3047    8.8%   12.3%   Builtin: ArrayTimSort                   // _iterableHeaders sort
   1913    5.5%    7.7%   JS: *<anonymous> ext:deno_fetch/20_headers.js:272:7   // sort cb
   1808    5.2%    7.3%   Builtin: StringLessThan                 // inside the sort
   1581    4.6%    6.4%   Builtin: KeyedStoreIC_Megamorphic       // entries[i][...] writes
   1356    3.9%    5.5%   Builtin: KeyedLoadIC_Megamorphic        // entries[i][...] reads
    962    2.8%    3.9%   JS: *<anonymous> ext:deno_webidl/00_webidl.js:1101:10 // record<ByteString,ByteString>
    434    1.3%    1.8%   JS: *<anonymous> ext:deno_webidl/00_webidl.js:932:19  // isByteString

Bottom-up shows ~27 % of nonlib ticks land inside the ArrayPrototypeSort
call that lives in Headers[_iterableHeaders]'s getter. The next ~12 %
lands in StringToLowerCaseIntl, attributable to byteLowerCase calls in
both the iter rebuild and the per-op linear scans of getHeader/has/set.

Server `GET /hello` top JS frames (2 619 ticks; Deno binary = 52 %, JS = 43 %)

     98    3.7%    8.6%   JS: *<anonymous> ext:deno_webidl/00_webidl.js:1101:10 // record<ByteString,ByteString>
     57    2.2%    5.0%   Builtin: KeyedStoreIC
     48    1.8%    4.2%   Builtin: LoadIC
     45    1.7%    4.0%   Builtin: CreateTypedArray              // new Uint8Array(...)
     41    1.6%    3.6%   Builtin: CreateShallowObjectLiteral    // newInnerResponse(), etc.
     34    1.3%    3.0%   Builtin: TypedArrayPrototypeSlice      // extractBody's copy
     11    0.4%    1.0%   JS: *get ext:deno_fetch/22_body.js:290:10  // bodyUsed getter

The leading JS hotspot under live HTTP load is the record<ByteString, ByteString> webidl converter — it runs on every
new Response(bytes, { headers: { "content-type": "…" } }).

Where the cost lives

Finding	File:line	Notes
Headers iter rebuilds + sorts every for-of (sort dominates the prof)	`ext/fetch/20_headers.js:227-284`	`_iterableHeaders` getter only caches when `guard === "immutable"`. Server `Request.headers` is "immutable" (`ext/http/00_serve.ts:561`), but most user-built Headers and `Response.headers` are not.
Headers `.get`/`.has`/`.set`/`.delete`/`.append` linear-scan + `byteLowerCase` per entry	`ext/fetch/20_headers.js:166-176`, :381, :405, :321, :122	Storage is `[[name, value], …]` with case preserved. Every op does `byteLowerCase(query)` + `byteLowerCase(entry[0])` per entry.
WebIDL `ByteString` converter runs `isByteString` (char-by-char loop) on every arg	`ext/webidl/00_webidl.js:426-447`	Adds O(name.length) per `headers.get("name")` etc.; also runs for every kv pair in `new Response(_, {headers: obj})`.
`extractBody` slices `ArrayBufferView` / `ArrayBuffer` source on every `new Response(bytes)`	`ext/fetch/22_body.js:437-459`	`TypedArrayPrototypeSlice(object)` makes a defensive copy even when the caller is going to discard the original. ~1 MB copy on `/bigbody`.
`newInnerResponse` allocates a 9-property object literal with closures on every Response	`ext/fetch/23_response.js:136-150`	`CreateShallowObjectLiteral` shows up as 3.6 % of nonlib ticks in the server `GET /hello` prof.
`initializeAResponse` does a linear scan to check for an existing Content-Type entry	`ext/fetch/23_response.js:218-230`	Pays the case-insensitive linear scan even when the headers were just built fresh and are known not to contain Content-Type.
`InnerRequest.headerList` allocates N 2-tuples by `ArrayPrototypePush` per request	`ext/http/00_serve.ts:405-415`	Op returns flat `[k,v,k,v,…]`; JS re-pairs into nested arrays before exposing as `Headers`.
`op_fetch` re-parses `Vec<(ByteString, ByteString)>` headers to `HeaderName`/`HeaderValue` per send	`ext/fetch/lib.rs:513-520`	JS already iterates Headers to flatten before the op call; Rust then iterates again.

Ranked architectural hypotheses

Each entry: impact estimate × confidence based on profile attribution, the
ratio gap, and how commonly the code path appears in real-world fetch use.
"Single-digit-percent" wins are deliberately omitted per the prompt's scope.

H1 — Headers storage as `[[name, value], …]` forces every op into an O(n) scan with two `byteLowerCase` calls per entry (HIGH × HIGH)

Evidence. headers_get_hit 3.2× Node / 8.2× Bun. headers_has_miss 2.6× Node /
5.2× Bun. Bottom-up: 12.4 % nonlib in StringToLowerCaseIntl, attributable
to the byteLowerCase calls in getHeader, has, set, delete, and
appendHeader, plus the iteration rebuild covered in H2.
Architectural root. Headers stores entries case-preserved and
un-indexed (ext/fetch/20_headers.js:223). Every op normalizes both the
query and each entry's name on every call. Bun stores Headers in a
hashmap with lowercase keys; undici stores entries with their lowercase
index pre-computed.
Estimated impact if fixed. Per-op cost dominated by the scan + the two
byteLowerCase calls; a hashmap or pre-lowercased index keyed by interned
lowercase names would cut .get/.has to constant time and drop the
StringToLowerCaseIntl ticks to near zero. For the server hot path
(Deno.serve handlers reading req.headers.get("x")), this is the leading
per-request JS cost after webidl conversion. Conservative: 2-3× on
Headers-heavy handlers.
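
To make the proposed shape concrete, here is a minimal sketch of a header list with a lowercased index. It is not the layout used by Deno, Bun, or undici; `IndexedHeaderList` and its fields are hypothetical, and `set` is simplified (it re-appends at the end rather than replacing the first match in place).

```javascript
// Hypothetical header list: keeps insertion order and original casing, plus
// a Map from lowercased name to the positions of the matching entries.
class IndexedHeaderList {
  #entries = []; // [originalName, value] pairs, or null once deleted
  #index = new Map(); // lowercased name -> array of positions in #entries

  append(name, value) {
    const key = name.toLowerCase(); // lowercase once, at write time
    const pos = this.#entries.push([name, value]) - 1;
    const slots = this.#index.get(key);
    if (slots) slots.push(pos);
    else this.#index.set(key, [pos]);
  }

  has(name) {
    return this.#index.has(name.toLowerCase());
  }

  // Values for the same name are joined with ", ", as Headers.get() does.
  get(name) {
    const slots = this.#index.get(name.toLowerCase());
    if (!slots) return null;
    return slots.map((pos) => this.#entries[pos][1]).join(", ");
  }

  delete(name) {
    const key = name.toLowerCase();
    const slots = this.#index.get(key);
    if (!slots) return;
    for (const pos of slots) this.#entries[pos] = null; // tombstone
    this.#index.delete(key);
  }

  set(name, value) {
    this.delete(name);
    this.append(name, value);
  }
}

const indexed = new IndexedHeaderList();
indexed.append("Content-Type", "text/plain");
indexed.append("X-Tag", "a");
indexed.append("x-tag", "b");
console.log(indexed.get("content-type")); // "text/plain"
console.log(indexed.get("X-TAG")); // "a, b"
```

Each lookup still lowercases the query once, but no longer lowercases every stored name, which is the part the profile attributes to StringToLowerCaseIntl.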

H2 — `_iterableHeaders` rebuilds and sorts the iteration view on every `for…of` for non-immutable Headers (HIGH × HIGH)

Evidence. headers_iter 73× Node / 15× Bun. Bottom-up (share of total
ticks): 8.8 % in ArrayTimSort, 5.5 % in the sort callback at
20_headers.js:272, 5.2 % in StringLessThan, 4.6 % in
KeyedStoreIC_Megamorphic building the entries array.
Architectural root. The cache at
ext/fetch/20_headers.js:227-284
only triggers when guard === "immutable". Response.headers is "response",
user-built new Headers(...) is "none", and any mutable Request created
in JS is "request"; all of these fall through and rebuild + sort on
every for…of and on every internal headersEntries(...) call (reached
from user code such as Object.fromEntries(headers) or [...headers], the
OpenTelemetry propagator extraction at
ext/http/00_serve.ts:667-682, etc.).
Estimated impact if fixed. Invalidating the cache only on mutations
(append/set/delete) and reusing it across reads would land the
per-iteration cost in the sub-microsecond range Node and Bun achieve.
Removes the ArrayTimSort + StringLessThan + KeyedStoreIC entries from the
prof entirely. Plausible 10–50× on the iter microbench; impact on
end-to-end server rps is smaller because Deno's HTTP path uses internal
flat-array access (inner.headerList), but every userland iteration on a
response-builder pattern (for (const [k, v] of someHeaders)) is on this
path.
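
The invalidate-on-mutation idea can be sketched as follows. This is an illustration under simplifying assumptions, not the code in 20_headers.js: `CachedIterHeaders` is a hypothetical class, names are lowercased at write time, and same-name values are not combined the way real Headers iteration combines them.

```javascript
// Hypothetical wrapper: builds the sorted iteration view once and reuses it
// until a mutation invalidates it.
class CachedIterHeaders {
  #entries = []; // [lowercasedName, value] pairs in insertion order
  #sorted = null; // cached sorted view, or null when stale
  sortCount = 0; // instrumentation for this sketch only

  append(name, value) {
    this.#entries.push([name.toLowerCase(), value]);
    this.#sorted = null; // any mutation invalidates the cache
  }

  delete(name) {
    const key = name.toLowerCase();
    this.#entries = this.#entries.filter(([n]) => n !== key);
    this.#sorted = null;
  }

  #view() {
    if (this.#sorted === null) {
      this.sortCount++;
      // Array.prototype.sort is stable, so equal names keep insertion order.
      this.#sorted = [...this.#entries].sort(([a], [b]) =>
        a < b ? -1 : a > b ? 1 : 0
      );
    }
    return this.#sorted;
  }

  *[Symbol.iterator]() {
    // Yield fresh pairs so callers cannot mutate the cached view.
    for (const [name, value] of this.#view()) yield [name, value];
  }
}

const cached = new CachedIterHeaders();
cached.append("X-B", "2");
cached.append("X-A", "1");
const firstPass = [...cached];
const secondPass = [...cached]; // served from the cache
cached.append("X-C", "3"); // invalidates the cache
const thirdPass = [...cached];
console.log(cached.sortCount); // 2
```

Three iterations cost two sorts here; without the cache they would cost three, and the gap grows with every additional read between mutations.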

H3 — `extractBody` always copies the caller's `Uint8Array` / `ArrayBuffer` (HIGH × MEDIUM)

Evidence. /bigbody (1 MB GET): Node 2.4 k rps, Deno 1.2 k rps, a
1.95× gap even though Deno beats Node on /hello and /echo. The server
prof (sampled on GET /hello, not /bigbody) shows 3.0 % nonlib in
TypedArrayPrototypeSlice and 4.0 % in CreateTypedArray.
Architectural root.
ext/fetch/22_body.js:437-459 ends each
ArrayBufferView/ArrayBuffer branch with
source = TypedArrayPrototypeSlice(object). This is a defensive copy: the
bytes are written back through
op_http_set_response_body_bytes, but the slice happens unconditionally
even when the caller is a Deno.serve handler that allocated the buffer
inline and immediately drops it.
Estimated impact if fixed. For 1 MB responses, eliminating one full
copy raises the byte-throughput ceiling. If that copy accounts for
roughly half of the per-response time, it would explain the 1.95× gap
observed against Node; native-side time is unranked in this report, so
that share is an estimate. For typical (<8 KB) responses the cost is in
the noise. This targets the /bigbody gap against Node, not the
small-body gap against Bun.
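
The difference between copying and taking ownership can be sketched in plain JS. `takeBody` and its `transfer` option are hypothetical and are not an existing Deno API; the sketch falls back to a copy when `ArrayBuffer.prototype.transfer` is unavailable or when the view does not cover its whole buffer.

```javascript
// Hypothetical helper: returns bytes the response can own. With `transfer`
// set, the caller's buffer is detached instead of copied.
function takeBody(bytes, { transfer = false } = {}) {
  const canTransfer = typeof bytes.buffer.transfer === "function";
  const coversWholeBuffer =
    bytes.byteOffset === 0 && bytes.byteLength === bytes.buffer.byteLength;
  if (transfer && canTransfer && coversWholeBuffer) {
    // Moves ownership; `bytes` becomes zero-length afterwards.
    return new Uint8Array(bytes.buffer.transfer());
  }
  return bytes.slice(); // defensive copy, the behaviour H3 describes
}

const kept = new Uint8Array([1, 2, 3]);
const copyOfKept = takeBody(kept);
kept[0] = 9; // the copy is unaffected

const givenAway = new Uint8Array([4, 5, 6]);
const owned = takeBody(givenAway, { transfer: true });
console.log(copyOfKept[0], owned[0], givenAway.length);
```

Detaching the caller's buffer is observable, which is why a silent transfer would change behaviour for code that reuses the buffer after constructing the Response.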

H4 — `record<ByteString, ByteString>` webidl converter is the leading JS-side cost of every `new Response(_, { headers: {...} })` and `new Headers({...})` (MEDIUM × HIGH)

Evidence. Server GET /hello prof: 8.6 % nonlib in
ext/webidl/00_webidl.js:1101 (createRecordConverter). This converter
runs every time a handler returns new Response("…", { headers: { "content-type": "…" } }).
Architectural root. Each kv pair pays:
webidl.converters.ByteString(key) → DOMString round-trip + isByteString
char loop, then the same for the value, then appendHeader which itself
paths through H1. The {headers: hdr} form (passing an existing Headers
instance) is 3.14× slower than Bun in the microbench
(response_construct_reused_headers 1 385 ns/op vs Bun 441 ns/op),
because it serializes-then-deserializes through the same path
instead of using a shared internal pointer.
Estimated impact if fixed. A typed fast path for
{headers: {"content-type": "<known-token>"}} (the literal-string init
pattern most handlers use) and a true zero-copy reuse
for {headers: existingHeadersInstance} would knock an estimated 5-8 % of
CPU off the server hot path (the converter frame alone is 3.7 % of total
ticks and 8.6 % nonlib in the /hello prof). Smaller per call than H1/H2,
but pervasive.
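
One possible shape for such a fast path is sketched below. It is an assumption-laden illustration, not the converter in 00_webidl.js: `toByteString`, `convertHeaderRecord`, the seed set, and the cache bound are all hypothetical, and whether a Set lookup actually beats the character loop for short strings would need measuring.

```javascript
// Same check a char-by-char ByteString test performs: every UTF-16 code
// unit must be <= 0xFF.
function isByteString(s) {
  for (let i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) > 0xff) return false;
  }
  return true;
}

// Hypothetical fast path: strings validated once are remembered, so repeated
// literals such as "content-type" skip the loop on later calls.
const knownByteStrings = new Set(["content-type", "application/json"]);
const MAX_KNOWN = 256; // bound the cache so arbitrary input cannot grow it

function toByteString(s) {
  if (knownByteStrings.has(s)) return s;
  if (!isByteString(s)) throw new TypeError("Value is not a valid ByteString");
  if (knownByteStrings.size < MAX_KNOWN) knownByteStrings.add(s);
  return s;
}

// Converts a plain object into [name, value] pairs, validating both sides.
function convertHeaderRecord(init) {
  const out = [];
  for (const key of Object.keys(init)) {
    out.push([toByteString(key), toByteString(String(init[key]))]);
  }
  return out;
}

console.log(convertHeaderRecord({ "content-type": "application/json" }));
```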

H5 — `InnerRequest.headerList` re-pairs the op's flat `[k,v,k,v,…]` into N nested `[name, value]` tuples per request (MEDIUM × MEDIUM)

Evidence. Server prof: 5.0 % KeyedStoreIC + 4.2 % LoadIC are not
individually huge, but ext/http/00_serve.ts:405-415 runs once per
request that calls req.headers. With N=10 typical request headers
that's 10 fresh 2-element arrays + the Headers wrapper + the
case-preserving entries machinery downstream.
Architectural root. The Rust op
(ext/http/http_next.rs:466-533) returns
a flat string array specifically to be cheap on the V8 side; the JS layer
then pairs them up for _headerList storage, defeating the flat-array
efficiency. A headerList that internally remained a flat array (or that
was lazily indexed) would avoid the per-request N allocations.
Estimated impact if fixed. Per request: removes N (≈10) tuple
allocations + GC pressure. Compounds with H1 because every subsequent
headers.get() then walks the nested representation.
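
A flat representation with lazy pairing could look like the sketch below. `FlatHeaderView` is hypothetical, and it assumes the native side hands over names that are already lowercased, which this report does not establish.

```javascript
// Hypothetical read-only view over a flat [k0, v0, k1, v1, ...] array.
class FlatHeaderView {
  #flat;

  constructor(flat) {
    if (flat.length % 2 !== 0) throw new RangeError("expected name/value pairs");
    this.#flat = flat;
  }

  get size() {
    return this.#flat.length / 2;
  }

  // Scans the flat array directly; no [name, value] tuples are allocated.
  // Assumption: stored names are already lowercase.
  get(name) {
    const key = name.toLowerCase();
    let result = null;
    for (let i = 0; i < this.#flat.length; i += 2) {
      if (this.#flat[i] === key) {
        const value = this.#flat[i + 1];
        result = result === null ? value : `${result}, ${value}`;
      }
    }
    return result;
  }

  // Pairs are materialized lazily, only if a caller iterates.
  *entries() {
    for (let i = 0; i < this.#flat.length; i += 2) {
      yield [this.#flat[i], this.#flat[i + 1]];
    }
  }
}

const flatView = new FlatHeaderView([
  "host", "example.com",
  "accept", "*/*",
  "accept", "text/html",
]);
console.log(flatView.get("Accept")); // "*/*, text/html"
```

A handler that reads one or two headers never pays for the N tuple allocations; only code that iterates does.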

H6 — `newInnerResponse` allocates a 9-property object literal with two closures on every response, even for the fast-path bytes route (LOW × MEDIUM)

Evidence. Server prof: 3.6 % nonlib in CreateShallowObjectLiteral.
Smaller than H1-H4 but recurs at every response.
Architectural root. ext/fetch/23_response.js:136-150.
The closures (url(), etc.) are needed for spec-level lazy URL list
handling, but the fast path (server returning a fresh new Response(bytes))
never touches them.
Estimated impact if fixed. Constant per-response, on the order of a
few percent of the JS-side cost. Below the prompt's
"single-digit-percent" threshold individually, but worth mentioning because
it's a clear missing fast path. Listed as the lowest-ranked finding.
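
One way to avoid per-response closures is to hang the accessors on a shared prototype. The sketch below is illustrative only; `InnerResponseSketch` and its field names are hypothetical and do not mirror the object built in 23_response.js.

```javascript
// Hypothetical shape: methods live on the prototype, so constructing an
// instance allocates plain fields and no closures.
class InnerResponseSketch {
  constructor(status = 200, statusText = "") {
    this.type = "default";
    this.body = null;
    this.status = status;
    this.statusText = statusText;
    this.headerList = [];
    this.urlList = [];
  }

  // Last URL in the list, or null when the list is empty.
  url() {
    const list = this.urlList;
    return list.length === 0 ? null : list[list.length - 1];
  }
}

const respA = new InnerResponseSketch();
const respB = new InnerResponseSketch(404, "Not Found");
respB.urlList.push("https://example.com/a", "https://example.com/b");
console.log(respA.url(), respB.url(), respA.url === respB.url);
```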

Other observations (not ranked — context only)

response_construct_string is 127 ns/op in Deno vs 5 ns/op in Bun. Bun
has an intrinsic Response-from-string path; matching that would require
changes at the V8/op boundary, not in JS. Worth investigating alongside
the cross-cutting "small new Uint8Array(N) allocation" finding, which
the prior research tick traced to rusty-v8-147.4.0/.gn disabling
on-heap TypedArrays for embedder pointer stability. That finding will
live in its own cross-cutting writeup, but it is the underlying cause of
several of the smaller gaps observed here (e.g. the bigbody ratio is
partially driven by libc malloc/calloc time).
fetch client URL is parsed twice (ext/fetch/lib.rs:436
after the JS-side new URL() parse). Not a fetch-PR optimization; it
belongs in perf-research/url and is mentioned here only to note that
the architectural pattern of "JS parses, op re-parses" recurs across
multiple fetch surfaces (headers in the op_fetch row of "Where the cost
lives", URL here).

Reproduction

See tools/perf_research/fetch/README.md.

cargo build --release --bin deno
cd tools/perf_research/fetch
bash run_micro.sh
bash run_servers.sh 10 64
bash run_v8_prof.sh
node analyze.js

The microbench script reports ns/op; run_servers.sh writes
results.jsonl; run_v8_prof.sh writes profiles/*.prof.txt. Pinned
versions live in versions.txt.

Cross-runtime HTTP servers, fetch client driver, and Headers/Request/Response microbenches for comparing Deno against Node 22 LTS and Bun 1.3.

…ess, results analyzer

- results.jsonl: wrk runs of hello/headers/echo/bigbody on Deno/Node/Bun
- micro_results.jsonl: Headers + Request/Response per-op timings
- profiles/*.prof.txt + *.v8.log.gz: V8 --prof attribution for the headers and request/response microbenches, plus a server-mode prof sampled at ~40k rps on GET /hello.
- run_servers.sh: clean lingering listeners, fix POST body in wrk lua, parse Latency Distribution p99 correctly.
- versions.txt: deno 2.7.14, node v22.22.2, bun 1.3.14.

… measurements

Adds a quantitative comparison of TypedArrayPrototypeSlice (current extractBody defensive memcpy) against ArrayBuffer.prototype.transfer (proposed architectural replacement) on this host's Deno 2.8.0 release build, and a landing-path analysis for upstream.

Key numbers (12 runs × 10k iters, median):
- 1 MB: slice 559 us vs transfer 208 us (2.69x speedup)
- 4 MB: slice 2.37 ms vs transfer 826 us (2.87x speedup)

The ~350 us saved per 1 MB response is enough to close the parent report's /bigbody Deno-vs-Node rps gap (1246 -> ~2200 theoretical, within ~10 percent of Node 22).

Recommended landing: opt-in ResponseInit.transfer flag, ~100-200 LoC, spec-compatible (additive), no behaviour change for code that does not opt in.

Three other landing options ranked but rejected (Deno.serve auto-detection needs escape analysis; copy-in-Rust is same memcpy count; silent transfer is a breaking change).
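
The "1246 -> ~2200 theoretical" projection can be reproduced with a few lines of arithmetic. The sketch assumes the server behaves like a single saturated worker, so that 1/rps is the serialized cost of one response and the saved microseconds subtract directly; that model is an assumption of this example, not something measured.

```javascript
// Figures quoted in the commit notes and the HTTP server table.
const denoRps = 1246; // /bigbody, Deno
const nodeRps = 2435; // /bigbody, Node
const sliceUs = 559; // 1 MB TypedArrayPrototypeSlice, median
const transferUs = 208; // 1 MB ArrayBuffer.prototype.transfer, median

const usPerResponse = 1e6 / denoRps; // serialized cost of one response today
const savedUs = sliceUs - transferUs; // 351 us saved per 1 MB response
const projectedRps = 1e6 / (usPerResponse - savedUs);
const shareOfNode = projectedRps / nodeRps;

console.log(
  `projected ${projectedRps.toFixed(0)} rps, ` +
    `${(shareOfNode * 100).toFixed(0)} % of Node`,
);
```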

crowlbot added 3 commits May 14, 2026 14:05

perf-research/fetch: bench scaffolding

5edd520

perf-research/fetch: bigbody route, body throughput client, prof harn…

41abb13

This was referenced May 18, 2026

perf research: streams #4

Draft

perf research: structuredClone #5

Draft

perf research: crypto.subtle + getRandomValues #6

Draft

crowlbot mentioned this pull request May 23, 2026

perf research: ext/http (Deno.serve) #8

Draft

crowlbot wants to merge 4 commits into main from perf-research/fetch
