perf research: fetch #1

Draft
crowlbot wants to merge 4 commits into main from perf-research/fetch

perf research: fetch

Macro performance research on Deno's implementation of fetch / Request /
Response / Headers (client + server, body consumption, streaming bodies).

This PR contains only benchmark scripts and committed profile artifacts — no
production code changes. Bench scripts live under
tools/perf_research/fetch/; flamegraph-equivalent V8 prof attribution is
committed under tools/perf_research/fetch/profiles/. The report below is the
deliverable.

Methodology

  • Ratios over absolute numbers. Host is Docker on Proxmox; absolute rps is
    unreliable, so the headline is same-host ratios vs Node 22 LTS and Bun.
  • Flamegraph attribution. V8 --prof in-process. perf / samply need
    kernel.perf_event_paranoid<=1 but the container is locked at 3 and
    sysctl is denied even via sudo — so all attribution below is JS-side.
    Native-side attribution (hyper, op-dispatch, libc) is left unranked.
  • Realistic workloads. Concurrent HTTP server under wrk -c64 -t4 -d10s
    on four routes (hello / headers-count / echo-JSON / 1 MB body), plus
    microbenches over Headers / Request / Response ops on portable JS.
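
As a rough illustration of what a portable ns/op microbench of this kind can look like, here is a minimal sketch. The `bench` helper, its warm-up count, and the iteration count are assumptions made for illustration; they are not taken from the scripts under tools/perf_research/fetch/.

```typescript
// Hypothetical harness: `bench` and its parameters are illustrative only.
function bench(name: string, fn: () => unknown, iters: number = 100000) {
  // Warm up so the JIT has a chance to optimize `fn` before timing starts.
  for (let i = 0; i < 10000; i++) fn();

  let hits = 0;
  const start = performance.now();
  for (let i = 0; i < iters; i++) {
    // Consume the result so the call cannot be treated as dead code.
    if (fn() !== null) hits++;
  }
  const elapsedMs = performance.now() - start;

  const nsPerOp = (elapsedMs * 1e6) / iters;
  console.log(`${name}: ${nsPerOp.toFixed(0)} ns/op`);
  return { nsPerOp, hits };
}

const headers = new Headers({
  "content-type": "application/json",
  "x-request-id": "abc",
});

const getHit = bench("headers_get_hit", () => headers.get("content-type"));
const getMiss = bench("headers_get_miss", () => headers.get("x-missing"));
```

Because it uses only `Headers` and `performance.now()`, the same file should run unchanged on Deno, Node and Bun, which is what makes same-host ratios comparable.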

Pinned: deno 2.7.14 (v8 14.7.173.20-rusty), node v22.22.2, bun 1.3.14.

Headline ratios — HTTP server (rps; higher is better)

| route | Deno | Node | Bun | Node/Deno | Bun/Deno |
| --- | --- | --- | --- | --- | --- |
| GET /hello (≈25 B JSON) | 39 775 | 32 226 | 69 190 | 0.81 | 1.74 |
| GET /headers (count headers, tiny resp) | 33 803 | 37 095 | 55 933 | 1.10 | 1.65 |
| POST /echo (≈25 B JSON in, echoed) | 30 408 | 18 456 | 51 144 | 0.61 | 1.68 |
| GET /bigbody (1 MB fresh Uint8Array) | 1 246 | 2 435 | 1 082 | 1.95 | 0.87 |

Reads:

  • On small request/response paths, Deno is competitive with Node: ahead on
    /hello (1.23×) and the JSON echo (1.65×), about 10 % behind on /headers.
    Bun is ~65–75 % faster than Deno on all three small routes on the same
    hardware.
  • /bigbody is the outlier: Node is ~2× faster than Deno even though Deno
    beats Node on two of the three tiny-body routes, and Bun falls slightly
    behind Deno (0.87×). The split shows up cleanly in the microbenches
    below.

Headline ratios — microbench (ns/op; lower is better)

Headers

| op | Deno (ns/op) | Node (ns/op) | Bun (ns/op) | Deno/Node | Deno/Bun |
| --- | --- | --- | --- | --- | --- |
| headers_construct_obj | 4 245 | 3 691 | 1 282 | 1.15 | 3.31 |
| headers_construct_arr | 4 128 | 4 946 | 1 638 | 0.84 | 2.52 |
| headers_get_hit | 303 | 95 | 37 | 3.20 | 8.24 |
| headers_get_miss | 257 | 96 | 49 | 2.69 | 5.25 |
| headers_set_fresh | 179 | 271 | 337 | 0.66 | 0.53 |
| headers_append_fresh | 231 | 455 | 409 | 0.51 | 0.56 |
| headers_iter (for…of over 12 entries) | 28 996 | 398 | 1 991 | 72.9 | 14.6 |
| headers_has_hit | 141 | 98 | 30 | 1.45 | 4.67 |
| headers_has_miss | 245 | 93 | 47 | 2.63 | 5.24 |

Request / Response

| op | Deno (ns/op) | Node (ns/op) | Bun (ns/op) |
| --- | --- | --- | --- |
| response_construct_string | 127 | 7 632 | 5 |
| response_construct_u8 | 1 069 | 10 480 | 256 |
| response_construct_with_headers ({headers:{...}}) | 569 | 8 519 | 645 |
| response_construct_reused_headers ({headers: hdr}) | 1 385 | 8 539 | 441 |
| request_construct | 3 640 | 11 859 | 1 333 |
| response_text_smallish | 1 626 | 10 930 | 735 |
| response_json_smallish | 3 938 | 12 592 | 1 744 |

Reads:

  • Deno is 6–60× faster than Node on Response construction and ~3.3× faster
    on request_construct (Node's Response is undici's pure-JS implementation;
    the comparison is asymmetric).
  • Against Bun, Deno is 3–25× slower on three of the four Response
    construction paths. response_construct_with_headers is the exception:
    Deno is roughly on par, and slightly ahead (569 ns/op vs Bun 645 ns/op).
  • The Bun ↔ Deno gap on response_construct_reused_headers (3.1×) is
    surprising: passing an existing Headers instance is slower in Deno
    (1 385 ns/op) than passing an object literal (569 ns/op). See H4 below.
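
For readers who want to check the ranges quoted in these reads, the ratios can be recomputed directly from the construction rows of the table above. This is plain arithmetic on the published ns/op figures; the variable names are illustrative.

```typescript
// ns/op figures copied from the Request / Response table above.
const nsPerOp: Record<string, { deno: number; node: number; bun: number }> = {
  response_construct_string: { deno: 127, node: 7632, bun: 5 },
  response_construct_u8: { deno: 1069, node: 10480, bun: 256 },
  response_construct_with_headers: { deno: 569, node: 8519, bun: 645 },
  response_construct_reused_headers: { deno: 1385, node: 8539, bun: 441 },
  request_construct: { deno: 3640, node: 11859, bun: 1333 },
};

// Node/Deno > 1 means Deno is faster than Node;
// Deno/Bun > 1 means Deno is slower than Bun.
const ratios = Object.keys(nsPerOp).map((op) => {
  const t = nsPerOp[op];
  return { op, nodeOverDeno: t.node / t.deno, denoOverBun: t.deno / t.bun };
});

for (const r of ratios) {
  console.log(
    `${r.op}: Node/Deno ${r.nodeOverDeno.toFixed(1)}x, Deno/Bun ${r.denoOverBun.toFixed(2)}x`,
  );
}
```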

Flamegraph attribution (V8 --prof)

Full profiles are committed under tools/perf_research/fetch/profiles/.

Headers microbench top JS frames (34 551 ticks)

   ticks  total  nonlib   name
   3068    8.9%   12.4%   Builtin: StringToLowerCaseIntl          // byteLowerCase
   3047    8.8%   12.3%   Builtin: ArrayTimSort                   // _iterableHeaders sort
   1913    5.5%    7.7%   JS: *<anonymous> ext:deno_fetch/20_headers.js:272:7   // sort cb
   1808    5.2%    7.3%   Builtin: StringLessThan                 // inside the sort
   1581    4.6%    6.4%   Builtin: KeyedStoreIC_Megamorphic       // entries[i][...] writes
   1356    3.9%    5.5%   Builtin: KeyedLoadIC_Megamorphic        // entries[i][...] reads
    962    2.8%    3.9%   JS: *<anonymous> ext:deno_webidl/00_webidl.js:1101:10 // record<ByteString,ByteString>
    434    1.3%    1.8%   JS: *<anonymous> ext:deno_webidl/00_webidl.js:932:19  // isByteString

Bottom-up shows ~27 % of nonlib ticks land inside the ArrayPrototypeSort
call that lives in the Headers[_iterableHeaders] getter. The next ~12 %
lands in StringToLowerCaseIntl, attributable to byteLowerCase calls in
both the iter rebuild and the per-op linear scans of getHeader/has/set.

Server GET /hello top JS frames (2 619 ticks; Deno binary = 52 %, JS = 43 %)

     98    3.7%    8.6%   JS: *<anonymous> ext:deno_webidl/00_webidl.js:1101:10 // record<ByteString,ByteString>
     57    2.2%    5.0%   Builtin: KeyedStoreIC
     48    1.8%    4.2%   Builtin: LoadIC
     45    1.7%    4.0%   Builtin: CreateTypedArray              // new Uint8Array(...)
     41    1.6%    3.6%   Builtin: CreateShallowObjectLiteral    // newInnerResponse(), etc.
     34    1.3%    3.0%   Builtin: TypedArrayPrototypeSlice      // extractBody's copy
     11    0.4%    1.0%   JS: *get ext:deno_fetch/22_body.js:290:10  // bodyUsed getter

The leading JS hotspot under live HTTP load is the record<ByteString, ByteString> webidl converter — it runs on every
new Response(bytes, { headers: { "content-type": "…" } }).

Where the cost lives

| Finding | File:line | Notes |
| --- | --- | --- |
| Headers iter rebuilds + sorts every for-of (sort dominates the prof) | ext/fetch/20_headers.js:227-284 | _iterableHeaders getter only caches when guard === "immutable". Server Request.headers is "immutable" (ext/http/00_serve.ts:561), but most user-built Headers and Response.headers are not. |
| Headers .get/.has/.set/.delete/.append linear-scan + byteLowerCase per entry | ext/fetch/20_headers.js:166-176, :381, :405, :321, :122 | Storage is [[name, value], …] with case preserved. Every op does byteLowerCase(query) + byteLowerCase(entry[0]) per entry. |
| WebIDL ByteString converter runs isByteString (char-by-char loop) on every arg | ext/webidl/00_webidl.js:426-447 | Adds O(name.length) per headers.get("name") etc.; also runs for every kv pair in new Response(_, {headers: obj}). |
| extractBody slices ArrayBufferView / ArrayBuffer source on every new Response(bytes) | ext/fetch/22_body.js:437-459 | TypedArrayPrototypeSlice(object) makes a defensive copy even when the caller is going to discard the original. ~1 MB copy on /bigbody. |
| newInnerResponse allocates a 9-property object literal with closures on every Response | ext/fetch/23_response.js:136-150 | CreateShallowObjectLiteral shows up as 3.6 % of JS time under server load. |
| initializeAResponse does a linear scan to check for an existing Content-Type entry | ext/fetch/23_response.js:218-230 | Pays the case-mismatched linear scan even when the headers were just built fresh and are known not to contain Content-Type. |
| InnerRequest.headerList allocates N 2-tuples by ArrayPrototypePush per request | ext/http/00_serve.ts:405-415 | Op returns flat [k,v,k,v,…]; JS re-pairs into nested arrays before exposing as Headers. |
| op_fetch re-parses Vec<(ByteString, ByteString)> headers to HeaderName/HeaderValue per send | ext/fetch/lib.rs:513-520 | JS already iterates Headers to flatten before the op call; Rust then iterates again. |

Ranked architectural hypotheses

Each entry: impact estimate × confidence based on profile attribution, the
ratio gap, and how commonly the code path appears in real-world fetch use.
"Single-digit-percent" wins are deliberately omitted per the prompt's scope.

H1 — Headers storage as [[name, value], …] forces every op into an O(n) scan with two byteLowerCase calls per entry (HIGH × HIGH)

  • Evidence. headers_get_hit is 3.2× slower than Node and 8.2× slower than
    Bun; headers_has_miss is 2.6× slower than Node and 5.2× slower than Bun.
    Bottom-up: 12.4 % nonlib in StringToLowerCaseIntl, attributable almost
    entirely to the byteLowerCase calls in getHeader, has, set, delete,
    appendHeader.
  • Architectural root. Headers stores entries case-preserved and
    un-indexed (ext/fetch/20_headers.js:223). Every op normalizes both the
    query and each entry's name on every call. Bun stores Headers in a
    hashmap with lowercase keys; undici stores entries with their lowercase
    index pre-computed.
  • Estimated impact if fixed. Per-op cost dominated by the scan + the two
    byteLowerCase calls; a hashmap or pre-lowercased index keyed by interned
    lowercase names would cut .get/.has to constant time and drop the
    StringToLowerCaseIntl ticks to near zero. For the server hot path
    (Deno.serve handlers reading req.headers.get("x")), this is the leading
    per-request JS cost after webidl conversion. Conservative: 2-3× on
    Headers-heavy handlers.
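
To make the "pre-lowercased index" idea concrete, here is a minimal sketch of a header list that normalizes names once at write time and looks them up through a Map. `IndexedHeaderList` is a hypothetical name, not Deno's, Bun's or undici's code. The sketch also drops the original casing, which a real implementation would need to keep alongside the index if case preservation matters.

```typescript
// Hypothetical sketch: lowercase-keyed index instead of an O(n) scan.
class IndexedHeaderList {
  // lowercase name -> values in insertion order
  private index = new Map<string, string[]>();

  append(name: string, value: string): void {
    const key = name.toLowerCase(); // normalize once, on write
    const values = this.index.get(key);
    if (values) values.push(value);
    else this.index.set(key, [value]);
  }

  set(name: string, value: string): void {
    this.index.set(name.toLowerCase(), [value]);
  }

  // Only the query is lowercased on read; stored names are already normalized.
  get(name: string): string | null {
    const values = this.index.get(name.toLowerCase());
    return values ? values.join(", ") : null;
  }

  has(name: string): boolean {
    return this.index.has(name.toLowerCase());
  }

  delete(name: string): void {
    this.index.delete(name.toLowerCase());
  }
}

const list = new IndexedHeaderList();
list.append("Content-Type", "text/plain");
list.append("Accept", "text/html");
list.append("accept", "application/json");
console.log(list.get("ACCEPT"), list.has("content-type"));
```

With this layout a lookup costs one lowercase of the query plus one Map probe, independent of how many headers are stored.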

H2 — _iterableHeaders rebuilds and sorts the iteration view on every for…of for non-immutable Headers (HIGH × HIGH)

  • Evidence. headers_iter is 73× slower than Node and 15× slower than Bun.
    Bottom-up (share of total ticks): 8.8 % in ArrayTimSort, 5.5 % in the
    sort callback at 20_headers.js:272, 5.2 % in StringLessThan, 4.6 % in
    KeyedStoreIC_Megamorphic building the entries array.
  • Architectural root. The cache at ext/fetch/20_headers.js:227-284 only
    triggers when guard === "immutable". Response.headers is "response",
    user-built new Headers(...) is "none", and any mutable Request created
    in JS is "request"; all of these fall through and rebuild + sort on
    every for…of and on every internal headersEntries(...) call (which
    fans out into JSON.stringify(req.headers) in user code, async-iterator
    protocol of Response headers, OpenTelemetry propagator extraction at
    ext/http/00_serve.ts:667-682, etc.).
  • Estimated impact if fixed. Invalidating the cache only on mutations
    (append/set/delete) and reusing it across reads would land the
    per-iteration cost in the sub-microsecond range Node and Bun achieve.
    Removes the ArrayTimSort + StringLessThan + KeyedStoreIC entries from the
    prof entirely. Plausible 10–50× on the iter microbench; impact on
    end-to-end server rps is smaller because Deno's HTTP path uses internal
    flat-array access (inner.headerList), but every userland iteration on a
    response-builder pattern (for (const [k, v] of someHeaders)) is on this
    path.
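
The invalidate-on-mutation scheme described above can be sketched as follows. `CachedIterableHeaders` and its `sortCount` counter are hypothetical and exist only to show when the sort actually runs. The sketch also skips the combining of duplicate names that the Fetch "sort and combine" step performs.

```typescript
// Hypothetical sketch: cache the sorted iteration view, drop it on mutation.
class CachedIterableHeaders {
  private entries: [string, string][] = [];
  private sortedCache: [string, string][] | null = null;
  sortCount = 0; // instrumentation for this example only

  append(name: string, value: string): void {
    this.entries.push([name.toLowerCase(), value]);
    this.sortedCache = null; // any mutation invalidates the view
  }

  delete(name: string): void {
    const key = name.toLowerCase();
    this.entries = this.entries.filter((e) => e[0] !== key);
    this.sortedCache = null;
  }

  private sortedView(): [string, string][] {
    if (this.sortedCache === null) {
      this.sortCount++;
      // Copy, then sort by name; Array.prototype.sort is stable, so
      // duplicate names keep their insertion order.
      this.sortedCache = this.entries
        .slice()
        .sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));
    }
    return this.sortedCache;
  }

  [Symbol.iterator](): Iterator<[string, string]> {
    return this.sortedView()[Symbol.iterator]();
  }
}

const h = new CachedIterableHeaders();
h.append("B-Header", "2");
h.append("a-header", "1");
const firstPass = Array.from(h);
const secondPass = Array.from(h); // served from the cache, no second sort
console.log(firstPass, h.sortCount);
```

Repeated reads between mutations pay for the sort once, which is the behaviour the immutable guard already gets today according to the report.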

H3 — extractBody always copies the caller's Uint8Array / ArrayBuffer (HIGH × MEDIUM)

  • Evidence. /bigbody (1 MB GET): Node 2.4 k rps vs Deno 1.2 k rps, a
    1.95× gap, even though Deno wins on /hello and /echo and trails by only
    ~10 % on /headers. The server prof (taken on GET /hello) shows 3.0 %
    nonlib in TypedArrayPrototypeSlice and 4.0 % in CreateTypedArray.
  • Architectural root. ext/fetch/22_body.js:437-459 ends each
    ArrayBufferView/ArrayBuffer branch with
    source = TypedArrayPrototypeSlice(object). This is a defensive copy: the
    bytes are written back through op_http_set_response_body_bytes, but the
    slice happens unconditionally, even when the caller is a Deno.serve
    handler that allocated the buffer inline and immediately drops it.
  • Estimated impact if fixed. For 1 MB responses, eliminating one full
    copy removes a per-response cost proportional to body size; if that copy
    dominates per-response time, it would account for most of the ~2× gap
    observed against Node. For typical (<8 KB) responses the cost is in the
    noise. This targets the /bigbody gap against Node, not the /hello gap
    against Bun.
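
The difference between a defensive copy and an ownership transfer can be seen in a few lines. The sketch below uses the standard `ArrayBuffer.prototype.transfer` (the alternative evaluated in a later commit on this PR) purely to illustrate the semantics. It is not a proposal for how extractBody should change, and whether the engine can move the backing store without copying is implementation-dependent.

```typescript
const original = new Uint8Array(1024 * 1024).fill(7);

// Defensive copy: allocates a second 1 MB buffer and copies every byte.
// The caller's array stays fully usable afterwards.
const copied = original.slice();

// Ownership transfer: the bytes move to a new ArrayBuffer and the old one is
// detached. The cast only avoids depending on ES2024 lib typings being set up.
const source = original.buffer as ArrayBuffer & {
  transfer(newLength?: number): ArrayBuffer;
};
const moved = new Uint8Array(source.transfer());

// After the transfer, the caller's view is empty: it no longer owns the bytes.
console.log(copied.byteLength, moved.byteLength, original.byteLength);
```

The observable side effect (the caller's array becomes zero-length) is why a silent switch from copy to transfer would change behaviour for code that reuses the buffer after constructing the Response.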

H4 — record<ByteString, ByteString> webidl converter is the leading JS-side cost of every new Response(_, { headers: {...} }) and new Headers({...}) (MEDIUM × HIGH)

  • Evidence. Server GET /hello prof: 8.6 % nonlib in
    ext/webidl/00_webidl.js:1101 (createRecordConverter). This converter
    runs every time a handler returns new Response("…", { headers: { "content-type": "…" } }).
  • Architectural root. Each kv pair pays
    webidl.converters.ByteString(key) (a DOMString round-trip plus the
    isByteString char loop), then the same for the value, then appendHeader,
    which itself goes through the H1 path. The {headers: hdr} form (passing
    an existing Headers instance) is 3.14× slower than Bun in the microbench
    (response_construct_reused_headers: 1 385 ns/op vs Bun 441 ns/op),
    because it serializes and then deserializes through the same path
    instead of using a shared internal pointer.
  • Estimated impact if fixed. A typed fast path for
    {headers: {"content-type": "<known-token>"}} (the literal-string init
    pattern produced by virtually every handler) and a true zero-copy reuse
    for {headers: existingHeadersInstance} could remove an estimated 5-8 %
    of JS-side (nonlib) CPU from the server hot path, in line with the 8.6 %
    nonlib share measured above. Smaller per call than H1/H2, but pervasive.
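
To show where the per-pair cost comes from, here is a simplified sketch of a ByteString check and a record conversion built on it. The function names mirror the ones mentioned in the report, but the bodies are assumptions written for illustration and are not copied from ext/webidl/00_webidl.js.

```typescript
// Illustrative only: a ByteString is a string whose code units all fit in a byte.
function isByteString(input: string): boolean {
  for (let i = 0; i < input.length; i++) {
    if (input.charCodeAt(i) > 255) return false;
  }
  return true;
}

function toByteString(value: unknown): string {
  const s = String(value); // stand-in for the DOMString conversion step
  if (!isByteString(s)) {
    throw new TypeError("Value is not a valid ByteString");
  }
  return s;
}

// Every key and every value goes through the char-by-char loop, so the cost
// is proportional to the total number of characters in the init object.
function convertRecord(init: Record<string, unknown>): [string, string][] {
  const out: [string, string][] = [];
  for (const key of Object.keys(init)) {
    out.push([toByteString(key), toByteString(init[key])]);
  }
  return out;
}

const pairs = convertRecord({ "content-type": "application/json" });
console.log(pairs);
```

A fast path for known-safe literals would skip both loops for the common `content-type` case, which is the kind of shortcut the estimate above has in mind.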

H5 — InnerRequest.headerList re-pairs the op's flat [k,v,k,v,…] into N nested [name, value] tuples per request (MEDIUM × MEDIUM)

  • Evidence. Server prof: 5.0 % KeyedStoreIC + 4.2 % LoadIC are not
    individually huge, but ext/http/00_serve.ts:405-415 runs once per
    request that calls req.headers. With N=10 typical request headers
    that's 10 fresh 2-element arrays + the Headers wrapper + the
    case-preserving entries machinery downstream.
  • Architectural root. The Rust op (ext/http/http_next.rs:466-533) returns
    a flat string array specifically to be cheap on the V8 side; the JS layer
    then pairs them up for _headerList storage, defeating the flat-array
    efficiency. A headerList that internally remained a flat array (or that
    was lazily indexed) would avoid the per-request N allocations.
  • Estimated impact if fixed. Per request: removes N (≈10) tuple
    allocations + GC pressure. Compounds with H1 because every subsequent
    headers.get() then walks the nested representation.
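
The two representations can be compared side by side. In the sketch below, `repair` shows the N tuple allocations that re-pairing implies, and `getFromFlat` reads straight from the flat `[k, v, k, v, …]` array. Both functions are hypothetical, and the lookup assumes names arrive already lowercased, which the report does not state.

```typescript
// flat = [name0, value0, name1, value1, ...]

// Re-pairing: one fresh 2-element array per header, on every request.
function repair(flat: readonly string[]): [string, string][] {
  const pairs: [string, string][] = [];
  for (let i = 0; i < flat.length; i += 2) {
    pairs.push([flat[i], flat[i + 1]]);
  }
  return pairs;
}

// Reading in place: no per-header allocation; steps two slots at a time.
// Assumption: names in `flat` are already lowercase.
function getFromFlat(flat: readonly string[], lowerName: string): string | null {
  let result: string | null = null;
  for (let i = 0; i < flat.length; i += 2) {
    if (flat[i] === lowerName) {
      // Combine repeated headers the way Headers.get() does.
      result = result === null ? flat[i + 1] : result + ", " + flat[i + 1];
    }
  }
  return result;
}

const flat = ["host", "example.com", "accept", "text/html", "accept", "*/*"];
console.log(repair(flat).length, getFromFlat(flat, "accept"));
```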

H6 — newInnerResponse allocates a 9-property object literal with two closures on every response, even for the fast-path bytes route (LOW × MEDIUM)

  • Evidence. Server prof: 3.6 % nonlib in CreateShallowObjectLiteral.
    Smaller than H1-H4 but recurs at every response.
  • Architectural root. ext/fetch/23_response.js:136-150.
    The closures (url(), etc.) are needed for spec-level lazy URL list
    handling, but the fast path (server returning a fresh new Response(bytes))
    never touches them.
  • Estimated impact if fixed. Constant per response, on the order of a
    few percent of the JS-side cost. On its own this falls in the
    single-digit-percent range that the ranking otherwise omits, but it is
    worth mentioning because it is a clear missing fast path. Listed as the
    lowest-ranked finding.
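
The allocation pattern behind this finding can be illustrated generically: an object literal that carries arrow functions creates new function objects on every call, while methods on a class prototype are created once and shared. The shapes below are hypothetical and much smaller than the real 9-property inner response; `isRedirect` in particular is an invented second closure.

```typescript
// Per-instance closures: each call allocates the object plus two functions.
function makeWithClosures(status: number, urlList: string[]) {
  return {
    status,
    urlList,
    url: () => (urlList.length > 0 ? urlList[urlList.length - 1] : null),
    isRedirect: () => status >= 300 && status < 400, // hypothetical helper
  };
}

// Shared prototype methods: each instance allocates only its own fields.
class InnerResponseSketch {
  status: number;
  urlList: string[];

  constructor(status: number, urlList: string[]) {
    this.status = status;
    this.urlList = urlList;
  }

  url(): string | null {
    const n = this.urlList.length;
    return n > 0 ? this.urlList[n - 1] : null;
  }
}

const closureA = makeWithClosures(200, []);
const closureB = makeWithClosures(200, []);
const protoA = new InnerResponseSketch(200, ["https://example.com/"]);
const protoB = new InnerResponseSketch(200, []);

console.log(closureA.url === closureB.url, protoA.url === protoB.url);
```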

Other observations (not ranked — context only)

  • response_construct_string is 127 ns/op in Deno vs 5 ns/op in Bun. Bun
    has an intrinsic Response-from-string path; matching that would require
    changes at the V8/op boundary, not in JS. Worth investigating as part of
    the broader cross-cutting "small new Uint8Array(N) allocation" finding
    that the prior research tick traced to rusty-v8-147.4.0/.gn disabling
    on-heap TypedArrays for embedder pointer stability — that finding will
    live in its own cross-cutting writeup, but it is the underlying cause of
    several of the smaller gaps observed here (e.g. the bigbody ratio is
    partially driven by libc malloc/calloc time).
  • The fetch client URL is parsed twice: once by the JS-side new URL() and
    again at ext/fetch/lib.rs:436. Not a fetch-PR optimization; it
    belongs in perf-research/url and is mentioned here only to note that
    the architectural pattern of "JS parses, op re-parses" recurs across
    multiple fetch surfaces (headers in H5 and in the op_fetch row of the
    table above, URL here).

Reproduction

See tools/perf_research/fetch/README.md.

cargo build --release --bin deno
cd tools/perf_research/fetch
bash run_micro.sh
bash run_servers.sh 10 64
bash run_v8_prof.sh
node analyze.js

The microbench script reports ns/op; run_servers.sh writes
results.jsonl; run_v8_prof.sh writes profiles/*.prof.txt. Pinned
versions live in versions.txt.

crowlbot added 3 commits May 14, 2026 14:05
Cross-runtime HTTP servers, fetch client driver, and Headers/Request/Response
microbenches for comparing Deno against Node 22 LTS and Bun 1.3.
- results.jsonl: wrk runs of hello/headers/echo/bigbody on Deno/Node/Bun
- micro_results.jsonl: Headers + Request/Response per-op timings
- profiles/*.prof.txt + *.v8.log.gz: V8 --prof attribution for the
  headers and request/response microbenches, plus a server-mode prof
  sampled at ~40k rps on GET /hello.
- run_servers.sh: clean lingering listeners, fix POST body in wrk lua,
  parse Latency Distribution p99 correctly.
- versions.txt: deno 2.7.14, node v22.22.2, bun 1.3.14.
… measurements

Adds a quantitative comparison of TypedArrayPrototypeSlice (current
extractBody defensive memcpy) against ArrayBuffer.prototype.transfer
(proposed architectural replacement) on this host's Deno 2.8.0
release build, and a landing-path analysis for upstream.

Key numbers (12 runs × 10k iters, median):
- 1 MB: slice 559 us vs transfer 208 us — 2.69x speedup
- 4 MB: slice 2.37 ms vs transfer 826 us — 2.87x speedup

The ~350 us saved per 1 MB response is enough to close the parent
report's /bigbody Deno-vs-Node rps gap (1246 -> ~2200 theoretical,
within ~10 percent of Node 22).
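
The projection above can be reproduced with a short calculation. It assumes responses are processed serially, so per-response time is 1/rps, and that the full slice-versus-transfer saving carries over to the server path. Both are simplifying assumptions, not measurements.

```typescript
// Figures from the parent report and the key numbers above.
const denoRps = 1246;
const nodeRps = 2435;
const sliceUs = 559;
const transferUs = 208;

const perResponseUs = 1e6 / denoRps;   // time per 1 MB response today
const savedUs = sliceUs - transferUs;  // saving if the copy becomes a transfer
const projectedRps = 1e6 / (perResponseUs - savedUs);
const shareOfNode = projectedRps / nodeRps;

console.log(
  `per response: ${perResponseUs.toFixed(0)} us, saved: ${savedUs} us, ` +
    `projected: ${projectedRps.toFixed(0)} rps (${(shareOfNode * 100).toFixed(0)} % of Node)`,
);
```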

Recommended landing: opt-in ResponseInit.transfer flag, ~100-200 LoC,
spec-compatible (additive), no behaviour change for code that does
not opt in. Three other landing options ranked but rejected (Deno.serve
auto-detection needs escape analysis; copy-in-Rust is same memcpy
count; silent transfer is a breaking change).