perf research: streams #4

Draft
crowlbot wants to merge 2 commits into main from perf-research/streams

@crowlbot

TL;DR

No high-impact architectural slowdown found in the streams machinery. Across the macro and micro benchmarks, Deno's ReadableStream/WritableStream/TransformStream either beat Node 22 LTS, Node 23, and Bun, or trail Bun by at most ~1.5× on small-construction microbenches.

This PR exists to record the negative result so a future session does not re-investigate.

Headline ratios

16 MB body through pipeThrough(TransformStream uppercase) → pipeTo(sink), 20 iters

Runtime        MB/s    Ratio vs Deno
Deno 2.7.14    440.7   1.00×
Bun 1.1.43     296.8   Deno 1.48× faster
Node 23.7.0    243.2   Deno 1.81× faster
Node 22.13.1   221.3   Deno 1.99× faster
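
For readers who want the shape of the macro pipeline without opening micro/streams_macro.js, a minimal sketch follows. It is a reconstruction under assumptions, not the bench file itself: the 64 KiB chunk size, the helper names (makeSource, makeUppercase, runMacro), and the reading of "MB" as MiB are all hypothetical.

```javascript
// Hypothetical sketch of the macro pipeline: source -> uppercase transform -> sink.
// CHUNK_SIZE and all helper names are assumptions; the PR does not state them.
const CHUNK_SIZE = 64 * 1024;

// Emits `totalBytes` of lowercase 'a' bytes in CHUNK_SIZE pieces.
function makeSource(totalBytes) {
  let sent = 0;
  return new ReadableStream({
    pull(controller) {
      if (sent >= totalBytes) {
        controller.close();
        return;
      }
      const n = Math.min(CHUNK_SIZE, totalBytes - sent);
      controller.enqueue(new Uint8Array(n).fill(0x61));
      sent += n;
    },
  });
}

// Per-byte ASCII uppercase into a freshly allocated Uint8Array per chunk.
function makeUppercase() {
  return new TransformStream({
    transform(chunk, controller) {
      const out = new Uint8Array(chunk.length);
      for (let i = 0; i < chunk.length; i++) {
        const b = chunk[i];
        out[i] = b >= 0x61 && b <= 0x7a ? b - 0x20 : b;
      }
      controller.enqueue(out);
    },
  });
}

// Runs the pipeline `iters` times and reports throughput in MiB/s.
async function runMacro({ totalBytes = 16 * 1024 * 1024, iters = 20 } = {}) {
  let received = 0;
  let lastByte = 0;
  const start = performance.now();
  for (let i = 0; i < iters; i++) {
    const sink = new WritableStream({
      write(chunk) {
        received += chunk.length;
        lastByte = chunk[chunk.length - 1];
      },
    });
    await makeSource(totalBytes).pipeThrough(makeUppercase()).pipeTo(sink);
  }
  const seconds = (performance.now() - start) / 1000;
  const mbPerSec = (totalBytes * iters) / (1024 * 1024) / seconds;
  return { received, lastByte, mbPerSec };
}

// Quick smoke run with a small body; the table above used 16 MB and 20 iters.
runMacro({ totalBytes: 1024 * 1024, iters: 2 }).then((r) =>
  console.log(`${r.mbPerSec.toFixed(1)} MB/s over ${r.received} bytes`)
);
```

The sketch uses only web-standard globals, so the same file should run unchanged on Deno, Node, and Bun.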

Microbench excerpts (ns/op, lower is better)

bench                            Deno        Node 22     Bun
ts_construct                     7,277       23,629      9,492
rs_read_256x4k                   173,641     328,503     155,234
rs_pipethrough_identity_256x4k   1,096,835   1,967,810   1,480,243
rs_pipeto_sink_256x4k            743,680     1,169,643   769,557
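
As a rough illustration of how an ns/op figure such as rs_read_256x4k could be produced, here is a minimal sketch. It assumes the bench name means "read 256 chunks of 4 KiB" and that one op is a full drain of the stream; both readings, and the helper names (makeChunkedStream, readAll, benchNsPerOp), are assumptions rather than the contents of micro/streams_micro.js.

```javascript
// Hypothetical microbench sketch; names and the meaning of "256x4k" are assumptions.

// Builds a ReadableStream that yields `count` chunks of `size` bytes, then closes.
function makeChunkedStream(count, size) {
  let i = 0;
  return new ReadableStream({
    pull(controller) {
      if (i++ < count) controller.enqueue(new Uint8Array(size));
      else controller.close();
    },
  });
}

// Drains a stream through a default reader and returns the total byte count.
async function readAll(stream) {
  const reader = stream.getReader();
  let bytes = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) return bytes;
    bytes += value.byteLength;
  }
}

// Times `op` over `iters` runs after a short warm-up; returns mean ns per op.
async function benchNsPerOp(op, iters) {
  for (let i = 0; i < Math.min(iters, 10); i++) await op();
  const start = performance.now();
  for (let i = 0; i < iters; i++) await op();
  return ((performance.now() - start) * 1e6) / iters; // ms -> ns
}

benchNsPerOp(() => readAll(makeChunkedStream(256, 4096)), 200).then((ns) =>
  console.log(`rs_read_256x4k (sketch): ${Math.round(ns)} ns/op`)
);
```

Numbers from this sketch are not comparable to the table above, since the real harness, iteration counts, and warm-up policy may differ.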

Full results are in tools/perf_research/streams/profiles/streams_*.json. See tools/perf_research/streams/README.md for the full report (hypotheses considered and ruled out, V8 prof attribution).

Where the time goes

V8 prof on the macro bench (tools/perf_research/streams/profiles/streams_macro.prof.txt):

  • ~50 % of total ticks land in the user transform function (per-byte uppercase loop + new Uint8Array(N)).
  • ~24 % in shared libraries (libc malloc / deno binary — same allocation tail).
  • ~5 % of nonlib ticks in the streams machinery itself (writeAlgorithm, chunkSteps, transformAlgorithm).

The user-allocation tail (new Uint8Array(N) hitting libc malloc) is the same cross-cutting cost already documented in PR #1 (fetch, H3) and PR #3 (text-encoding, H1). Root cause is v8_typed_array_max_size_in_heap = 0 in rusty-v8 — out of streams scope.

Hypotheses considered and ruled out

#    Hypothesis                                         Verdict
H1   TransformStream per-chunk promise chain too deep   Rejected: 1.81× to 1.99× faster than Node on macro
H2   pipeThrough copies chunks at the boundary          Rejected: identity pipeThrough is 1.79× faster than Node
H3   BYOB read path is slow                             Unranked: bench denoland#8 hangs on Deno and Bun (correctness, not perf)
H4   tee'd reads scale badly                            Rejected (informally): macro pipeline wins
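
H2 was rejected on throughput, but the same question can also be probed directly. The sketch below is an illustration, not part of the bench suite: it pushes one chunk through an identity TransformStream and checks whether the object that comes out is the object that went in. The function name identityPreservesChunk is hypothetical, and the check only covers reference identity for a default (non-byte) stream; it says nothing about copies made inside a runtime's native layers.

```javascript
// Hypothetical probe for H2: does an identity pipeThrough hand back the same chunk object?
async function identityPreservesChunk() {
  const chunk = new Uint8Array(4096);
  const source = new ReadableStream({
    start(controller) {
      controller.enqueue(chunk);
      controller.close();
    },
  });
  // new TransformStream() with no transformer forwards each chunk as-is.
  const reader = source.pipeThrough(new TransformStream()).getReader();
  const { value } = await reader.read();
  // Same object and same backing buffer means no copy was observable from JS.
  return value === chunk && value.buffer === chunk.buffer;
}

identityPreservesChunk().then((same) =>
  console.log("identity pipeThrough preserved chunk object:", same)
);
```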

What's not here

  • Native flamegraph attribution. kernel.perf_event_paranoid = 4 on the bench host and sudo is unavailable, so perf / samply are blocked. JS attribution via V8 --prof is what's in profiles/streams_macro.prof.txt. The same constraint applied to PR #1 (perf research: fetch) and PR #3 (perf research: text-encoding) on this fork.
  • Upstream PR. This surface produced no fix to send upstream; there is nothing landable.

Layout

tools/perf_research/streams/
  README.md                                       full report
  micro/streams_micro.js                          10 ops
  micro/streams_macro.js                          16 MB pipeThrough macro
  profiles/streams_{deno,node22,node23,bun}.json  raw bench output per runtime
  profiles/streams_macro.prof.txt                 V8 --prof process output
  profiles/versions.txt                           runtime versions + host caps

crowlbot added 2 commits May 14, 2026 14:52
Microbench covers ReadableStream construct/read, TransformStream identity
+ copy pipeThrough, WritableStream pipeTo sink, BYOB read, async iter,
and tee. Benches were not run before the session was transferred — the
next worker should run the micro across deno/node/bun, capture a V8 prof,
and write up the report per the pattern used in perf-research/fetch (PR 1),
perf-research/url (PR 2), and perf-research/text-encoding (PR 3) on the
crowlbot/deno fork.

16 MB pipeThrough macro: Deno 440.7 MB/s vs Bun 296.8 / Node 22.13 221.3 /
Node 23.7 243.2. Microbench: Deno is faster than Node on every bench that
completed, within 1.5x of Bun on construction microbenches and ahead of
both Node and Bun on pipeThrough/pipeTo.

V8 prof on the macro bench attributes ~50 percent of ticks to user
transform code (per-byte loop + new Uint8Array(N)), ~24 percent to
shared libraries (libc malloc / deno binary, mostly the same allocation
tail), and only ~5 percent of nonlib ticks to streams machinery itself.
No high-impact streams architectural slowdown to attack.