perf research: text-encoding #3

Draft
crowlbot wants to merge 1 commit into main from perf-research/text-encoding

perf research: text-encoding

Macro performance research on Deno's implementation of TextEncoder and
TextDecoder (including encodeInto and stream-mode decoding).

This PR contains only benchmark scripts and committed V8 prof artifacts — no
production code changes. The report below is the deliverable.

Methodology

  • Ratios over absolute numbers. Host is Docker on Proxmox; absolute
    ns/op is unreliable, so the headline is same-host ratios vs Node 22 LTS
    and Bun.
  • Flamegraph attribution. V8 --prof in-process. perf / samply
    need kernel.perf_event_paranoid<=1 but the container is locked at
    3 and sysctl is denied. JS-side attribution is comprehensive;
    native-side time is bucketed under "deno binary" and "libc.so.6".
  • Microbench mix. 14 ops covering encode (tiny/small/medium ASCII +
    UTF-8 mixed), encodeInto, decode at four ASCII sizes (tiny/small/medium/
    large 1 MB) + UTF-8 mixed, stream-mode decode (5 chunks), and
    encoder/decoder construction.

Pinned: deno 2.7.14 (v8 14.7.173.20-rusty), node v22.22.2, bun 1.3.14.
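
As an illustration of the "ratios over absolute numbers" approach, here is a minimal sketch of a timing loop and a ratio helper. It is not the committed micro/text_encoding_micro.js; the names nsPerOp and ratio, the iteration counts, and the warmup length are assumptions made for this example. It relies only on the performance.now() global, which Deno, Node, and Bun all provide.

```javascript
// Hypothetical harness for illustration; not the script committed in this PR.
let sink = 0; // keeps the measured work observable so it is not optimized away

function nsPerOp(fn, iterations = 100_000, warmup = 10_000) {
  // Warm up so the JIT has a chance to optimize before timing starts.
  for (let i = 0; i < warmup; i++) sink += fn().length;
  const start = performance.now();
  for (let i = 0; i < iterations; i++) sink += fn().length;
  const elapsedMs = performance.now() - start;
  return (elapsedMs * 1e6) / iterations; // milliseconds -> nanoseconds per call
}

// Same-host ratio rounded to two decimals, as in the headline table.
function ratio(a, b) {
  return Math.round((a / b) * 100) / 100;
}

const encoder = new TextEncoder();
const sample = nsPerOp(() => encoder.encode("hello world!"));
console.log(`encode(12 B): ${sample.toFixed(1)} ns/op on this host`);

// Ratio for the 30 B encode row, using the Deno and Node figures reported below.
console.log(ratio(1512, 970));
```

Running the same file under each runtime on one host and dividing the results gives ratios that stay comparable even when the absolute ns/op drifts.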

Headline ratios — microbench (ns/op; lower is better)

| op | Deno | Node | Bun | Deno/Node | Deno/Bun |
| --- | ---: | ---: | ---: | ---: | ---: |
| encode("hello world!") (12 B) | 1 324 | 1 082 | 65 | 1.22 | 20.21 |
| encode("Content-Type: application/json") (30 B) | 1 512 | 970 | 44 | 1.56 | 34.36 |
| encode("x".repeat(1000)) (1 KB) | 2 335 | 2 132 | 386 | 1.10 | 6.05 |
| encode(utf8mixed) (~50 B mixed) | 999 | 1 078 | 62 | 0.93 | 16.01 |
| encodeInto(small, dest) | 40 | 53 | 42 | 0.76 | 0.96 |
| decode(12 B ASCII) | 118 | 77 | 69 | 1.54 | 1.72 |
| decode(30 B ASCII) | 115 | 76 | 85 | 1.51 | 1.36 |
| decode(1 KB ASCII) | 190 | 193 | 669 | 0.99 | 0.28 |
| decode(1 MB ASCII) | 93 206 | 610 261 | 117 139 | 0.15 | 0.80 |
| decode(utf8mixed) (~50 B) | 286 | 217 | 205 | 1.32 | 1.40 |
| decode(utf8medium mixed) (~1 KB) | 3 300 | 3 029 | 1 316 | 1.09 | 2.51 |
| decode (5 stream:true chunks of 30 B) | 2 551 | 1 241 | 651 | 2.06 | 3.92 |
| new TextEncoder() | 6.3 | 6.2 | 5.0 | 1.02 | 1.27 |
| new TextDecoder() | 187 | 75 | 6.6 | 2.49 | 28.32 |

Reads:

  • Deno leads on decode(1 MB ASCII): 93 µs vs Node's 610 µs (6.5× faster)
    and Bun's 117 µs, thanks to the simdutf::validate_ascii short
    circuit at ext/web/lib.rs:528-535 and the
    matching path in core's op_decode.
    This is one of the few wins documented here; flag for the team.
  • TextEncoder.encode("…") on small inputs is 20-34× slower than Bun.
    Bun has an intrinsic Response/encode path; Deno op-dispatches into
    op_encode which always allocates a fresh shared backing store for the
    output Uint8Array.
  • new TextDecoder() is 28× slower than Bun (187 ns vs 7 ns). Every
    construction op-dispatches through op_encoding_normalize_label, even for
    the default "utf-8" case; non-UTF-8 labels additionally call
    op_encoding_new_decoder to obtain a cppgc resource.
  • Stream-mode decode(chunk, { stream: true }) is 2× slower than Node,
    3.9× slower than Bun.
    Each chunk is an op call that returns a fresh
    U16String (UTF-16 vector) marshaled across the boundary.
  • Single decoder ops on small inputs are 1.5× Node: decode_tiny_ascii is
    118 ns vs 77 ns. The simdutf fast path lives on the ASCII branch, but
    there's still op-dispatch overhead per call.

Flamegraph attribution (V8 --prof)

Full profile:
tools/perf_research/text-encoding/profiles/text_encoding_micro.prof.txt
(raw log: text_encoding_micro.v8.log.gz).

Top of Statistical profiling result (1 740 ticks):

   ticks  total  nonlib   name
   1011   58.1%          /var/agent-loop/repo/target/release/deno
    426   24.5%          /usr/lib/x86_64-linux-gnu/libc.so.6

 [C++]:
   ticks  total  nonlib   name
     64    3.7%   21.3%   __libc_malloc@@GLIBC_2.2.5
     32    1.8%   10.7%   __libc_realloc@@GLIBC_2.2.5
     23    1.3%    7.7%   __lll_lock_wake_private@@GLIBC_PRIVATE
     14    0.8%    4.7%   __lll_lock_wait_private@@GLIBC_PRIVATE
      6    0.3%    2.0%   __libc_free@@GLIBC_2.2.5

 [JavaScript]:
   ticks  total  nonlib   name
     19    1.1%    6.3%   Builtin: LoadIC
      8    0.5%    2.7%   Builtin: CallApiCallbackOptimizedNoProfiling
      8    0.5%    2.7%   Builtin: Call_ReceiverIsNullOrUndefined
      ...

~32 % of nonlib ticks are in __libc_malloc + __libc_realloc. That is the
backing-store allocator path firing on every new Uint8Array(N) created
inside op_encode. JS-side cost is small (<10 % combined); essentially
all time is in the native encode path and its allocator.

The __lll_lock_wake_private / __lll_lock_wait_private entries (12.4 % of
nonlib ticks combined) indicate contention on the libc malloc arena lock:
small-buffer allocations going through a global heap lock.
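
The percentages quoted above can be re-derived from the tick counts in the profile excerpt. The sketch below does that arithmetic. The nonlib denominator of 300 ticks is an assumption: it is not printed in the excerpt and is inferred from 64 ticks being reported as 21.3 %. With that denominator, the two lock symbols come to about 12.3 %, so the 12.4 % figure in the text appears to be the sum of the two rounded rows (7.7 % + 4.7 %).

```javascript
// Tick counts copied from the profile excerpt above.
const ticks = { malloc: 64, realloc: 32, lockWake: 23, lockWait: 14, libc: 426 };
const TOTAL_TICKS = 1740;
// Assumption: inferred from "64 ticks = 21.3 % nonlib"; not printed in the excerpt.
const NONLIB_TICKS = 300;

// Percentage rounded to one decimal place.
const pct = (part, whole) => Math.round((part / whole) * 1000) / 10;

const allocShare = pct(ticks.malloc + ticks.realloc, NONLIB_TICKS);
const lockShare = pct(ticks.lockWake + ticks.lockWait, NONLIB_TICKS);
const libcShareOfTotal = pct(ticks.libc, TOTAL_TICKS);

console.log({ allocShare, lockShare, libcShareOfTotal });
```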

Where the cost lives

| Finding | File:line | Notes |
| --- | --- | --- |
| op_encode always allocates a fresh shared backing store via v8::ArrayBuffer::new_backing_store_from_vec | libs/core/ops_builtin_v8.rs:517-531 | Every TextEncoder.encode("…") (and every core.encode(…) call from fetch body/formdata/22_body.js) ends here. Backing store creation goes through V8's ArrayBuffer::Allocator, which in rusty-v8 cannot use the on-heap-TypedArray fast path (v8_typed_array_max_size_in_heap = 0 in rusty-v8-147.4.0/.gn). |
| new TextDecoder() op-dispatches even for the default "utf-8" case | ext/web/08_text_encoding.js:85, :213 | Construction calls op_encoding_normalize_label(label); non-utf8 case additionally calls op_encoding_new_decoder to obtain a cppgc resource. The default UTF-8 path also goes through normalize-label. |
| decode(chunk, {stream:true}) op-dispatches per chunk and returns a U16String (UTF-16 vec) | ext/web/08_text_encoding.js:219, ext/web/lib.rs:631-669 | Each op allocates vec![0; max_utf16_buffer_length] for the output, then marshals as U16String back across the boundary. The static (non-stream) UTF-8 path has the simdutf ASCII fast path; the streaming/non-UTF-8 path does not. |
| Encoder/encode path does not have an on-heap fast path for small outputs | libs/core/ops_builtin_v8.rs:526-528 | Compare to Bun's intrinsic encode path which is ~20× faster on 12-byte strings. The on-heap-TypedArray V8 default is 64 bytes; disabling it (for ArrayBuffer pointer stability in rusty-v8) forces every encode through the malloc-backed allocator. |
| op_encoding_decode_utf8 has an excellent SIMD ASCII fast path (keep this) | ext/web/lib.rs:518-535 | This is why decode(1 MB ASCII) is 6.5× faster than Node. Documented here as a positive architectural choice, not a finding to fix. |

Ranked architectural hypotheses

H1 — TextEncoder.encode allocates a fresh malloc-backed ArrayBuffer on every call; on-heap-TypedArray fast path is disabled (HIGH × HIGH)

  • Evidence. encode("hello world!") (12 B) is 1 324 ns in Deno vs
    65 ns in Bun — 20× gap. V8 prof: ~32 % of nonlib ticks in
    __libc_malloc + __libc_realloc, plus 12 % in malloc arena locks.
  • Architectural root. libs/core/ops_builtin_v8.rs:517-531
    always calls v8::ArrayBuffer::new_backing_store_from_vec(bytes).make_shared(),
    which in rusty-v8 goes through the libc allocator (V8's on-heap
    TypedArray path is disabled at v8_typed_array_max_size_in_heap = 0 in
    the rusty-v8 .gn, to preserve embedder pointer stability for op2's
    #[buffer] ABI).
  • Estimated impact if fixed. Two complementary options:
    • (A) Re-enable on-heap TypedArrays for buffers ≤64 bytes and audit
      op2 #[buffer] callsites for NoAllocScope. Drops the 20× small-encode
      gap to near-zero.
    • (B) Implement a small-object backing store pool inside the
      embedder ArrayBuffer::Allocator. Less invasive than (A) but doesn't
      solve the cppgc-pointer-stability concern fully.
    Either option closes the encode gap and drops libc time below 5 %. This
    cuts across perf-research/fetch (every server response body),
    perf-research/streams, perf-research/structuredClone, and several other
    surfaces; it is the single highest-leverage finding across the whole
    research effort.
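
To make the pooling idea behind option (B) concrete, here is a userland analogue in JavaScript. It is only a sketch of how many small outputs can share one allocation; it is not the proposed embedder-side ArrayBuffer::Allocator change. The SlabEncoder class and its slab size are hypothetical names and values chosen for this example.

```javascript
// Hypothetical userland analogue of a small-buffer pool; illustration only.
class SlabEncoder {
  #encoder = new TextEncoder();
  #slabSize;
  #slab;
  #offset = 0;

  constructor(slabSize = 64 * 1024) {
    this.#slabSize = slabSize;
    this.#slab = new Uint8Array(slabSize);
  }

  encodeSmall(text) {
    // One UTF-16 code unit encodes to at most 3 UTF-8 bytes.
    const maxBytes = text.length * 3;
    // Too large for the slab: fall back to a normal, individually allocated encode.
    if (maxBytes > this.#slabSize) return this.#encoder.encode(text);
    // Not enough room left: start a new slab (old views keep the old one alive).
    if (this.#offset + maxBytes > this.#slabSize) {
      this.#slab = new Uint8Array(this.#slabSize);
      this.#offset = 0;
    }
    const dest = this.#slab.subarray(this.#offset);
    const { written } = this.#encoder.encodeInto(text, dest);
    const view = this.#slab.subarray(this.#offset, this.#offset + written);
    this.#offset += written;
    return view; // shares the slab's ArrayBuffer with other results
  }
}

const slabEncoder = new SlabEncoder();
const first = slabEncoder.encodeSmall("hello world!");
const second = slabEncoder.encodeSmall("Content-Type: application/json");
console.log(first.buffer === second.buffer, second.byteOffset);
```

The trade-off is that results are views into a shared buffer, so code that reads .buffer directly or transfers it would see the whole slab. That is one reason a real fix belongs below the API boundary rather than in user code.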

H2 — new TextDecoder() op-dispatches even for the default UTF-8 case (MEDIUM × HIGH)

  • Evidence. new TextDecoder() is 187 ns in Deno vs 75 ns in Node and
    7 ns in Bun (28× Bun, 2.5× Node).
  • Architectural root. Constructor at
    ext/web/08_text_encoding.js:85 calls
    op_encoding_normalize_label(label) unconditionally; later code (line
    213) sets up a cppgc resource only for non-UTF-8 cases. The default
    UTF-8 path still pays the label-normalize op. Bun's TextDecoder
    presumably elides this entirely on the default-utf8 path.
  • Estimated impact if fixed. Skip the label normalize op for the
    common case of new TextDecoder() (no args) and new TextDecoder("utf-8").
    Brings construct to ~10 ns (just the brand assignment). Affects every
    fetch handler that constructs a TextDecoder per request, every streaming
    reader, etc. Not a leading cost individually but very common.
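
The shape of that fast path can be sketched as follows. normalizeLabelSlow is a hypothetical stand-in for the label-normalization op (its tiny label table is a toy, not the real Encoding Standard table), and resolveEncoding is an assumed name. The point is only that an exact match on the common labels never reaches the slow path.

```javascript
// Hypothetical stand-in for the label-normalization op; counts its invocations.
let slowPathCalls = 0;
const TOY_LABELS = new Map([
  ["utf-8", "utf-8"],
  ["utf8", "utf-8"],
  ["latin1", "windows-1252"],
]);

function normalizeLabelSlow(label) {
  slowPathCalls++;
  const key = String(label).trim().toLowerCase();
  const encoding = TOY_LABELS.get(key);
  if (encoding === undefined) throw new RangeError(`unsupported label: ${label}`);
  return encoding;
}

// Fast path: the default label and exact "utf-8" / "utf8" skip normalization.
function resolveEncoding(label = "utf-8") {
  if (label === "utf-8" || label === "utf8") return "utf-8";
  return normalizeLabelSlow(label);
}

console.log(resolveEncoding(), resolveEncoding("utf-8"), slowPathCalls);
console.log(resolveEncoding("Latin1"), slowPathCalls);
```

Labels that differ only in case or surrounding whitespace still take the slow path here, which keeps the fast path a plain string comparison.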

H3 — decode(chunk, {stream:true}) allocates vec![0; max_utf16_len] per chunk and returns marshaled U16String (MEDIUM × HIGH)

  • Evidence. decode_stream_5chunks is 2.06× Node and 3.92× Bun.
  • Architectural root. ext/web/lib.rs:631-669
    pre-allocates a UTF-16 vector sized at max_utf16_buffer_length(data.len())
    then truncates. Returns U16String which is then marshaled back as a
    V8 string. Plus the cppgc resource borrow on every call.
  • Estimated impact if fixed. For ASCII-only chunks (the common case
    for HTTP body streaming), follow the simdutf ASCII fast path that the
    non-stream UTF-8 path already has — write straight to a V8 one-byte
    string and skip the UTF-16 vector. Brings stream decode in line with
    non-stream decode (~118 ns/chunk instead of 510 ns/chunk in this
    bench). For binary or non-ASCII streams, the existing path is fine.
    Affects any code that uses response.body + TextDecoderStream (a
    very common SSE / line-streamed-JSON pattern).
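
A userland sketch of the same idea: keep a cheap path for as long as the stream has been pure ASCII, and hand over to a real streaming decoder at the first non-ASCII byte. This is not the proposed Rust change. AsciiPrefixStreamDecoder and isAscii are hypothetical names, and the sketch assumes small chunks (String.fromCharCode.apply has an argument-count limit) and the default non-fatal UTF-8 decoder.

```javascript
function isAscii(bytes) {
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] > 0x7f) return false;
  }
  return true;
}

// Hypothetical wrapper: ASCII-only prefix of a stream bypasses the stateful decoder.
class AsciiPrefixStreamDecoder {
  #inner = null;        // created lazily at the first non-ASCII chunk
  #consumedAny = false; // true once the fast path has consumed bytes

  decodeChunk(chunk) {
    if (this.#inner === null) {
      if (isAscii(chunk)) {
        if (chunk.length > 0) this.#consumedAny = true;
        // ASCII bytes map 1:1 to code units; no decoder state is needed.
        return String.fromCharCode.apply(null, chunk);
      }
      // If bytes were already consumed, EF BB BF here is not a leading BOM,
      // so the inner decoder must not strip it.
      this.#inner = new TextDecoder("utf-8", { ignoreBOM: this.#consumedAny });
    }
    // From here on every chunk goes through the stateful streaming decoder.
    return this.#inner.decode(chunk, { stream: true });
  }

  flush() {
    const tail = this.#inner === null ? "" : this.#inner.decode();
    this.#inner = null;
    this.#consumedAny = false;
    return tail;
  }
}

const streamDecoder = new AsciiPrefixStreamDecoder();
const chunks = [[0x61], [0xc3], [0xa9]].map((b) => new Uint8Array(b));
const text = chunks.map((c) => streamDecoder.decodeChunk(c)).join("") + streamDecoder.flush();
console.log(text === "a\u00e9");
```

Once a non-ASCII chunk has been seen the wrapper stays on the slow path, which is conservative but avoids having to track partial multi-byte sequences outside the decoder.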

H4 — op_encoding_encode_into already has a fast path; encode does not (MEDIUM × MEDIUM)

  • Evidence. encodeInto(small, dest) is 40 ns in Deno (faster than
    Node's 53 ns, on par with Bun's 42 ns). But encode(small) is 1 512 ns in
    Deno, about 38× the encodeInto cost on the same input.
  • Architectural root. encodeInto writes to a caller-provided buffer
    (ext/web/lib.rs:686-704) — no allocation,
    no backing-store creation. encode always allocates. The asymmetry
    is large.
  • Estimated impact if fixed. The fix is H1's small-buffer pool /
    on-heap path. A workaround would be to expose a fast path in
    op_encode that uses the same write-into-caller-buffer pattern when
    the encoded size is bounded (e.g. ≤256 bytes) — encode into a stack
    buffer first, then a single ArrayBuffer creation with the exact size.
    Listed separately from H1 because the encode-into infrastructure already
    exists and could be reused.
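
The workaround pattern (encode into a bounded scratch buffer, then create one exact-size result) can be sketched in userland as below. This illustrates the shape of the idea only; the real change would live inside op_encode, and this sketch makes no claim about its speed. encodeViaScratch and the 256-byte bound are assumptions taken from the example threshold above.

```javascript
const scratchEncoder = new TextEncoder();
const SCRATCH_BYTES = 256; // assumed bound, matching the "≤256 bytes" example
const scratch = new Uint8Array(SCRATCH_BYTES);

// Hypothetical helper: bounded inputs go through the reusable scratch buffer.
function encodeViaScratch(text) {
  // One UTF-16 code unit encodes to at most 3 UTF-8 bytes, so this bound
  // guarantees the whole string fits in the scratch buffer.
  if (text.length * 3 > SCRATCH_BYTES) return scratchEncoder.encode(text);
  const { written } = scratchEncoder.encodeInto(text, scratch);
  // slice() copies into a new Uint8Array of exactly `written` bytes.
  return scratch.slice(0, written);
}

const header = encodeViaScratch("Content-Type: application/json");
console.log(header.byteLength, header.buffer !== scratch.buffer);
```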

Non-finding — decode(1 MB ASCII) is 6.5× faster than Node (architectural win)

Documented here so the team doesn't accidentally undo it:
ext/web/lib.rs:528-535 and the matching
libs/core/ops_builtin_v8.rs:545-551
both short-circuit through v8::simdutf::validate_ascii and then
v8::String::new_from_one_byte. Pure ASCII (the dominant real-world
shape: HTTP/JSON bodies, file reads, console output) skips V8's internal
UTF-8 validation entirely. This single decision puts Deno's bulk-ASCII
decode well ahead of every other runtime measured.

Reproduction

See tools/perf_research/text-encoding/README.md.

cargo build --release --bin deno
./target/release/deno run -A --no-prompt tools/perf_research/text-encoding/micro/text_encoding_micro.js
node tools/perf_research/text-encoding/micro/text_encoding_micro.js
bun  tools/perf_research/text-encoding/micro/text_encoding_micro.js

…8 prof

- micro/text_encoding_micro.js: 14 ops covering encode (tiny/small/medium
  ASCII + UTF-8 mixed), encodeInto, decode (tiny/small/medium/large ASCII +
  UTF-8 mixed), stream-mode decode, encoder/decoder construct.
- micro_results.jsonl: Deno 2.7.14 / Node v22.22.2 / Bun 1.3.14 numbers.
- profiles/text_encoding_micro.prof.txt + .v8.log.gz: V8 --prof attribution
  showing libc malloc/realloc dominate the encode path (24.5 % of total
  ticks in libc, ~32 % nonlib in malloc/realloc) — confirms architectural
  cost of fresh backing stores per encode call.
@crowlbot crowlbot mentioned this pull request May 18, 2026