perf research: text-encoding by crowlbot · Pull Request #3 · crowlbot/deno

crowlbot · 2026-05-14T14:47:21Z

perf research: text-encoding

Macro performance research on Deno's implementation of TextEncoder and
TextDecoder (including encodeInto and stream-mode decoding).

This PR contains only benchmark scripts and committed V8 prof artifacts — no
production code changes. The report below is the deliverable.

Methodology

- Ratios over absolute numbers. Host is Docker on Proxmox; absolute
  ns/op is unreliable, so the headline is same-host ratios vs Node 22 LTS
  and Bun.
- Flamegraph attribution. V8 --prof in-process. perf / samply
  need kernel.perf_event_paranoid<=1 but the container is locked at
  3 and sysctl is denied. JS-side attribution is comprehensive;
  native-side time is only bucketed under "deno binary" and "libc.so.6".
- Microbench mix. 14 ops covering encode (tiny/small/medium ASCII +
  UTF-8 mixed), encodeInto, decode at four ASCII sizes (tiny/small/medium/
  large 1 MB) + UTF-8 mixed at two sizes (~50 B and ~1 KB), stream-mode
  decode (5 chunks), and encoder/decoder construction.
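
For readers who want the shape of a ns/op measurement without opening the scripts, here is a minimal sketch. The `bench` helper, its warm-up count, and the iteration count are assumptions made for illustration; this is not the content of text_encoding_micro.js. It relies only on the `performance`, `TextEncoder`, and `TextDecoder` globals, so the same file should run under Deno, Node, and Bun.

```javascript
// Hypothetical harness: `bench` and its defaults are illustrative only.
function bench(name, fn, iterations = 100_000) {
  // Warm up so the JIT has a chance to optimize fn before timing starts.
  for (let i = 0; i < 1_000; i++) fn();
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  const elapsedMs = performance.now() - start;
  // Convert milliseconds for the whole loop into nanoseconds per call.
  return { name, nsPerOp: (elapsedMs * 1e6) / iterations };
}

const encoder = new TextEncoder();
const decoder = new TextDecoder();
const ascii12 = encoder.encode("hello world!");

const results = [
  bench("encode_tiny_ascii", () => encoder.encode("hello world!")),
  bench("decode_tiny_ascii", () => decoder.decode(ascii12)),
];

// One JSON object per line, so per-runtime outputs can be compared afterwards.
for (const r of results) console.log(JSON.stringify(r));
```

Running the same file under each runtime on the same host and dividing the per-op numbers gives the same-host ratios described above.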

Pinned: deno 2.7.14 (v8 14.7.173.20-rusty), node v22.22.2, bun 1.3.14.

Headline ratios — microbench (ns/op; lower is better)

op	Deno	Node	Bun	Deno/Node	Deno/Bun
`encode("hello world!")` (12 B)	1 324	1 082	65	1.22	20.21
`encode("Content-Type: application/json")` (30 B)	1 512	970	44	1.56	34.36
`encode("x".repeat(1000))` (1 KB)	2 335	2 132	386	1.10	6.05
`encode(utf8mixed)` (~50 B mixed)	999	1 078	62	0.93	16.01
`encodeInto(small, dest)`	40	53	42	0.76	0.96
`decode(12 B ASCII)`	118	77	69	1.54	1.72
`decode(30 B ASCII)`	115	76	85	1.51	1.36
`decode(1 KB ASCII)`	190	193	669	0.99	0.28
`decode(1 MB ASCII)`	93 206	610 261	117 139	0.15	0.80
`decode(utf8mixed)` (~50 B)	286	217	205	1.32	1.40
`decode(utf8medium mixed)` (~1 KB)	3 300	3 029	1 316	1.09	2.51
`decode (5 stream:true chunks of 30 B)`	2 551	1 241	651	2.06	3.92
`new TextEncoder()`	6.3	6.2	5.0	1.02	1.27
`new TextDecoder()`	187	75	6.6	2.49	28.32

Reads:

- Deno is the fastest of the three runtimes on decode(1 MB ASCII): 93 µs vs
  Node's 610 µs (6.5× faster) and Bun's 117 µs, thanks to the
  simdutf::validate_ascii short circuit at ext/web/lib.rs:528-535 and the
  matching path in core's op_decode. This is one of the few wins documented
  here; flag for the team.
- TextEncoder.encode("…") on small inputs is 20-34× slower than Bun.
  Bun has an intrinsic Response/encode path; Deno op-dispatches into
  op_encode, which always allocates a fresh shared backing store for the
  output Uint8Array.
- new TextDecoder() is 28× slower than Bun (187 ns vs 7 ns). Every
  construction op-dispatches through op_encoding_normalize_label, even for
  the default "utf-8" case; non-UTF-8 labels additionally call
  op_encoding_new_decoder to obtain a cppgc resource.
- Stream-mode decode(chunk, { stream: true }) is 2× slower than Node and
  3.9× slower than Bun. Each chunk is an op call that returns a fresh
  U16String (UTF-16 vector) marshaled across the boundary.
- Single decode calls on small inputs take about 1.5× as long as in Node:
  decode_tiny_ascii is 118 ns vs 77 ns. The simdutf fast path lives on the
  ASCII branch, but there's still op-dispatch overhead per call.
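
As a worked example of how the ratio columns and the "N× faster" phrasing relate, the sketch below recomputes two rows from the ns/op values in the table. The `ratios` helper is a hypothetical name introduced here; the input numbers are copied from the table.

```javascript
// ns/op values copied from the headline table.
const rows = {
  "decode(1 MB ASCII)": { deno: 93206, node: 610261, bun: 117139 },
  "decode (5 stream:true chunks of 30 B)": { deno: 2551, node: 1241, bun: 651 },
};

// Hypothetical helper: a ratio below 1 means Deno is faster.
function ratios({ deno, node, bun }) {
  return { denoOverNode: deno / node, denoOverBun: deno / bun };
}

for (const [op, row] of Object.entries(rows)) {
  const r = ratios(row);
  console.log(op, r.denoOverNode.toFixed(2), r.denoOverBun.toFixed(2));
}

// "6.5× faster than Node" is the inverse of the 0.15 Deno/Node ratio.
const speedupVsNode = rows["decode(1 MB ASCII)"].node / rows["decode(1 MB ASCII)"].deno;
console.log(speedupVsNode.toFixed(1));
```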

Flamegraph attribution (V8 `--prof`)

Full profile:
tools/perf_research/text-encoding/profiles/text_encoding_micro.prof.txt
(raw log: text_encoding_micro.v8.log.gz).

Top of Statistical profiling result (1 740 ticks):

   ticks  total  nonlib   name
   1011   58.1%          /var/agent-loop/repo/target/release/deno
    426   24.5%          /usr/lib/x86_64-linux-gnu/libc.so.6

 [C++]:
   ticks  total  nonlib   name
     64    3.7%   21.3%   __libc_malloc@@GLIBC_2.2.5
     32    1.8%   10.7%   __libc_realloc@@GLIBC_2.2.5
     23    1.3%    7.7%   __lll_lock_wake_private@@GLIBC_PRIVATE
     14    0.8%    4.7%   __lll_lock_wait_private@@GLIBC_PRIVATE
      6    0.3%    2.0%   __libc_free@@GLIBC_2.2.5

 [JavaScript]:
   ticks  total  nonlib   name
     19    1.1%    6.3%   Builtin: LoadIC
      8    0.5%    2.7%   Builtin: CallApiCallbackOptimizedNoProfiling
      8    0.5%    2.7%   Builtin: Call_ReceiverIsNullOrUndefined
      ...

~32 % of nonlib time is in __libc_malloc + __libc_realloc. This is
consistent with the backing-store allocator path firing on every
new Uint8Array(N) created inside op_encode, although V8 --prof cannot
attribute native frames directly. JS-side cost is small (<10 % combined);
essentially all time is in the native encode path and its allocator.

The __lll_lock_wake_private / __lll_lock_wait_private entries (12.4 % of
nonlib ticks combined) suggest contention on the libc malloc arena lock:
small-buffer allocations going through a global heap lock.

Where the cost lives

Finding	File:line	Notes
`op_encode` always allocates a fresh shared backing store via `v8::ArrayBuffer::new_backing_store_from_vec`	`libs/core/ops_builtin_v8.rs:517-531`	Every `TextEncoder.encode("…")` (and every `core.encode(…)` call from fetch body/formdata/22_body.js) ends here. Backing store creation goes through V8's `ArrayBuffer::Allocator`, which in rusty-v8 cannot use the on-heap-TypedArray fast path (`v8_typed_array_max_size_in_heap = 0` in `rusty-v8-147.4.0/.gn`).
`new TextDecoder()` op-dispatches even for the default `"utf-8"` case	`ext/web/08_text_encoding.js:85`, `:213`	Construction calls `op_encoding_normalize_label(label)`; non-utf8 case additionally calls `op_encoding_new_decoder` to obtain a cppgc resource. The default UTF-8 path also goes through normalize-label.
`decode(chunk, {stream:true})` op-dispatches per chunk and returns a `U16String` (UTF-16 vec)	`ext/web/08_text_encoding.js:219`, `ext/web/lib.rs:631-669`	Each op allocates `vec![0; max_utf16_buffer_length]` for the output, then marshals as `U16String` back across the boundary. The static (non-stream) UTF-8 path has the `simdutf` ASCII fast path; the streaming/non-UTF-8 path does not.
Encoder/encode path does not have an on-heap fast path for small outputs	`libs/core/ops_builtin_v8.rs:526-528`	Compare to Bun's intrinsic encode path which is ~20× faster on 12-byte strings. The on-heap-TypedArray V8 default is 64 bytes; disabling it (for ArrayBuffer pointer stability in rusty-v8) forces every encode through the malloc-backed allocator.
`op_encoding_decode_utf8` has an excellent SIMD ASCII fast path — keep this	`ext/web/lib.rs:518-535`	This is why `decode(1 MB ASCII)` is 6.5× faster than Node. Documented here as a positive architectural choice, not a finding to fix.

Ranked architectural hypotheses

H1 — `TextEncoder.encode` allocates a fresh malloc-backed `ArrayBuffer` on every call; on-heap-TypedArray fast path is disabled (HIGH × HIGH)

Evidence. encode("hello world!") (12 B) is 1 324 ns in Deno vs
65 ns in Bun — 20× gap. V8 prof: ~32 % of nonlib ticks in
__libc_malloc + __libc_realloc, plus 12 % in malloc arena locks.
Architectural root. libs/core/ops_builtin_v8.rs:517-531
always calls v8::ArrayBuffer::new_backing_store_from_vec(bytes).make_shared(),
which in rusty-v8 goes through the libc allocator (V8's on-heap
TypedArray path is disabled at v8_typed_array_max_size_in_heap = 0 in
the rusty-v8 .gn, to preserve embedder pointer stability for op2's
#[buffer] ABI).
Estimated impact if fixed. Two complementary options:
- (A) Re-enable on-heap TypedArrays for buffers ≤64 bytes and audit
  op2 #[buffer] callsites for NoAllocScope. This should remove most of the
  20× small-encode gap for outputs that fit on-heap (the 12 B and 30 B
  cases); the 1 KB case would not benefit.
- (B) Implement a small-object backing store pool inside the
  embedder ArrayBuffer::Allocator. Less invasive than (A) but doesn't
  fully address the pointer-stability concern.

Either way, the estimate is that the small-encode gap largely closes and
libc time drops below 5 %. Cross-cutting with perf-research/fetch (every
server response body), perf-research/streams, perf-research/structuredClone,
and several other surfaces; this is the single highest-leverage finding
across the whole research effort.

H2 — `new TextDecoder()` op-dispatches even for the default UTF-8 case (MEDIUM × HIGH)

Evidence. new TextDecoder() is 187 ns in Deno vs 75 ns in Node and
7 ns in Bun (28× Bun, 2.5× Node).
Architectural root. Constructor at
ext/web/08_text_encoding.js:85 calls
op_encoding_normalize_label(label) unconditionally; later code (line
213) sets up a cppgc resource only for non-UTF-8 cases. The default
UTF-8 path still pays the label-normalize op. Bun's TextDecoder
presumably elides this entirely on the default-utf8 path.
Estimated impact if fixed. Skip the label normalize op for the
common case of new TextDecoder() (no args) and new TextDecoder("utf-8").
Brings construct to ~10 ns (just the brand assignment). Affects every
fetch handler that constructs a TextDecoder per request, every streaming
reader, etc. Not a leading cost individually but very common.
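
The proposed short circuit can be sketched as below. This is a hypothetical illustration, not Deno's constructor code: `normalizeLabelSlow` stands in for the op call (it handles only a few UTF-8 aliases and uses `trim()`, which is looser than the Encoding spec's whitespace rule), and `normalizeLabel` is an assumed wrapper name.

```javascript
// Hypothetical stand-in for the op-backed normalizer; counts how often it runs.
let slowCalls = 0;
function normalizeLabelSlow(label) {
  slowCalls++;
  const key = String(label).trim().toLowerCase();
  if (key === "utf-8" || key === "utf8" || key === "unicode-1-1-utf-8") {
    return "utf-8";
  }
  throw new RangeError(`label not handled by this sketch: ${label}`);
}

// Fast path: the no-argument and exact "utf-8" cases never reach the slow call.
function normalizeLabel(label) {
  if (label === undefined || label === "utf-8") return "utf-8";
  return normalizeLabelSlow(label);
}

console.log(normalizeLabel(), slowCalls);          // default label, slow path untouched
console.log(normalizeLabel("utf-8"), slowCalls);   // exact match, slow path untouched
console.log(normalizeLabel(" UTF8 "), slowCalls);  // alias still goes through the slow path
```

Independently of any runtime change, callers can hoist a single TextDecoder out of per-request code, since a non-stream decode() call does not carry state into the next call.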

H3 — `decode(chunk, {stream:true})` allocates `vec![0; max_utf16_len]` per chunk and returns marshaled `U16String` (MEDIUM × HIGH)

Evidence. decode_stream_5chunks is 2.06× Node and 3.92× Bun.
Architectural root. ext/web/lib.rs:631-669
pre-allocates a UTF-16 vector sized at max_utf16_buffer_length(data.len())
then truncates. Returns U16String which is then marshaled back as a
V8 string. Plus the cppgc resource borrow on every call.
Estimated impact if fixed. For ASCII-only chunks (the common case
for HTTP body streaming), follow the simdutf ASCII fast path that the
non-stream UTF-8 path already has — write straight to a V8 one-byte
string and skip the UTF-16 vector. Brings stream decode in line with
non-stream decode (~118 ns/chunk instead of 510 ns/chunk in this
bench). For binary or non-ASCII streams, the existing path is fine.
Affects any code that uses response.body + TextDecoderStream (a
very common SSE / line-streamed-JSON pattern).
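
For context on what the stream-mode benchmark exercises, the following sketch shows why callers pass { stream: true } at all: a multi-byte character split across two chunks is only reassembled when the decoder keeps state between calls. The input string and the split point are chosen for illustration and are not taken from the benchmark script.

```javascript
// "é" (U+00E9) encodes as two bytes, 0xC3 0xA9, so the 5-character string is 6 bytes.
const bytes = new TextEncoder().encode("h\u00e9llo");
// Split in the middle of "é": the first chunk ends with 0xC3.
const chunks = [bytes.subarray(0, 2), bytes.subarray(2)];

// Stream mode: the decoder buffers the dangling lead byte until the next chunk.
const streaming = new TextDecoder();
let streamed = "";
for (const chunk of chunks) {
  streamed += streaming.decode(chunk, { stream: true });
}
streamed += streaming.decode(); // final call with no input flushes any pending bytes

// Decoding each chunk independently replaces the broken halves with U+FFFD.
const naive = chunks.map((c) => new TextDecoder().decode(c)).join("");

console.log(streamed === "h\u00e9llo", naive.includes("\uFFFD"));
```

Chunks that are pure ASCII never leave pending bytes behind, which is why an ASCII short circuit is plausible for this path.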

H4 — `op_encoding_encode_into` already has a fast path; encode does not (MEDIUM × MEDIUM)

Evidence. encodeInto(small, dest) is 40 ns in Deno (faster than
Node's 53 ns, matches Bun's 42 ns). But encode(small) is 1 512 ns in
Deno (about 38× the encodeInto cost on the same input).
Architectural root. encodeInto writes to a caller-provided buffer
(ext/web/lib.rs:686-704) — no allocation,
no backing-store creation. encode always allocates. The asymmetry
is large.
Estimated impact if fixed. The fix is H1's small-buffer pool /
on-heap path. A workaround would be to expose a fast path in
op_encode that uses the same write-into-caller-buffer pattern when
the encoded size is bounded (e.g. ≤256 bytes) — encode into a stack
buffer first, then a single ArrayBuffer creation with the exact size.
Listed separately from H1 because the encode-into infrastructure already
exists and could be reused.
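
Until encode has such a path, callers can get a similar effect in user code with encodeInto and a reusable scratch buffer. The sketch below is an assumption-laden illustration, not a measured result from this report: `encodeSmall` and the 256-byte scratch size are hypothetical, and the returned view aliases the scratch buffer, so it is only valid until the next call.

```javascript
const encoder = new TextEncoder();
const scratch = new Uint8Array(256); // hypothetical size, reused across calls

// Returns a view into `scratch` for short strings; callers must copy it
// if they need the bytes to survive the next encodeSmall call.
function encodeSmall(text) {
  // UTF-8 needs at most 3 bytes per UTF-16 code unit, so this bound
  // guarantees that encodeInto can write the whole string.
  if (text.length * 3 > scratch.length) return encoder.encode(text);
  const { written } = encoder.encodeInto(text, scratch);
  return scratch.subarray(0, written);
}

const header = encodeSmall("Content-Type: application/json");
console.log(header.length, header.buffer === scratch.buffer);
```

The trade-off is aliasing: this avoids a new backing store per call, but it is only safe when the bytes are consumed before the next encode.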

Non-finding — `decode(1 MB ASCII)` is 6.5× faster than Node (architectural win)

Documented here so the team doesn't accidentally undo it:
ext/web/lib.rs:528-535 and the matching
libs/core/ops_builtin_v8.rs:545-551
both short-circuit through v8::simdutf::validate_ascii →
v8::String::new_from_one_byte. Pure ASCII (the dominant real-world
shape: HTTP/JSON bodies, file reads, console output) skips V8's internal
UTF-8 validation entirely. This single decision puts Deno's bulk-ASCII
decode ahead of both other runtimes measured (6.5× faster than Node,
about 1.26× faster than Bun).
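
The control flow of that short circuit (check for pure ASCII, then build a one-byte string directly, otherwise fall back to full UTF-8 decoding) can be illustrated in JavaScript. This is a hypothetical sketch of the idea only: `isAscii` and `decodeUtf8` are names introduced here, the real path is Rust using SIMD, and this version is not expected to be fast.

```javascript
// Scalar stand-in for a SIMD ASCII check: true if every byte is below 0x80.
function isAscii(bytes) {
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] > 0x7f) return false;
  }
  return true;
}

const fallback = new TextDecoder();

function decodeUtf8(bytes) {
  // Anything non-ASCII takes the general UTF-8 path.
  if (!isAscii(bytes)) return fallback.decode(bytes);
  // Pure ASCII: each byte maps to exactly one code unit, so no validation is needed.
  // Work in slices to stay under engine limits on argument counts.
  const STEP = 8192;
  let out = "";
  for (let i = 0; i < bytes.length; i += STEP) {
    out += String.fromCharCode.apply(null, bytes.subarray(i, i + STEP));
  }
  return out;
}

console.log(decodeUtf8(new TextEncoder().encode("GET / HTTP/1.1")));
```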

Reproduction

See tools/perf_research/text-encoding/README.md.

cargo build --release --bin deno
./target/release/deno run -A --no-prompt tools/perf_research/text-encoding/micro/text_encoding_micro.js
node tools/perf_research/text-encoding/micro/text_encoding_micro.js
bun  tools/perf_research/text-encoding/micro/text_encoding_micro.js

…8 prof
- micro/text_encoding_micro.js: 14 ops covering encode (tiny/small/medium ASCII + UTF-8 mixed), encodeInto, decode (tiny/small/medium/large ASCII + UTF-8 mixed), stream-mode decode, encoder/decoder construct.
- micro_results.jsonl: Deno 2.7.14 / Node v22.22.2 / Bun 1.3.14 numbers.
- profiles/text_encoding_micro.prof.txt + .v8.log.gz: V8 --prof attribution showing libc malloc/realloc dominate the encode path (24.5 % of total ticks in libc, ~32 % nonlib in malloc/realloc), which confirms the architectural cost of fresh backing stores per encode call.

crowlbot mentioned this pull request May 18, 2026

perf research: streams #4

Draft

crowlbot wants to merge 1 commit into main from perf-research/text-encoding
