SpawnDev.ILGPU v4.6.0
6 backends. 1,511 tests. Zero failures.
CUDA, OpenCL, CPU, WebGPU, WebGL, and Wasm — all passing. GPU compute in the browser is no longer experimental.
Highlights
Full Multi-Worker Wasm Barrier Dispatch
The Wasm backend now dispatches across the full navigator.hardwareConcurrency worker count, with group barriers and shared memory. A pure spin barrier built on i32.atomic.load loops replaces the previous wait32/notify approach, working around a V8 atomics visibility gap that caused data races with three or more workers.
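As a rough illustration of the spin-barrier technique (not the generated Wasm itself), here is a minimal sense-reversing barrier in JavaScript over a SharedArrayBuffer, spinning on Atomics.load instead of blocking with Atomics.wait. All names and the control-word layout are illustrative:

```javascript
// Control words in a shared Int32Array (layout is an assumption for this sketch).
const CTRL_COUNT = 0; // arrivals in the current generation
const CTRL_GEN = 1;   // generation counter, bumped by the last arrival

function barrierWait(ctrl, numWorkers) {
  const gen = Atomics.load(ctrl, CTRL_GEN);
  // Atomics.add returns the old value; the last arrival resets and releases.
  if (Atomics.add(ctrl, CTRL_COUNT, 1) + 1 === numWorkers) {
    Atomics.store(ctrl, CTRL_COUNT, 0);
    Atomics.add(ctrl, CTRL_GEN, 1);
  } else {
    // Pure spin on an atomic load (the i32.atomic.load pattern): every
    // iteration re-reads shared memory, sidestepping wait/notify visibility.
    while (Atomics.load(ctrl, CTRL_GEN) === gen) { /* spin */ }
  }
}

const ctrl = new Int32Array(new SharedArrayBuffer(8));
barrierWait(ctrl, 1); // single worker: arrives last, releases immediately
console.log(Atomics.load(ctrl, CTRL_GEN)); // 1
```

With real workers, each would call barrierWait with the same shared buffer; the spin loop only exits once the last arrival bumps the generation.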
RadixSort Verified at Scale
RadixSort passes across all data types and sizes up to 4M elements on every backend — including Wasm in the browser. Key fixes:
- Histogram counter buffer sizing — fixed undersized counters that produced out-of-bounds writes during grid-stride iteration
- Grid-stride tail byte padding — extended linear-memory slack allocation to prevent OOB traps on packed buffers
- Per-worker scratch isolation — eliminated intermittent sort corruption in non-barrier kernels
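For readers unfamiliar with the shape of the algorithm, the following is a scalar JavaScript sketch of one 8-bit radix pass with a correctly sized digit histogram — the counter-buffer pattern the first fix above concerns. It is illustrative only, not ILGPU's kernel:

```javascript
const RADIX = 256; // one counter per possible byte value

// Sort 32-bit values by their least-significant byte (one pass of LSD radix sort).
function radixPassLSB(input) {
  const hist = new Uint32Array(RADIX);     // on the GPU this is per-group,
  for (const v of input) hist[v & 0xff]++; // filled by a grid-stride loop
  // Exclusive prefix sum turns digit counts into output offsets.
  let sum = 0;
  for (let d = 0; d < RADIX; d++) { const c = hist[d]; hist[d] = sum; sum += c; }
  // Scatter each element to its slot, bumping the offset as we go.
  const out = new Uint32Array(input.length);
  for (const v of input) out[hist[v & 0xff]++] = v;
  return out;
}

console.log([...radixPassLSB(new Uint32Array([3, 1, 2, 1]))]); // [1, 1, 2, 3]
```

An undersized hist (fewer than groups × RADIX counters in the parallel version) makes the scatter offsets index past the buffer, which is exactly the out-of-bounds failure mode described above.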
20+ Wasm Codegen Fixes
Deep correctness pass across the Wasm code generator:
- Fiber yield-per-phase with dynamic block splitting
- Atomic loads/stores for all shared memory access in barrier kernels (including float via i32/i64 reinterpret)
- Struct load copy semantics to prevent aliasing
- Unsigned comparison in MinUInt32/MinUInt64 reductions
- Correct atomic RMW opcode table for interleaved sub-word variants
- Local alloca addressing, shared memory deduplication, and IR address space aliasing guards
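The float-via-i32 reinterpret trick mentioned above can be sketched in JavaScript: since Atomics only operate on integer views, float values are stored and loaded as their raw 32-bit patterns. Function names here are illustrative, not ILGPU API:

```javascript
const buf = new SharedArrayBuffer(4);
const asI32 = new Int32Array(buf);                  // atomic ops need an integer view
const scratch = new DataView(new ArrayBuffer(4));   // local bit-reinterpret scratch

function atomicStoreF32(i, value) {
  scratch.setFloat32(0, value, true);
  Atomics.store(asI32, i, scratch.getInt32(0, true)); // store the raw bits atomically
}

function atomicLoadF32(i) {
  scratch.setInt32(0, Atomics.load(asI32, i), true);
  return scratch.getFloat32(0, true);                 // reinterpret bits back to float
}

atomicStoreF32(0, 3.5);
console.log(atomicLoadF32(0)); // 3.5
```

The round trip is exact because no arithmetic touches the value: only its bit pattern moves through the atomic integer path.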
WebGPU Backend Fixes
- WGSL loop break + bool PHI: correct merge value generation when breaking from loops with boolean phi nodes
- WGSL continuation after if-else with break: prevent unreachable code generation
Test Results
| Backend | Pass | Fail | Skip |
|---|---|---|---|
| CUDA | all | 0 | — |
| OpenCL | all | 0 | — |
| CPU | all | 0 | — |
| WebGPU | 229 | 0 | 12 |
| WebGL | 139 | 0 | 115 |
| Wasm | 249 | 0 | 3 |
| Total | 1,511 | 0 | 162 |
WebGL skips are architectural (GLSL ES 3.0 lacks shared memory/barriers/atomics). Wasm skips are subgroup-dependent features not available in browser WebAssembly.
What This Means
This release proves that GPU-class parallel algorithms — radix sort, scan, reduce, atomics, shared memory, group barriers — run correctly in the browser across WebGPU, WebGL, and WebAssembly, alongside native CUDA, OpenCL, and CPU backends. Write your kernel once, run it everywhere.