Releases: LostBeard/SpawnDev.ILGPU
SpawnDev.ILGPU v4.6.0
6 backends. 1,511 tests. Zero failures.
CUDA, OpenCL, CPU, WebGPU, WebGL, and Wasm — all passing. GPU compute in the browser is no longer experimental.
Highlights
Full Multi-Worker Wasm Barrier Dispatch
The Wasm backend now supports full navigator.hardwareConcurrency workers with group barriers and shared memory. A pure spin barrier using i32.atomic.load loops replaces the previous wait32/notify approach, working around a V8 atomics visibility gap that caused data races with 3+ workers.
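The spin-barrier idea above can be modelled in a few lines. The Python sketch below is illustrative only and is not the backend's generated WebAssembly: the names `SpinBarrier` and `neighbour_sum` are hypothetical, a `threading.Lock` stands in for the atomic read-modify-write a Wasm worker would issue, and the polling loop plays the role of the i32.atomic.load spin.

```python
import threading
import time


class SpinBarrier:
    """Generation-counting spin barrier (hypothetical model, not library code)."""

    def __init__(self, parties):
        self.parties = parties
        self.arrived = 0
        self.generation = 0
        self._lock = threading.Lock()  # stands in for an atomic read-modify-write

    def wait(self):
        with self._lock:
            my_generation = self.generation
            self.arrived += 1
            if self.arrived == self.parties:
                # The last arrival resets the counter and releases everyone.
                self.arrived = 0
                self.generation += 1
                return
        # All other workers poll the generation word instead of sleeping
        # on a wait/notify pair.
        while self.generation == my_generation:
            time.sleep(0)  # yield so other Python threads can make progress


def neighbour_sum(workers=4):
    """Each worker publishes a value, waits at the barrier, then reads a neighbour."""
    shared = [0] * workers
    result = [0] * workers
    barrier = SpinBarrier(workers)

    def body(i):
        shared[i] = i + 1
        barrier.wait()  # no worker reads until every worker has written
        result[i] = shared[i] + shared[(i + 1) % workers]

    threads = [threading.Thread(target=body, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

With four workers, `neighbour_sum(4)` returns `[3, 5, 7, 5]`. Without the barrier, a worker could read its neighbour's slot before that slot was written.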
RadixSort Verified at Scale
RadixSort passes across all data types and sizes up to 4M elements on every backend — including Wasm in the browser. Key fixes:
- Histogram counter buffer sizing — fixed undersized counters that caused real out-of-bounds writes during grid-stride iteration
- Grid-stride tail byte padding — extended linear-memory slack allocation to prevent OOB traps on packed buffers
- Per-worker scratch isolation — eliminated intermittent sort corruption in non-barrier kernels
20+ Wasm Codegen Fixes
Deep correctness pass across the Wasm code generator:
- Fiber yield-per-phase with dynamic block splitting
- Atomic loads/stores for all shared memory access in barrier kernels (including float via i32/i64 reinterpret)
- Struct load copy semantics to prevent aliasing
- Unsigned comparison in MinUInt32/MinUInt64 reductions
- Correct atomic RMW opcode table for interleaved sub-word variants
- Local alloca addressing, shared memory deduplication, and IR address space aliasing guards
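To see why the comparison in an unsigned min reduction matters, consider what happens when the same 32-bit patterns are compared as signed values. The Python sketch below is a generic illustration of that failure mode, not the code generator's logic; the helper names are hypothetical.

```python
import struct


def as_signed32(bits):
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    return struct.unpack("<i", struct.pack("<I", bits))[0]


def min_with_signed_compare(a_bits, b_bits):
    """What a signed less-than selects for two uint32 bit patterns."""
    return a_bits if as_signed32(a_bits) < as_signed32(b_bits) else b_bits


def min_with_unsigned_compare(a_bits, b_bits):
    """What an unsigned less-than selects (the intended MinUInt32 behaviour)."""
    return a_bits if a_bits < b_bits else b_bits
```

For the inputs `0x00000001` and `0xFFFFFFFF`, the signed version treats `0xFFFFFFFF` as -1 and returns it as the minimum, while the unsigned version returns `0x00000001`.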
WebGPU Backend Fixes
- WGSL loop break + bool PHI: correct merge value generation when breaking from loops with boolean phi nodes
- WGSL continuation after if-else with break: prevent unreachable code generation
Test Results
| Backend | Pass | Fail | Skip |
|---|---|---|---|
| CUDA | all | 0 | — |
| OpenCL | all | 0 | — |
| CPU | all | 0 | — |
| WebGPU | 229 | 0 | 12 |
| WebGL | 139 | 0 | 115 |
| Wasm | 249 | 0 | 3 |
| Total | 1,511 | 0 | 162 |
WebGL skips are architectural (GLSL ES 3.0 lacks shared memory/barriers/atomics). Wasm skips are subgroup-dependent features not available in browser WebAssembly.
What This Means
This release proves that GPU-class parallel algorithms — radix sort, scan, reduce, atomics, shared memory, group barriers — run correctly in the browser across WebGPU, WebGL, and WebAssembly, alongside native CUDA, OpenCL, and CPU backends. Write your kernel once, run it everywhere.
SpawnDev.ILGPU v4.0.0
Run ILGPU C# kernels on WebGPU, WebGL, Wasm, CUDA, OpenCL, and CPU — from a single codebase.
This is a major release with deep improvements to the WebGPU and Wasm backends, bringing ILGPU's algorithm library (RadixSort, Scan, Reduce) to the browser for the first time.
Highlights
WebGPU RadixSort — Full Algorithm Support
All RadixSort variants now pass on WebGPU, including large-scale sorts (4M+ elements), pairs, descending, and multiple data types. Fixed shared memory sizing, scan barrier synchronization, range checks for auto-grouped kernels, and 256-byte alignment padding for minStorageBufferOffsetAlignment.
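The alignment padding mentioned above comes down to rounding a byte offset up to the next multiple of the device's minStorageBufferOffsetAlignment, for which 256 bytes is the WebGPU default. A minimal Python sketch of that arithmetic follows; the function names are hypothetical and the library's own allocator may differ.

```python
def align_up(offset_bytes, alignment=256):
    """Round offset_bytes up to the next multiple of a power-of-two alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset_bytes + alignment - 1) & ~(alignment - 1)


def padding_for(offset_bytes, alignment=256):
    """Number of padding bytes needed before the next aligned sub-buffer."""
    return align_up(offset_bytes, alignment) - offset_bytes
```

For example, a sub-buffer that would otherwise start at byte 4000 is moved to byte 4096, which costs 96 bytes of padding.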
Wasm Backend — Barrier Kernel Infrastructure
The Wasm backend received 7 codegen and dispatch fixes enabling correct barrier-synchronized kernels (Scan, Reduce, and single-group RadixSort):
- Struct-with-view serialization: Fixed CLR-to-IR layout mismatch for kernel structs containing ArrayViews (e.g., InitializerImplementation<T>). Manual IR-layout-aware serialization replaces Unsafe.Write.
- View field mapping: Fixed GetField handler returning 0 for ArrayView1D's Extent (Length) field, which caused all view.Length checks to fail silently.
- Local alloca addressing: Fixed local memory allocations defaulting to address 0, which caused the ExclusiveScan helper to corrupt the data buffer between sort passes.
- Per-thread scratch memory: Each parallel Web Worker now gets its own scratch region, preventing cross-worker data races during struct construction.
- Post-helper barriers: Added synchronization barriers after each ExclusiveScan helper call to prevent fast workers from starting the next scan while slow workers are still completing the previous one.
- SpecializedValue unwrapping: Fixed dispatch to correctly extract scalar values from SpecializedValue<T> wrapper structs.
- GetViewLength tracing: Added TraceToParameter() to resolve view sources through GetField/NewView chains.
WebGPU Backend Refactor
Major internal restructuring for maintainability and performance:
- Extracted SharedMemoryResolver and UniformityAnalyzer into standalone subsystems
- Per-function emulation library trimming via BFS dependency graph
- Dead variable elimination post-pass for cleaner generated WGSL
- i64 constant hoisting to module-scope const declarations
- Pre-compiled regex patterns replacing runtime Regex.IsMatch calls
- WGSL pre-validation (ValidateWGSL()) catches shader errors before GPU submission
- KernelSpecialization for all algorithm kernel loaders (RadixSort, Histogram, Scan, etc.)
Device Loss Detection
- WebGPU: Monitors the device.lost promise and exposes the IsDeviceLost property and DeviceLost event.
- WebGL: Monitors the webglcontextlost event via glWorker.js and exposes the IsContextLost property and ContextLost event.
- Intentional disposal (Dispose()) is filtered out; only unexpected losses fire the events.
Test Infrastructure
- PlaywrightMultiTest: Unified NUnit + Playwright runner executes all tests (desktop + browser) in a single dotnet test invocation
- 1316 tests passing across all 6 backends (WebGPU, WebGL, Wasm, CUDA, OpenCL, CPU), 0 failures
Browser Backend Capabilities
| Capability | WebGPU | WebGL | Wasm |
|---|---|---|---|
| Shared Memory | ✅ | ❌ | ✅ |
| Group.Barrier() | ✅ | ❌ | ✅ |
| Atomics | ✅ | ❌ | ✅ |
| ILGPU Algorithms | ✅ RadixSort, Scan, Reduce, Histogram | ❌ | ✅ Scan, Reduce (single-group) |
| 64-bit (f64/i64) | ✅ Emulated | ✅ Emulated | ✅ Native |
Known Limitations
- Wasm multi-group barrier dispatch: Barrier kernels are fully correct for single-group workloads (up to 64 elements for groupSize=64). Multi-group workloads have a cross-group SharedArrayBuffer memory visibility limitation in current browsers. A cooperative scheduling fix is planned for a future release. Desktop backends and WebGPU have no such limitation.
Breaking Changes
None. Existing ILGPU kernels and API usage are fully compatible.
Installation
dotnet add package SpawnDev.ILGPU --version 4.0.0
Links
- Live Demo — Fractal Explorer, 3D Raymarching, GPU Boids, Benchmarks, Unit Tests
- Documentation — Getting Started, Backends, Kernels, Memory & Buffers, Canvas Rendering
- GitHub
SpawnDev.ILGPU v3.5.0
Half (f16) Support
- WebGPU f16 kernels: Float16 maps to native f16 in WGSL. Buffer alignment, constant emission, and Half ↔ float conversion intrinsics all wired up. Capability-gated on device feature support.
- XMath.Min/Max/Clamp for Half: Added to XMath via float promotion.
- Group Scan/Reduce for Half: ExclusiveScan, InclusiveScan, AllReduce, and GroupReduce now support Half on WebGPU and CUDA.
- CUDA PTX Half warp shuffles: WarpShuffle, WarpShuffleDown, WarpShuffleUp, WarpShuffleXor (and SubWarp variants) for Half via b32 widening. Unlocks Half scan/reduce on CUDA.
- Lock-free AllReduce: Rewrote AllReduce in both IL and PTX backends to use per-warp shared-memory slots instead of atomic operations. Removes the Half atomics dependency entirely and is correct for all types.
- Half.One constant fix: Was 0x0001 (denormal ≈5.96e-8); corrected to 0x3C00 (IEEE-754 1.0).
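The two bit patterns in the Half.One fix can be checked independently of the library. The short Python sketch below decodes a 16-bit pattern as an IEEE-754 binary16 value with the standard `struct` module; the helper name is hypothetical.

```python
import struct


def half_from_bits(bits):
    """Decode a 16-bit pattern as an IEEE-754 binary16 (half precision) value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


old_one = half_from_bits(0x0001)  # smallest subnormal half, 2**-24
new_one = half_from_bits(0x3C00)  # sign 0, exponent 15 (the bias), mantissa 0
```

Here `new_one` is exactly 1.0, while `old_one` is 2**-24, roughly 5.96e-8, which matches the denormal described in the fix.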
WebGPU RadixSort with double / long Keys
RadixSortPairs<double, …> and RadixSortPairs<long, …> now work on WebGPU. Multiple root causes fixed end-to-end:
- FloatAsInt/IntAsFloat casts for emulated f64 now correctly reconstruct the IEEE-754 64-bit pattern.
- Structs containing emulated 64-bit fields are flattened to array<u32> in WGSL ("packed structs") to match CPU memory layout.
- True element count is passed to the GPU via a dedicated _scalar_params slot, replacing the incorrect arrayLength() calculation for packed views.
- Sub-view element offset is now computed in u32 units (padding / 4) instead of logical CPU elements, fixing sort correctness for array sizes where the inner temp allocation doesn't start at a 256-byte boundary.
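Storing a 64-bit value in 32-bit words and rebuilding it, as the emulated f64 casts and packed structs above require, can be illustrated as follows. This Python sketch only shows the bit manipulation: the function names are hypothetical, and the low-word-first order is an assumption rather than a documented detail of the WGSL layout.

```python
import struct


def f64_to_u32_pair(value):
    """Split a double's IEEE-754 bit pattern into (low, high) 32-bit words."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    return bits & 0xFFFFFFFF, bits >> 32


def u32_pair_to_f64(low, high):
    """Rebuild the double from its (low, high) 32-bit words."""
    return struct.unpack("<d", struct.pack("<Q", (high << 32) | low))[0]


def bytes_to_u32_offset(padding_bytes):
    """Convert a byte offset into u32 units, i.e. padding / 4."""
    if padding_bytes % 4:
        raise ValueError("offset must be a multiple of 4 bytes")
    return padding_bytes // 4
```

For instance, 1.0 splits into the words `(0x00000000, 0x3FF00000)`, and a 96-byte padding corresponds to an offset of 24 u32 elements, not 12 eight-byte elements.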
Canvas Rendering (ICanvasRenderer)
- ICanvasRenderer API: New interface for presenting ILGPU pixel buffers (MemoryBuffer2D<uint/int>, packed RGBA) directly to an HTML <canvas> element. Obtained via CanvasRendererFactory.Create(accelerator).
- WebGPU: Zero-copy path. A cached WGSL fullscreen-triangle pipeline reads the pixel buffer directly from a read-only-storage binding. No CPU readback. Blit to the visible canvas via drawImage. Pipeline and bind-group are built once; uniforms only re-uploaded on resolution change.
- WebGL: Delegates to an offscreen FBO blit in the GL Web Worker. Result is transferred as ImageBitmap back to the main thread, preventing Blazor's render cycle from clearing the canvas between frames.
- CPU / Wasm: Fallback via putImageData. Browser-backed buffers use CopyToHostUint8ArrayAsync for a JS-side copy; pure CPU buffers fall back to synchronous CopyToCPU.
WebGPU Warp Reduce without Subgroups
GenerateWarpReduce now emits a full shared-memory butterfly reduction when the subgroups feature is unavailable, replacing the previous no-op passthrough. Correct results on hardware/drivers that don't expose subgroup extensions.
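A butterfly reduction of the kind named above can be simulated on the CPU to show its data flow. The Python sketch below is a generic model and not the emitted WGSL: the function name is hypothetical, the list copy stands in for a workgroup barrier between rounds, and the operator is assumed to be associative and commutative.

```python
def butterfly_reduce(values, op):
    """Simulate a shared-memory butterfly (XOR-partner) reduction.

    After log2(n) rounds, every lane holds the reduction of all inputs.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("lane count must be a power of two")
    lanes = list(values)
    offset = n // 2
    while offset > 0:
        # Snapshot models the barrier: all lanes read the previous round's
        # values before any lane overwrites its slot.
        snapshot = list(lanes)
        lanes = [op(snapshot[i], snapshot[i ^ offset]) for i in range(n)]
        offset //= 2
    return lanes
```

Reducing the lanes `[1, 2, 3, 4, 5, 6, 7, 8]` with addition leaves 36 in every lane after three rounds, which is the all-lanes result a subgroup reduce would otherwise provide.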
Algorithm Type Coverage
Added scan and reduce test/support variants for double, long, and uint:
| Operation | New Types |
|---|---|
| ExclusiveScan | double, uint |
| InclusiveScan | long, double, uint |
| AllReduce | double, long, uint |
| GroupReduce | float, long, double, uint, Half |
SpawnDev.ILGPU v3.3.0
SpawnDev.ILGPU v3.3.0 Release Notes
Desktop & Browser
- WPF Demo Application: new desktop demo running the same shared kernels (Fractal Explorer, 3D Raymarching, GPU Boids) on CUDA, OpenCL, and CPU with live backend switching
- Shared Kernel Library: extracted SpawnDev.ILGPU.Demo.Shared so browser and desktop demos share identical kernel code
- Console Test Runner: added SpawnDev.ILGPU.ConsoleDemo for running the full unit test suite on desktop backends with process isolation for crash resilience
- OpenCL 3.0 Compatibility: relaxed the GenericAddressSpace requirement, enabling NVIDIA GPUs with OpenCL 3.0 drivers that were previously blocked
- Multi-platform support: updated SupportedPlatform to include Windows, Linux, and macOS
WebGL2 Backend — GPU-Resident Buffers
The WebGL2 backend has been refactored to eliminate unnecessary CPU↔GPU data transfers:
- GPU-resident buffers: buffers persist as textures in the GL worker; kernel dispatch sends buffer references, not data
- On-demand readback: CopyToHostAsync() is the only GPU→CPU transfer path
- New worker protocol: allocBuffer, uploadBuffer, readbackBuffer, freeBuffer messages manage buffer lifecycle
- Proper buffer disposal: buffers are freed in the worker when disposed on the C# side
Wasm Backend Improvements
- Expanded API coverage including shared memory, barriers, dynamic shared memory, atomics, and broadcasting
- Single-worker fallback mode when SharedArrayBuffer is unavailable
Transpiler Fixes
- Break-PHI bug: fixed assignments before break in loops being dropped in WGSL and GLSL transpilers
- CopySign: corrected argument swap in the CopySign intrinsic
- 64-bit reduce: fixed signed/unsigned mismatch in MinUInt64 and emu_f64 buffer I/O for AddDouble/MaxDouble
- WebGL raymarching: fixed GLSL rendering issues
- BVH ray traversal: corrected WebGPU and WebGL backend issues for complex scene traversal
Upstream ILGPU Fixes
Six bugs from the original ILGPU repo have been fixed in our fork:
| Issue | Description | Severity |
|---|---|---|
| #1361 | MathF.CopySign argument order swapped: silent wrong results on all GPU backends | High |
| #1309 | uint to float cast routed through double: crashes on devices without fp64 | Medium |
| #1479 | Infinite compilation with large local arrays (new int[1_000_000]): 10+ min, 10+ GB RAM | High |
| #1538 | Internal Compiler Error with nested struct properties: wrong field slicing after type unification | Medium |
| #1539 | OpenCL produces wrong results for complex kernels: stale phi variables persisted across blocks | High |
| #1540 | H100/H200 not working: added SM_90, SM_100, SM_101, SM_120 architecture support | High |
See upstream-issues.md for detailed root cause analysis and fix descriptions.
Documentation
- Corrected synchronization semantics: Synchronize() = flush (non-blocking), SynchronizeAsync() = flush + wait, CopyToHostAsync() = only GPU→CPU path
- Updated test count to 640 tests across 8 suites
- Added WebGL GPU-resident buffer architecture documentation
- Reduced default logging verbosity across all backends
Demo Improvements
- Game of Life — fixed mouse interaction and added NavMenu icon
- Fractal Explorer — moved to shared kernel library, improved WebGL2 rendering pipeline
- Reduced console log noise for cleaner browser dev tools experience
Full Changelog: v3.2.0...v3.3.0
SpawnDev.ILGPU v3.2.0
Cross-platform GPU compute from a single codebase — browser and desktop.
What's New
🖥️ Desktop Support Verified
- SpawnDev.ILGPU now officially supports desktop/server environments (Console, WPF, ASP.NET) alongside Blazor WebAssembly
- Same NuGet package provides browser backends (WebGPU, WebGL, Wasm) and native backends (Cuda, OpenCL, CPU)
- SynchronizeAsync() and CopyToHostAsync() work everywhere: async in the browser, graceful sync fallback on desktop
- New SpawnDev.ILGPU.ConsoleDemo project included as a working reference
🎮 New Demos
- Game of Life — GPU-accelerated cellular automaton
- Boids 3D — Flocking simulation on all backends
- Compute 3D — 3D compute shader demo
🐛 Bug Fixes
- Fixed 3 transpiler bugs found during Game of Life development
- Fixed handling of Debug IL in WebGPU and WebGL transpilers
- Updated Wasm backend intrinsics
📚 Comprehensive Documentation
- New Docs/ folder with 8 markdown guides: Getting Started, Backends, Kernels, Memory & Buffers, Advanced Patterns (GPU intrinsics, device sharing, rendering), Limitations, and API Reference
- Covers both Blazor WASM and desktop usage
- Incorporates foundational ILGPU concepts adapted for the browser
Full Changelog
SpawnDev.ILGPU v3.0.0
What's New
🚀 Next-Generation GPU Computing in Blazor Wasm — v3.0.0 brings major performance improvements, streamlined architecture, and enhanced compatibility. Run C# ILGPU kernels on WebGPU, WebGL, and native WebAssembly with automatic backend selection.
Key Features
- Three Powerful Backends: WebGPU (modern GPU compute via WGSL), WebGL (universal GPU access via GLSL ES 3.0), and Wasm (native WebAssembly on Web Workers)
- CPU Backend: Standard ILGPU CPU accelerator included for debugging and performance comparison
- Universal GPU Access: WebGPU for cutting-edge browsers, WebGL for virtually every device
- Intelligent Auto-Selection: CreatePreferredAcceleratorAsync() automatically picks the best available backend (WebGPU → WebGL → Wasm)
- 64-bit Computing: Full double and long support via optimized emulation on both GPU backends
- Multi-Worker Dispatch: Wasm backend distributes work across all available CPU cores
- Zero-Copy Shared Memory: SharedArrayBuffer support for efficient data sharing
- Atomic Operations: Workgroup synchronization and atomic operations on WebGPU and Wasm backends
- Production Ready: Comprehensive test suite, stable APIs, and real-world optimization
Built For
- ✨ Blazor WebAssembly — Run compute-intensive C# kernels in the browser
- 🎮 Game Development — GPU-accelerated physics, graphics, and AI
- 📊 Data Processing — High-performance number crunching without native compilation
- 🔬 Scientific Computing — GPGPU capabilities in pure managed code
Resources
Full Changelog: v2.1.0...v3.0.0
SpawnDev.ILGPU v2.1.0
What's New
🖼️ New WebGL Backend — GPU-accelerated compute on virtually every modern browser and device. C# kernels are transpiled to GLSL ES 3.0 vertex shaders and executed via Transform Feedback, providing broad GPU access even where WebGPU isn't supported.
Highlights
- Five backends: WebGPU, WebGL, Wasm, Workers, and CPU
- Two GPU backends: WebGPU for cutting-edge browsers, WebGL for universal coverage
- Auto-selection: CreatePreferredAcceleratorAsync() picks the best available backend (WebGPU → WebGL → Wasm → Workers → CPU)
- 64-bit emulation on both GPU backends (double/long support via software emulation)
- Benchmarks page: New interactive benchmark suite comparing throughput across all backends
- Workers performance: Cached compiled functions and script bodies to reduce per-dispatch overhead
Links
Full Changelog: v2.0.0...v2.1.0
SpawnDev.ILGPU v2.0.0
SpawnDev.ILGPU v2.0.0 — First Stable Release
Run ILGPU kernels in the browser — on the GPU, across threads, or on the CPU.
SpawnDev.ILGPU v2.0.0 is the first stable release of this library, the successor to SpawnDev.ILGPU.WebGPU which only supported a single WebGPU backend. Version 2.0.0 brings four full compute backends, automatic device selection, and 360+ tests — all running entirely in the browser via Blazor WebAssembly.
What's New in 2.0.0
Four Compute Backends
| Backend | Executes on | Performance |
|---|---|---|
| WebGPU | GPU via WGSL transpilation | ⚡⚡⚡ Fastest |
| Wasm | Web Workers via native WebAssembly binary | ⚡⚡ Fast |
| Workers | Web Workers via JavaScript transpilation | ⚡ Moderate |
| CPU | Main thread via .NET runtime | 🐢 Fallback |
Automatic Backend Selection
Call CreatePreferredAcceleratorAsync() and the library picks the best available backend: WebGPU → Wasm → Workers → CPU.
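The selection order above amounts to a first-match walk over a preference list. The Python sketch below models only that ordering: the function name and the idea of passing in a set of available backends are assumptions for illustration, and the real method presumably probes the browser itself.

```python
# Preference order taken from the 2.0.0 release notes.
PREFERENCE_V2 = ("WebGPU", "Wasm", "Workers", "CPU")


def pick_backend(available, preference=PREFERENCE_V2):
    """Return the first backend in the preference order that is available."""
    for name in preference:
        if name in available:
            return name
    raise RuntimeError("no supported backend is available")
```

In a browser without WebGPU but with Web Workers and WebAssembly, `pick_backend({"Wasm", "Workers", "CPU"})` returns `"Wasm"`.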
Key Features
- WGSL transpilation: C# ILGPU kernels compiled to WebGPU Shading Language for GPU execution
- Wasm compilation: Kernels compiled to native WebAssembly binary modules for near-native performance
- 64-bit emulation: Full double (f64) and long (i64) support via software emulation on WebGPU
- WebGPU extension auto-detection: Probes adapter for shader-f16, subgroups, timestamp-query and enables them automatically
- Subgroup operations: Group.Broadcast and Warp.Shuffle supported when the browser exposes the subgroups extension
- Multi-worker dispatch: Wasm and Workers backends distribute work across all available CPU cores
- Shared memory & atomics: Workgroup memory, barriers, and atomic operations across backends
- No native dependencies: Pure C#, powered by SpawnDev.BlazorJS
360+ Tests
Comprehensive coverage across all backends: memory, indexing, arithmetic, bitwise, math functions, atomics, control flow, structs, type casting, 64-bit emulation, GPU patterns, shared memory, broadcast & subgroups, and more.
Interactive Demo
Try the live demo featuring a real-time Fractal Explorer that lets you switch between all four backends and compare performance.
Installation
dotnet add package SpawnDev.ILGPU
Breaking Changes from SpawnDev.ILGPU.WebGPU
This package replaces SpawnDev.ILGPU.WebGPU. Key differences:
- Namespace: SpawnDev.ILGPU (was SpawnDev.ILGPU.WebGPU)
- Multiple backends: WebGPU is no longer the only option; Wasm, Workers, and CPU backends are included
- Unified API: Context.CreateAsync() with builder pattern for all backends