23 Mar 01:49

LostBeard

f2c3a0e

SpawnDev.ILGPU v4.6.0

6 backends. 1,511 tests. Zero failures.

CUDA, OpenCL, CPU, WebGPU, WebGL, and Wasm — all passing. GPU compute in the browser is no longer experimental.

Highlights

Full Multi-Worker Wasm Barrier Dispatch

The Wasm backend can now dispatch across all navigator.hardwareConcurrency workers with group barriers and shared memory. A pure spin barrier built on i32.atomic.load loops replaces the previous wait32/notify approach, working around a V8 atomics visibility gap that caused data races with three or more workers.
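
The idea of a spin barrier can be sketched in plain C#. This is an analogy only, not the Wasm code the backend generates: the class names `SpinBarrier` and `SpinBarrierDemo` are hypothetical, and `Volatile.Read` stands in for the `i32.atomic.load` loop. Each worker increments an arrival counter and then spins on a generation counter until the last arrival advances it.

```csharp
using System.Threading;

// Hypothetical illustration; not part of SpawnDev.ILGPU.
public sealed class SpinBarrier
{
    private readonly int _participants;
    private int _arrived;     // workers that reached the barrier in this phase
    private int _generation;  // advanced once per completed phase

    public SpinBarrier(int participants) => _participants = participants;

    public void SignalAndWait()
    {
        int gen = Volatile.Read(ref _generation);
        if (Interlocked.Increment(ref _arrived) == _participants)
        {
            // Last arrival: reset the counter first, then release everyone.
            Volatile.Write(ref _arrived, 0);
            Interlocked.Increment(ref _generation);
        }
        else
        {
            // Spin on an atomic load; no wait/notify is involved.
            var spinner = new SpinWait();
            while (Volatile.Read(ref _generation) == gen) spinner.SpinOnce();
        }
    }
}

public static class SpinBarrierDemo
{
    // Each worker writes its slot, crosses the barrier, then reads every slot.
    public static bool AllWorkersSeeAllWrites(int workers, int phases)
    {
        var barrier = new SpinBarrier(workers);
        var slots = new int[workers];
        int failures = 0;
        var threads = new Thread[workers];
        for (int w = 0; w < workers; w++)
        {
            int id = w;
            threads[w] = new Thread(() =>
            {
                for (int p = 1; p <= phases; p++)
                {
                    slots[id] = p;
                    barrier.SignalAndWait();   // all writes of phase p are now visible
                    int sum = 0;
                    for (int i = 0; i < workers; i++) sum += slots[i];
                    if (sum != p * workers) Interlocked.Increment(ref failures);
                    barrier.SignalAndWait();   // nobody overwrites before all have read
                }
            });
            threads[w].Start();
        }
        foreach (var t in threads) t.Join();
        return failures == 0;
    }
}
```

Resetting the arrival counter before advancing the generation is what lets the same barrier object be reused for the next phase.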

RadixSort Verified at Scale

RadixSort passes across all data types and sizes up to 4M elements on every backend — including Wasm in the browser. Key fixes:

Histogram counter buffer sizing — fixed undersized counters that caused real out-of-bounds writes during grid-stride iteration
Grid-stride tail byte padding — extended linear-memory slack allocation to prevent OOB traps on packed buffers
Per-worker scratch isolation — eliminated intermittent sort corruption in non-barrier kernels
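
To make the counter-sizing point concrete, here is a minimal sequential sketch of a least-significant-digit radix sort. It is a generic textbook version, not the library's GPU implementation; `RadixSortSketch` and the 8-bit digit width are assumptions chosen for illustration. The counter array needs one slot per possible digit value, otherwise the histogram step writes out of bounds.

```csharp
using System;

// Hypothetical illustration; not the SpawnDev.ILGPU RadixSort.
public static class RadixSortSketch
{
    private const int BitsPerPass = 8;
    private const int Buckets = 1 << BitsPerPass; // one counter per digit value

    public static void Sort(uint[] keys)
    {
        var src = keys;
        var dst = new uint[keys.Length];
        var counters = new int[Buckets]; // undersizing this causes out-of-bounds writes

        for (int shift = 0; shift < 32; shift += BitsPerPass)
        {
            Array.Clear(counters, 0, counters.Length);

            // 1. Histogram of the current digit.
            foreach (uint k in src) counters[(k >> shift) & (Buckets - 1)]++;

            // 2. Exclusive scan turns counts into start offsets.
            int running = 0;
            for (int b = 0; b < Buckets; b++)
            {
                int count = counters[b];
                counters[b] = running;
                running += count;
            }

            // 3. Stable scatter into the destination buffer.
            foreach (uint k in src) dst[counters[(k >> shift) & (Buckets - 1)]++] = k;

            (src, dst) = (dst, src);
        }
        // With four passes the final swap leaves the sorted data in 'keys'.
    }
}
```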

20+ Wasm Codegen Fixes

Deep correctness pass across the Wasm code generator:

Fiber yield-per-phase with dynamic block splitting
Atomic loads/stores for all shared memory access in barrier kernels (including float via i32/i64 reinterpret)
Struct load copy semantics to prevent aliasing
Unsigned comparison in MinUInt32/MinUInt64 reductions
Correct atomic RMW opcode table for interleaved sub-word variants
Local alloca addressing, shared memory deduplication, and IR address space aliasing guards
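
The float-via-integer reinterpret trick mentioned above has a direct .NET analogue. The sketch below is an illustration under that analogy, not the Wasm codegen itself: `ReinterpretAtomics` is a hypothetical helper that stores a float's bit pattern in an `int` slot (and a double's in a `long` slot) so that integer atomic loads and stores can carry floating-point values.

```csharp
using System;
using System.Threading;

// Hypothetical illustration; not part of SpawnDev.ILGPU.
public static class ReinterpretAtomics
{
    // float <-> 32-bit integer slot
    public static void StoreFloat(ref int slot, float value) =>
        Volatile.Write(ref slot, BitConverter.SingleToInt32Bits(value));

    public static float LoadFloat(ref int slot) =>
        BitConverter.Int32BitsToSingle(Volatile.Read(ref slot));

    // double <-> 64-bit integer slot; Interlocked keeps the access atomic
    // even on 32-bit runtimes.
    public static void StoreDouble(ref long slot, double value) =>
        Interlocked.Exchange(ref slot, BitConverter.DoubleToInt64Bits(value));

    public static double LoadDouble(ref long slot) =>
        BitConverter.Int64BitsToDouble(Interlocked.Read(ref slot));
}
```

Because only the bit pattern moves through the atomic operation, the value round-trips exactly, including NaN payloads and signed zeros.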

WebGPU Backend Fixes

WGSL loop break + bool PHI: correct merge value generation when breaking from loops with boolean phi nodes
WGSL continuation after if-else with break: prevent unreachable code generation

Test Results

| Backend | Pass | Skip |
|---------|------|------|
| CUDA | all | — |
| OpenCL | all | — |
| CPU | all | — |
| WebGPU | 229 | 12 |
| WebGL | 139 | 115 |
| Wasm | 249 | 3 |
| Total | 1,511 | 162 |

WebGL skips are architectural (GLSL ES 3.0 lacks shared memory/barriers/atomics). Wasm skips are subgroup-dependent features not available in browser WebAssembly.

What This Means

This release proves that GPU-class parallel algorithms — radix sort, scan, reduce, atomics, shared memory, group barriers — run correctly in the browser across WebGPU, WebGL, and WebAssembly, alongside native CUDA, OpenCL, and CPU backends. Write your kernel once, run it everywhere.

15 Mar 22:23

LostBeard

v4.0.0

db4cb99

SpawnDev.ILGPU v4.0.0

Run ILGPU C# kernels on WebGPU, WebGL, Wasm, CUDA, OpenCL, and CPU — from a single codebase.

This is a major release with deep improvements to the WebGPU and Wasm backends, bringing ILGPU's algorithm library (RadixSort, Scan, Reduce) to the browser for the first time.

Highlights

WebGPU RadixSort — Full Algorithm Support

All RadixSort variants now pass on WebGPU, including large-scale sorts (4M+ elements), pairs, descending, and multiple data types. Fixed shared memory sizing, scan barrier synchronization, range checks for auto-grouped kernels, and 256-byte alignment padding for minStorageBufferOffsetAlignment.
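
The alignment padding amounts to rounding a byte offset up to the next multiple of the device limit. A minimal sketch, assuming the 256-byte value quoted above and a power-of-two alignment; `BufferAlignment` is a hypothetical helper name, not a library type.

```csharp
// Hypothetical illustration; not part of SpawnDev.ILGPU.
public static class BufferAlignment
{
    // Rounds 'offset' up to the next multiple of 'alignment'.
    // 'alignment' must be a power of two (256 in the case described above).
    public static long AlignUp(long offset, long alignment) =>
        (offset + alignment - 1) & ~(alignment - 1);

    // Number of padding bytes needed to reach the aligned offset.
    public static long Padding(long offset, long alignment) =>
        AlignUp(offset, alignment) - offset;
}
```

For example, a sub-buffer that would start at byte 1000 is moved to byte 1024, which costs 24 bytes of padding; an offset that is already a multiple of 256 is left unchanged.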

Wasm Backend — Barrier Kernel Infrastructure

The Wasm backend received 7 codegen and dispatch fixes enabling correct barrier-synchronized kernels (Scan, Reduce, and single-group RadixSort):

Struct-with-view serialization — Fixed CLR-to-IR layout mismatch for kernel structs containing ArrayViews (e.g., InitializerImplementation<T>). Manual IR-layout-aware serialization replaces Unsafe.Write.
View field mapping — Fixed GetField handler returning 0 for ArrayView1D's Extent (Length) field, which caused all view.Length checks to fail silently.
Local alloca addressing — Fixed local memory allocations defaulting to address 0, which caused the ExclusiveScan helper to corrupt the data buffer between sort passes.
Per-thread scratch memory — Each parallel Web Worker now gets its own scratch region, preventing cross-worker data races during struct construction.
Post-helper barriers — Added synchronization barriers after each ExclusiveScan helper call to prevent fast workers from starting the next scan while slow workers are still completing the previous one.
SpecializedValue unwrapping — Fixed dispatch to correctly extract scalar values from SpecializedValue<T> wrapper structs.
GetViewLength tracing — Added TraceToParameter() to resolve view sources through GetField/NewView chains.

WebGPU Backend Refactor

Major internal restructuring for maintainability and performance:

Extracted SharedMemoryResolver and UniformityAnalyzer into standalone subsystems
Per-function emulation library trimming via BFS dependency graph
Dead variable elimination post-pass for cleaner generated WGSL
i64 constant hoisting to module-scope const declarations
Pre-compiled regex patterns replacing runtime Regex.IsMatch calls
WGSL pre-validation (ValidateWGSL()) catches shader errors before GPU submission
KernelSpecialization for all algorithm kernel loaders (RadixSort, Histogram, Scan, etc.)
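
Trimming a helper library with a breadth-first search over its dependency graph can be sketched as follows. This is an assumption about the general shape of such a pass, not the backend's actual code; `HelperTrimmer` and the helper names used with it are hypothetical. Starting from the helpers a function calls directly, the search collects everything reachable, and anything not collected can be left out of the emitted shader.

```csharp
using System.Collections.Generic;

// Hypothetical illustration; not part of SpawnDev.ILGPU.
public static class HelperTrimmer
{
    // Returns every helper reachable from 'roots' through 'dependsOn'.
    public static HashSet<string> Reachable(
        IEnumerable<string> roots,
        IReadOnlyDictionary<string, string[]> dependsOn)
    {
        var keep = new HashSet<string>();
        var queue = new Queue<string>();
        foreach (var root in roots)
            if (keep.Add(root)) queue.Enqueue(root);

        while (queue.Count > 0)
        {
            var fn = queue.Dequeue();
            if (!dependsOn.TryGetValue(fn, out var deps)) continue; // leaf helper
            foreach (var dep in deps)
                if (keep.Add(dep)) queue.Enqueue(dep);              // visit each helper once
        }
        return keep;
    }
}
```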

Device Loss Detection

WebGPU: Monitors device.lost promise. IsDeviceLost property and DeviceLost event.
WebGL: Monitors webglcontextlost event via glWorker.js. IsContextLost property and ContextLost event.
Intentional disposal (Dispose()) is filtered out — only unexpected losses fire the events.

Test Infrastructure

PlaywrightMultiTest: Unified NUnit + Playwright runner executes all tests (desktop + browser) in a single dotnet test invocation
1,316 tests passing across all 6 backends (WebGPU, WebGL, Wasm, CUDA, OpenCL, CPU), 0 failures

Browser Backend Capabilities

| | WebGPU | WebGL | Wasm |
|---|--------|-------|------|
| Shared Memory | ✅ | ❌ | ✅ |
| Group.Barrier() | ✅ | ❌ | ✅ |
| Atomics | ✅ | ❌ | ✅ |
| ILGPU Algorithms | ✅ RadixSort, Scan, Reduce, Histogram | ❌ | ✅ Scan, Reduce (single-group) |
| 64-bit (f64/i64) | ✅ Emulated | ✅ Emulated | ✅ Native |

Known Limitations

Wasm multi-group barrier dispatch: Barrier kernels are fully correct for single-group workloads (up to 64 elements for groupSize=64). Multi-group workloads have a cross-group SharedArrayBuffer memory visibility limitation in current browsers. A cooperative scheduling fix is planned for a future release. Desktop backends and WebGPU have no such limitation.

Breaking Changes

None. Existing ILGPU kernels and API usage are fully compatible.

Installation

dotnet add package SpawnDev.ILGPU --version 4.0.0

SpawnDev.ILGPU 3.5.0

Half (f16) Support

WebGPU f16 kernels — Float16 maps to native f16 in WGSL. Buffer alignment, constant emission, and Half ↔ float conversion intrinsics all wired up. Capability-gated on device feature support.
XMath.Min/Max/Clamp for Half — Added to XMath via float promotion.
Group Scan/Reduce for Half — ExclusiveScan, InclusiveScan, AllReduce, and GroupReduce now support Half on WebGPU and CUDA.
CUDA PTX Half warp shuffles — WarpShuffle, WarpShuffleDown, WarpShuffleUp, WarpShuffleXor (and SubWarp variants) for Half via b32 widening. Unlocks Half scan/reduce on CUDA.
Lock-free AllReduce — Rewrote AllReduce in both IL and PTX backends to use per-warp shared-memory slots instead of atomic operations. Removes the Half atomics dependency entirely and is correct for all types.
Half.One constant fix — Was 0x0001 (denormal ≈5.96e-8); corrected to 0x3C00 (IEEE-754 1.0).
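
The two bit patterns in the Half.One fix can be checked with the .NET `System.Half` type, used here as a stand-in for ILGPU's own Half type (an assumption; the two are separate types). The sketch needs .NET 6 or later for the `BitConverter` overloads, and `HalfBits` is a hypothetical helper name.

```csharp
using System;

// Hypothetical illustration using System.Half, not ILGPU's Half.
public static class HalfBits
{
    // Raw IEEE-754 binary16 bit pattern of a value.
    public static ushort BitsOf(Half value) => BitConverter.HalfToUInt16Bits(value);

    // Value represented by a raw binary16 bit pattern, widened to float.
    public static float FromBits(ushort bits) => (float)BitConverter.UInt16BitsToHalf(bits);
}
```

`HalfBits.BitsOf((Half)1.0f)` yields 0x3C00, while `HalfBits.FromBits(0x0001)` is the smallest positive subnormal, 2^-24 (about 5.96e-8), which matches the two values quoted in the fix.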

WebGPU RadixSort with `double` / `long` Keys

RadixSortPairs<double, …> and RadixSortPairs<long, …> now work on WebGPU. Multiple root causes fixed end-to-end:
- FloatAsInt/IntAsFloat casts for emulated f64 now correctly reconstruct the IEEE-754 64-bit pattern.
- Structs containing emulated 64-bit fields are flattened to array<u32> in WGSL ("packed structs") to match CPU memory layout.
- True element count is passed to the GPU via a dedicated _scalar_params slot, replacing the incorrect arrayLength() calculation for packed views.
- Sub-view element offset is now computed in u32 units (padding / 4) instead of logical CPU elements, fixing sort correctness for array sizes where the inner temp allocation doesn't start at a 256-byte boundary.
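
The relationship between a 64-bit float and the pair of 32-bit words that represents it in an emulated layout can be sketched as below. This is a CPU-side illustration, not the WGSL the backend emits; `EmulatedF64` is a hypothetical name and the low-word-first ordering is an assumption.

```csharp
using System;

// Hypothetical illustration; not part of SpawnDev.ILGPU.
public static class EmulatedF64
{
    // Splits a double into its low and high 32-bit words.
    public static (uint Lo, uint Hi) Split(double value)
    {
        ulong bits = unchecked((ulong)BitConverter.DoubleToInt64Bits(value));
        return (unchecked((uint)bits), (uint)(bits >> 32));
    }

    // Rebuilds the exact IEEE-754 64-bit pattern from the two words.
    public static double Join(uint lo, uint hi)
    {
        ulong bits = ((ulong)hi << 32) | lo;
        return BitConverter.Int64BitsToDouble(unchecked((long)bits));
    }
}
```

A packed struct with one such field therefore occupies two consecutive u32 slots, which is why offsets into the flattened array are counted in 4-byte units.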

Canvas Rendering (`ICanvasRenderer`)

ICanvasRenderer API — New interface for presenting ILGPU pixel buffers (MemoryBuffer2D<uint/int>, packed RGBA) directly to an HTML <canvas> element. Obtained via CanvasRendererFactory.Create(accelerator).
WebGPU — Zero-copy path: a cached WGSL fullscreen-triangle pipeline reads the pixel buffer directly from a read-only-storage binding. No CPU readback. Blit to the visible canvas via drawImage. Pipeline and bind-group are built once; uniforms only re-uploaded on resolution change.
WebGL — Delegates to an offscreen FBO blit in the GL Web Worker. Result is transferred as ImageBitmap back to the main thread, preventing Blazor's render cycle from clearing the canvas between frames.
CPU / Wasm — Fallback via putImageData. Browser-backed buffers use CopyToHostUint8ArrayAsync for a JS-side copy; pure CPU buffers fall back to synchronous CopyToCPU.

WebGPU Warp Reduce without Subgroups

GenerateWarpReduce now emits a full shared-memory butterfly reduction when the subgroups feature is unavailable, replacing the previous no-op passthrough. Correct results on hardware/drivers that don't expose subgroup extensions.
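
A butterfly reduction pairs each lane with the lane whose index differs in one bit, halving the stride each step, so every lane ends up holding the full result. The sketch below simulates that pattern sequentially on the CPU; it is not the generated WGSL, `ButterflyReduce` is a hypothetical name, and the lane count is assumed to be a power of two.

```csharp
// Hypothetical illustration; not part of SpawnDev.ILGPU.
public static class ButterflyReduce
{
    // 'lanes.Length' must be a power of two. Returns the value each lane holds at the end.
    public static int[] AllReduceSum(int[] lanes)
    {
        var shared = (int[])lanes.Clone();   // stands in for workgroup shared memory
        var next = new int[shared.Length];

        for (int stride = shared.Length / 2; stride >= 1; stride /= 2)
        {
            // On a GPU every lane runs this body in parallel, with a barrier
            // between reading partner values and writing results.
            for (int lane = 0; lane < shared.Length; lane++)
                next[lane] = shared[lane] + shared[lane ^ stride];

            (shared, next) = (next, shared);
        }
        return shared;
    }
}
```

The double buffer plays the role of the barrier in this sequential simulation: no lane reads a partner value that has already been overwritten in the same step.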

Algorithm Type Coverage

Added scan and reduce test/support variants for double, long, and uint:

| Operation | New Types |
|-----------|-----------|
| `ExclusiveScan` | `double`, `uint` |
| `InclusiveScan` | `long`, `double`, `uint` |
| `AllReduce` | `double`, `long`, `uint` |
| `GroupReduce` | `float`, `long`, `double`, `uint`, `Half` |
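
For readers unfamiliar with the two scan flavours, a sequential reference for `uint` looks like this. It only illustrates the semantics; `ScanSketch` is a hypothetical name and the library's GPU implementations work differently.

```csharp
// Hypothetical illustration; not part of SpawnDev.ILGPU.
public static class ScanSketch
{
    // output[i] = input[0] + ... + input[i]
    public static uint[] InclusiveScan(uint[] input)
    {
        var output = new uint[input.Length];
        uint running = 0;
        for (int i = 0; i < input.Length; i++)
        {
            running += input[i];
            output[i] = running;
        }
        return output;
    }

    // output[i] = input[0] + ... + input[i - 1], with output[0] = 0
    public static uint[] ExclusiveScan(uint[] input)
    {
        var output = new uint[input.Length];
        uint running = 0;
        for (int i = 0; i < input.Length; i++)
        {
            output[i] = running;
            running += input[i];
        }
        return output;
    }
}
```

For the input 3, 1, 4, 1 the inclusive scan is 3, 4, 8, 9 and the exclusive scan is 0, 3, 4, 8.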

22 Feb 06:41

LostBeard

v3.3.0

c04e7d0

SpawnDev.ILGPU v3.3.0

SpawnDev.ILGPU v3.3.0 Release Notes

Desktop & Browser

WPF Demo Application — new desktop demo running the same shared kernels (Fractal Explorer, 3D Raymarching, GPU Boids) on CUDA, OpenCL, and CPU with live backend switching
Shared Kernel Library — extracted SpawnDev.ILGPU.Demo.Shared so browser and desktop demos share identical kernel code
Console Test Runner — added SpawnDev.ILGPU.ConsoleDemo for running the full unit test suite on desktop backends with process isolation for crash resilience
OpenCL 3.0 Compatibility — relaxed the GenericAddressSpace requirement, enabling NVIDIA GPUs with OpenCL 3.0 drivers that were previously blocked
Multi-platform support — updated SupportedPlatform to include Windows, Linux, and macOS

WebGL2 Backend — GPU-Resident Buffers

The WebGL2 backend has been refactored to eliminate unnecessary CPU↔GPU data transfers:

GPU-resident buffers — buffers persist as textures in the GL worker; kernel dispatch sends buffer references, not data
On-demand readback — CopyToHostAsync() is the only GPU→CPU transfer path
New worker protocol — allocBuffer, uploadBuffer, readbackBuffer, freeBuffer messages manage buffer lifecycle
Proper buffer disposal — buffers are freed in the worker when disposed on the C# side

Wasm Backend Improvements

Expanded API coverage including shared memory, barriers, dynamic shared memory, atomics, and broadcasting
Single-worker fallback mode when SharedArrayBuffer is unavailable

Transpiler Fixes

Break-PHI bug — fixed assignments before break in loops being dropped in WGSL and GLSL transpilers
CopySign — corrected argument swap in the CopySign intrinsic
64-bit reduce — fixed signed/unsigned mismatch in MinUInt64 and emu_f64 buffer I/O for AddDouble/MaxDouble
WebGL raymarching — fixed GLSL rendering issues
BVH ray traversal — corrected WebGPU and WebGL backend issues for complex scene traversal
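
The signed/unsigned mismatch in a 64-bit minimum is easy to reproduce on the CPU. The sketch below is an illustration of the class of bug, not the transpiler code; `UnsignedMin` is a hypothetical name. Comparing the same bits as signed values inverts the ordering for anything at or above 2^63.

```csharp
// Hypothetical illustration; not part of SpawnDev.ILGPU.
public static class UnsignedMin
{
    // Correct: compares the operands as unsigned 64-bit values.
    public static ulong MinUnsigned(ulong a, ulong b) => a < b ? a : b;

    // Buggy variant: reinterprets the bits as signed before comparing,
    // so large unsigned values look negative and win the comparison.
    public static ulong MinViaSignedCompare(ulong a, ulong b) =>
        unchecked((long)a < (long)b) ? a : b;
}
```

With inputs 1 and `ulong.MaxValue`, the unsigned version returns 1, while the signed comparison treats `ulong.MaxValue` as -1 and returns it instead.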

Upstream ILGPU Fixes

Six bugs from the original ILGPU repo have been fixed in our fork:

| Issue | Description | Severity |
|-------|-------------|----------|
| #1361 | `MathF.CopySign` argument order swapped — silent wrong results on all GPU backends | High |
| #1309 | `uint` to `float` cast routed through `double` — crashes on devices without fp64 | Medium |
| #1479 | Infinite compilation with large local arrays (`new int[1_000_000]`) — 10+ min, 10+ GB RAM | High |
| #1538 | Internal Compiler Error with nested struct properties — wrong field slicing after type unification | Medium |
| #1539 | OpenCL produces wrong results for complex kernels — stale phi variables persisted across blocks | High |
| #1540 | H100/H200 not working — added SM_90, SM_100, SM_101, SM_120 architecture support | High |

See upstream-issues.md for detailed root cause analysis and fix descriptions.
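
As a reminder of why the CopySign argument order matters, `MathF.CopySign(x, y)` in .NET returns the magnitude of `x` with the sign of `y`. The sketch below contrasts that with a swapped call; `CopySignCheck` is a hypothetical name, and the swapped variant only illustrates the kind of error described for #1361, not the actual faulty code.

```csharp
using System;

// Hypothetical illustration; not part of SpawnDev.ILGPU or ILGPU.
public static class CopySignCheck
{
    // Reference semantics: magnitude of x, sign of y.
    public static float Reference(float x, float y) => MathF.CopySign(x, y);

    // What a lowering with swapped arguments would compute instead.
    public static float Swapped(float x, float y) => MathF.CopySign(y, x);
}
```

For x = 3 and y = -1 the reference result is -3, whereas the swapped call returns 1: a plausible-looking number that raises no error.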

Documentation

Corrected synchronization semantics: Synchronize() = flush (non-blocking), SynchronizeAsync() = flush + wait, CopyToHostAsync() = only GPU→CPU path
Updated test count to 640 tests across 8 suites
Added WebGL GPU-resident buffer architecture documentation
Reduced default logging verbosity across all backends

Demo Improvements

Game of Life — fixed mouse interaction and added NavMenu icon
Fractal Explorer — moved to shared kernel library, improved WebGL2 rendering pipeline
Reduced console log noise for cleaner browser dev tools experience

Full Changelog: v3.2.0...v3.3.0

21 Feb 14:14

LostBeard

v3.2.0

bdf22cb

SpawnDev.ILGPU v3.2.0

Cross-platform GPU compute from a single codebase — browser and desktop.

What's New

🖥️ Desktop Support Verified

SpawnDev.ILGPU now officially supports desktop/server environments (Console, WPF, ASP.NET) alongside Blazor WebAssembly
Same NuGet package provides browser backends (WebGPU, WebGL, Wasm) and native backends (Cuda, OpenCL, CPU)
SynchronizeAsync() and CopyToHostAsync() work everywhere — async in the browser, graceful sync fallback on desktop
New SpawnDev.ILGPU.ConsoleDemo project included as a working reference

🎮 New Demos

Game of Life — GPU-accelerated cellular automaton
Boids 3D — Flocking simulation on all backends
Compute 3D — 3D compute shader demo

🐛 Bug Fixes

Fixed 3 transpiler bugs found during Game of Life development
Fixed handling of Debug IL in WebGPU and WebGL transpilers
Updated Wasm backend intrinsics

📚 Comprehensive Documentation

New Docs/ folder with 8 markdown guides: Getting Started, Backends, Kernels, Memory & Buffers, Advanced Patterns (GPU intrinsics, device sharing, rendering), Limitations, and API Reference
Covers both Blazor WASM and desktop usage
Incorporates foundational ILGPU concepts adapted for the browser

Full Changelog

See README.md and Docs/ for complete documentation.

16 Feb 17:39

LostBeard

v3.0.0

6c5a0f8

SpawnDev.ILGPU v3.0.0

What's New

🚀 Next-Generation GPU Computing in Blazor Wasm — v3.0.0 brings major performance improvements, streamlined architecture, and enhanced compatibility. Run C# ILGPU kernels on WebGPU, WebGL, and native WebAssembly with automatic backend selection.

Key Features

Three Powerful Backends — WebGPU (modern GPU compute via WGSL), WebGL (universal GPU access via GLSL ES 3.0), and Wasm (native WebAssembly on Web Workers)
CPU Backend — Standard ILGPU CPU accelerator included for debugging and performance comparison
Universal GPU Access — WebGPU for cutting-edge browsers, WebGL for virtually every device
Intelligent Auto-Selection — CreatePreferredAcceleratorAsync() automatically picks the best available backend (WebGPU → WebGL → Wasm)
64-bit Computing — Full double and long support via optimized emulation on both GPU backends
Multi-Worker Dispatch — Wasm backend distributes work across all available CPU cores
Zero-Copy Shared Memory — SharedArrayBuffer support for efficient data sharing
Atomic Operations — Workgroup synchronization and atomic operations on WebGPU and Wasm backends
Production Ready — Comprehensive test suite, stable APIs, and real-world optimization

Built For

✨ Blazor WebAssembly — Run compute-intensive C# kernels in the browser
🎮 Game Development — GPU-accelerated physics, graphics, and AI
📊 Data Processing — High-performance number crunching without native compilation
🔬 Scientific Computing — GPGPU capabilities in pure managed code

Resources

Full Changelog: v2.1.0...v3.0.0

13 Feb 20:41

LostBeard

v2.1.0

4e1e8eb

SpawnDev.ILGPU v2.1.0

What's New

🖼️ New WebGL Backend — GPU-accelerated compute on virtually every modern browser and device. C# kernels are transpiled to GLSL ES 3.0 vertex shaders and executed via Transform Feedback, providing broad GPU access even where WebGPU isn't supported.

Highlights

Five backends — WebGPU, WebGL, Wasm, Workers, and CPU
Two GPU backends — WebGPU for cutting-edge browsers, WebGL for universal coverage
Auto-selection — CreatePreferredAcceleratorAsync() picks the best available backend (WebGPU → WebGL → Wasm → Workers → CPU)
64-bit emulation on both GPU backends (double/long support via software emulation)
Benchmarks page — New interactive benchmark suite comparing throughput across all backends
Workers performance — Cached compiled functions and script bodies to reduce per-dispatch overhead

Links

Full Changelog: v2.0.0...v2.1.0

09 Feb 23:23

LostBeard

v2.0.0

3e793df

SpawnDev.ILGPU v2.0.0

SpawnDev.ILGPU v2.0.0 — First Stable Release

Run ILGPU kernels in the browser — on the GPU, across threads, or on the CPU.

SpawnDev.ILGPU v2.0.0 is the first stable release of this library and the successor to SpawnDev.ILGPU.WebGPU, which supported only a single WebGPU backend. Version 2.0.0 brings four full compute backends, automatic device selection, and 360+ tests, all running entirely in the browser via Blazor WebAssembly.

What's New in 2.0.0

Four Compute Backends

| Backend | Executes on | Performance |
|---------|-------------|-------------|
| WebGPU | GPU via WGSL transpilation | ⚡⚡⚡ Fastest |
| Wasm | Web Workers via native WebAssembly binary | ⚡⚡ Fast |
| Workers | Web Workers via JavaScript transpilation | ⚡ Moderate |
| CPU | Main thread via .NET runtime | 🐢 Fallback |

Automatic Backend Selection

Call CreatePreferredAcceleratorAsync() and the library picks the best available backend: WebGPU → Wasm → Workers → CPU.

Key Features

WGSL transpilation — C# ILGPU kernels compiled to WebGPU Shading Language for GPU execution
Wasm compilation — Kernels compiled to native WebAssembly binary modules for near-native performance
64-bit emulation — Full double (f64) and long (i64) support via software emulation on WebGPU
WebGPU extension auto-detection — Probes adapter for shader-f16, subgroups, timestamp-query and enables them automatically
Subgroup operations — Group.Broadcast and Warp.Shuffle supported when the browser exposes the subgroups extension
Multi-worker dispatch — Wasm and Workers backends distribute work across all available CPU cores
Shared memory & atomics — Workgroup memory, barriers, and atomic operations across backends
No native dependencies — Pure C#, powered by SpawnDev.BlazorJS
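
Software emulation of 64-bit integers on 32-bit hardware generally comes down to carrying between two 32-bit words. The sketch below shows the idea for addition only; it is a generic illustration rather than the library's emulation code, and `EmulatedI64` with its low-word-first tuple layout is an assumption.

```csharp
// Hypothetical illustration; not part of SpawnDev.ILGPU.
public static class EmulatedI64
{
    // Adds two 64-bit values, each given as (low word, high word).
    public static (uint Lo, uint Hi) Add((uint Lo, uint Hi) a, (uint Lo, uint Hi) b)
    {
        uint lo = unchecked(a.Lo + b.Lo);
        uint carry = lo < a.Lo ? 1u : 0u;          // wrap-around in the low word means carry out
        uint hi = unchecked(a.Hi + b.Hi + carry);  // overflow past 64 bits is discarded
        return (lo, hi);
    }

    // Reassembles the pair into a native 64-bit value for checking.
    public static ulong ToUInt64((uint Lo, uint Hi) v) => ((ulong)v.Hi << 32) | v.Lo;
}
```

Adding (0xFFFFFFFF, 0) and (1, 0) gives (0, 1), the same as 0xFFFFFFFF + 1 computed natively in 64 bits.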

360+ Tests

Comprehensive coverage across all backends: memory, indexing, arithmetic, bitwise, math functions, atomics, control flow, structs, type casting, 64-bit emulation, GPU patterns, shared memory, broadcast & subgroups, and more.

Interactive Demo

Try the live demo featuring a real-time Fractal Explorer that lets you switch between all four backends and compare performance.

Installation

dotnet add package SpawnDev.ILGPU

Breaking Changes from SpawnDev.ILGPU.WebGPU

This package replaces SpawnDev.ILGPU.WebGPU. Key differences:

Namespace: SpawnDev.ILGPU (was SpawnDev.ILGPU.WebGPU)
Multiple backends: WebGPU is no longer the only option — Wasm, Workers, and CPU backends are included
Unified API: Context.CreateAsync() with builder pattern for all backends

Releases: LostBeard/SpawnDev.ILGPU