ONNX Runtime v1.17.3
What's new?
General:
- Update copying API header files to make Linux logic consistent with Windows (#19736) - @mszhanyi
- Pin ONNX version to fix DML and Python packaging pipeline exceptions (#20073) - @mszhanyi
CUDA EP:
- Fix onnxruntime_test_all build break with CUDA (#19673) - @gedoensmax
- Fix broken pooling CUDA NHWC ops and ensure NCHW / NHWC parity (#19889) - @mtavenrath
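The pooling fix above also tightened NCHW / NHWC parity. One way to sanity-check parity on your own model is to run the same input through the CUDA EP and the CPU EP and compare results; below is a minimal sketch, where the model path, tolerances, and dynamic-dimension handling are placeholder assumptions:

```python
import numpy as np
import onnxruntime as ort

MODEL = "model.onnx"  # hypothetical path; any model with conv/pooling ops works

cpu = ort.InferenceSession(MODEL, providers=["CPUExecutionProvider"])
cuda = ort.InferenceSession(MODEL, providers=["CUDAExecutionProvider"])

# Build a random input matching the model's first input, filling dynamic dims with 1.
inp = cpu.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.rand(*shape).astype(np.float32)

# Compare every output of the two providers within a loose float tolerance.
for ref, out in zip(cpu.run(None, {inp.name: x}), cuda.run(None, {inp.name: x})):
    np.testing.assert_allclose(ref, out, rtol=1e-3, atol=1e-4)
print("CUDA EP matches CPU EP within tolerance")
```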
TensorRT EP:
- Fix TensorRT build break caused by image update (#19880) - @jywu-msft
- Fix TensorRT custom op list concurrency bug (#20093) - @chilo-ms
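The concurrency fix above addresses races seen when multiple sessions using the TensorRT EP are created at the same time. A minimal sketch of that usage pattern (placeholder model path; requires a TensorRT-enabled build):

```python
import threading
import onnxruntime as ort

MODEL = "model.onnx"  # hypothetical path

def create_and_run():
    # Each thread creates its own session on the TensorRT EP,
    # falling back to CUDA and then CPU if TensorRT is unavailable.
    sess = ort.InferenceSession(
        MODEL,
        providers=[
            "TensorrtExecutionProvider",
            "CUDAExecutionProvider",
            "CPUExecutionProvider",
        ],
    )
    print(sess.get_providers())

threads = [threading.Thread(target=create_and_run) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```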
Web:
- Add hardSigmoid op support and hardSigmoid activation for fusedConv (#19215, #19233) - @qjia7
- Add support for WebNN async API with Asyncify (#19415) - @Honry
- Add uniform support for conv, conv transpose, conv grouped, and fp16 (#18753, #19098) - @axinging
- Add capture and replay support for JS EP (#18989) - @fs-eire
- Add LeakyRelu activation for fusedConv (#19369) - @qjia7
- Add FastGelu custom op support (#19392) - @fs-eire
- Allow uint8 tensors for WebGPU (#19545) - @satyajandhyala
- Add and optimize MatMulNBits (#19852) - @satyajandhyala
- Enable ort-web with any Float16Array polyfill (#19305) - @fs-eire
- Allow multiple EPs to be specified in backend resolve logic (#19735) - @fs-eire
- Various bug fixes: (#19258) - @gyagp, (#19201, #19554) - @hujiajie, (#19262, #19981) - @guschmue, (#19581, #19596, #19387) - @axinging, (#19613) - @satyajandhyala
- Various improvements for performance and usability: (#19202) - @qjia7, (#18900, #19281, #18883) - @axinging, (#18788, #19737) - @satyajandhyala, (#19610) - @segevfiner, (#19614, #19702, #19677, #19857, #19940) - @fs-eire, (#19791) - @gyagp, (#19868) - @guschmue, (#19433) - @martholomew, (#19932) - @ibelem
Windows:
- Fix Windows memory mapping bug affecting some larger models (#19623) - @yufenglee
Kernel Optimizations:
- Fix GQA and Rotary Embedding bugs affecting some models (#19801, #19874) - @aciddelgado
- Update replacement of MultiHeadAttention (MHA) and GroupQueryAttention (GQA) (#19882) - @kunal-vaishnavi
- Add support for packed QKV input and Rotary Embedding with sm<80 using Memory Efficient Attention kernel (#20012) - @aciddelgado
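These attention kernels are typically reached through the transformers optimizer, which fuses attention subgraphs into the MultiHeadAttention / GroupQueryAttention contrib ops. Below is a minimal sketch of invoking that fusion pass; the paths, model type, and head/hidden sizes are placeholders that must match your model:

```python
from onnxruntime.transformers.optimizer import optimize_model

# Placeholder configuration: num_heads and hidden_size must match the source model.
opt = optimize_model(
    "model.onnx",
    model_type="bert",
    num_heads=12,
    hidden_size=768,
)
opt.save_model_to_file("model_optimized.onnx")
```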
Models:
- Add support for benchmarking LLaMA model end-to-end performance (#19985, #20033, #20149) - @kunal-vaishnavi
- Add example to demonstrate export of OpenAI Whisper implementation with batched prompts (#19854) - @shubhambhokare1
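The LLaMA benchmarking scripts live with the rest of the LLaMA tooling in the repository (onnxruntime/python/tools/transformers/models/llama). For a rough end-to-end number without those scripts, a session can be timed directly; a minimal sketch, where the model path and input names are hypothetical (a real LLaMA export also expects attention-mask and past-KV inputs):

```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical exported decoder; real exports take more inputs than shown here.
sess = ort.InferenceSession(
    "llama.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
feed = {"input_ids": np.random.randint(0, 32000, size=(1, 128), dtype=np.int64)}

sess.run(None, feed)  # warm-up run to exclude one-time setup cost
start = time.perf_counter()
runs = 20
for _ in range(runs):
    sess.run(None, feed)
print(f"average latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")
```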
This patch release also includes additional fixes by @spampana95 and @enximi. Big thank you to all our contributors!