ONNX Runtime v1.17.3
What's new?
General:
- Update copying API header files to make Linux logic consistent with Windows (#19736) - @mszhanyi
- Pin ONNX version to fix DML and Python packaging pipeline exceptions (#20073) - @mszhanyi
CUDA EP:
- Fix onnxruntime_test_all build break with CUDA (#19673) - @gedoensmax
- Fix broken pooling CUDA NHWC ops and ensure NCHW / NHWC parity (#19889) - @mtavenrath
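The pooling fix above also tightened NCHW / NHWC parity. One way to sanity-check parity on your own model is to run the same input through the CUDA EP and the CPU EP and compare results; below is a minimal sketch, where the model path, tolerances, and dynamic-dimension handling are placeholder assumptions:

```python
import numpy as np
import onnxruntime as ort

MODEL = "model.onnx"  # hypothetical path; any model with conv/pooling ops works

cpu = ort.InferenceSession(MODEL, providers=["CPUExecutionProvider"])
cuda = ort.InferenceSession(MODEL, providers=["CUDAExecutionProvider"])

# Build a random input matching the model's first input, filling dynamic dims with 1.
inp = cpu.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.rand(*shape).astype(np.float32)

# Compare every output of the two providers within a loose float tolerance.
for ref, out in zip(cpu.run(None, {inp.name: x}), cuda.run(None, {inp.name: x})):
    np.testing.assert_allclose(ref, out, rtol=1e-3, atol=1e-4)
print("CUDA EP matches CPU EP within tolerance")
```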
TensorRT EP:
- Fix TensorRT build break caused by image update (#19880) - @jywu-msft
- Fix TensorRT custom op list concurrency bug (#20093) - @chilo-ms
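The concurrency fix above addresses races seen when multiple sessions using the TensorRT EP are created at the same time. A minimal sketch of that usage pattern (placeholder model path; requires a TensorRT-enabled build):

```python
import threading
import onnxruntime as ort

MODEL = "model.onnx"  # hypothetical path

def create_and_run():
    # Each thread creates its own session on the TensorRT EP,
    # falling back to CUDA and then CPU if TensorRT is unavailable.
    sess = ort.InferenceSession(
        MODEL,
        providers=[
            "TensorrtExecutionProvider",
            "CUDAExecutionProvider",
            "CPUExecutionProvider",
        ],
    )
    print(sess.get_providers())

threads = [threading.Thread(target=create_and_run) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```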
Web:
- Add hardSigmoid op support and hardSigmoid activation for fusedConv (#19215, #19233) - @qjia7
- Add support for WebNN async API with Asyncify (#19415) - @Honry
- Add uniform support for conv, conv transpose, conv grouped, and fp16 (#18753, #19098) - @axinging
- Add capture and replay support for JS EP (#18989) - @fs-eire
- Add LeakyRelu activation for fusedConv (#19369) - @qjia7
- Add FastGelu custom op support (#19392) - @fs-eire
- Allow uint8 tensors for WebGPU (#19545) - @satyajandhyala
- Add and optimize MatMulNBits (#19852) - @satyajandhyala
- Enable ort-web with any Float16Array polyfill (#19305) - @fs-eire
- Allow multiple EPs to be specified in backend resolve logic (#19735) - @fs-eire
- Various bug fixes: (#19258) - @gyagp, (#19201, #19554) - @hujiajie, (#19262, #19981) - @guschmue, (#19581, #19596, #19387) - @axinging, (#19613) - @satyajandhyala
- Various improvements for performance and usability: (#19202) - @qjia7, (#18900, #19281, #18883) - @axinging, (#18788, #19737) - @satyajandhyala, (#19610) - @segevfiner, (#19614, #19702, #19677, #19857, #19940) - @fs-eire, (#19791) - @gyagp, (#19868) - @guschmue, (#19433) - @martholomew, (#19932) - @ibelem
Windows:
- Fix Windows memory mapping bug affecting some larger models (#19623) - @yufenglee
Kernel Optimizations:
- Fix GQA and Rotary Embedding bugs affecting some models (#19801, #19874) - @aciddelgado
- Update replacement of MultiHeadAttention (MHA) and GroupQueryAttention (GQA) (#19882) - @kunal-vaishnavi
- Add support for packed QKV input and Rotary Embedding with sm<80 using Memory Efficient Attention kernel (#20012) - @aciddelgado
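These attention kernels are typically reached through the transformers optimizer, which fuses attention subgraphs into the MultiHeadAttention / GroupQueryAttention contrib ops. Below is a minimal sketch of invoking that fusion pass; the paths, model type, and head/hidden sizes are placeholders that must match your model:

```python
from onnxruntime.transformers.optimizer import optimize_model

# Placeholder configuration: num_heads and hidden_size must match the source model.
opt = optimize_model(
    "model.onnx",
    model_type="bert",
    num_heads=12,
    hidden_size=768,
)
opt.save_model_to_file("model_optimized.onnx")
```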
Models:
- Add support for benchmarking LLaMA model end-to-end performance (#19985, #20033, #20149) - @kunal-vaishnavi
- Add example to demonstrate export of OpenAI Whisper implementation with batched prompts (#19854) - @shubhambhokare1
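The LLaMA benchmarking scripts live with the rest of the LLaMA tooling in the repository (onnxruntime/python/tools/transformers/models/llama). For a rough end-to-end number without those scripts, a session can be timed directly; a minimal sketch, where the model path and input names are hypothetical (a real LLaMA export also expects attention-mask and past-KV inputs):

```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical exported decoder; real exports take more inputs than shown here.
sess = ort.InferenceSession(
    "llama.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
feed = {"input_ids": np.random.randint(0, 32000, size=(1, 128), dtype=np.int64)}

sess.run(None, feed)  # warm-up run to exclude one-time setup cost
start = time.perf_counter()
runs = 20
for _ in range(runs):
    sess.run(None, feed)
print(f"average latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")
```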
This patch release also includes additional fixes by @spampana95 and @enximi. Big thank you to all our contributors!