Release Notes

Major New Features and Improvements

The "eager deletion" memory optimization strategy now supports sub-blocks in control flow operators (e.g. if-else, while), significantly reducing the GPU memory consumption of models that use control flow operators.
Optimized the split operator, significantly improving performance.
Extended the multiclass_nms operator to support polygon bounding boxes.
Added a CUDA implementation of the generate_proposals operator, significantly improving performance.
Added support for fusing the affine_channel operator with the batch_norm operator, significantly improving performance.
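Optimized the forward and backward passes of the depthwise_conv operator, significantly improving performance.
Optimized the reduce_mean operator, significantly improving performance.
Optimized the sum operator: when the inputs are Tensors, one memory-zeroing step is now skipped, which improves performance.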
Optimized the top_k operator, significantly improving performance.
Added the sequence_slice operator, which slices a sub-sequence out of a sequence, starting at a specified position and with a specified length.
Added the sequence_unpad operator, which converts a padded Tensor to a LoDTensor.
Added the sequence_reverse, roi_align, and affine_channel operators.
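
The semantics of sequence_slice and sequence_unpad described above can be sketched in plain Python. This is only an illustration of the behaviour, not the operators' actual interface: the function signatures, the flat-list data layout, and the offset-style `lod` list are assumptions made for this example.

```python
def sequence_slice(data, lod, offsets, lengths):
    """Slice one sub-sequence out of every sequence in a flattened batch.

    data    -- flat list holding all sequences back to back (assumed layout)
    lod     -- cumulative offsets, e.g. [0, 3, 5] describes two sequences
    offsets -- start position per sequence, relative to that sequence
    lengths -- number of items to keep per sequence
    """
    out, out_lod = [], [0]
    for i, (start, length) in enumerate(zip(offsets, lengths)):
        begin, end = lod[i], lod[i + 1]
        if start < 0 or length < 0 or begin + start + length > end:
            raise ValueError("slice exceeds bounds of sequence %d" % i)
        out.extend(data[begin + start:begin + start + length])
        out_lod.append(out_lod[-1] + length)
    return out, out_lod


def sequence_unpad(padded, lengths):
    """Drop padding from dense rows and return flat data plus offsets."""
    out, lod = [], [0]
    for row, n in zip(padded, lengths):
        if n < 0 or n > len(row):
            raise ValueError("length %d does not fit the padded row" % n)
        out.extend(row[:n])  # keep only the valid prefix of each row
        lod.append(lod[-1] + n)
    return out, lod
```

For example, slicing the batch `[1, 2, 3 | 4, 5]` with offsets `[1, 0]` and lengths `[2, 1]` keeps `[2, 3]` from the first sequence and `[4]` from the second.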

Added automatic switching between avx and noavx code paths, so major models can switch automatically among avx, avx2, and avx512.
Improved inference usability: only one header and one library need to be included.
Significantly improved inference performance.
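
The general idea behind this kind of automatic switching is runtime dispatch on the CPU's reported features. The sketch below is a minimal illustration under stated assumptions and does not reflect how the framework implements it: `pick_isa` and `ISA_FLAGS` are hypothetical names, and the flag spellings follow what Linux lists in /proc/cpuinfo (where AVX-512 foundation support appears as `avx512f`).

```python
# Widest instruction set first. Each entry pairs the level name used in
# these notes with the CPU flag assumed to indicate support for it.
ISA_FLAGS = (("avx512", "avx512f"), ("avx2", "avx2"), ("avx", "avx"))


def pick_isa(cpu_flags):
    """Return the widest supported level, falling back to "noavx"."""
    flags = {flag.lower() for flag in cpu_flags}
    for isa, flag in ISA_FLAGS:
        if flag in flags:
            return isa
    return "noavx"
```

A caller would pass the flags detected on the host, for instance `pick_isa(["sse4_2", "avx", "avx2"])`, and use the returned level to choose which kernel variant to run.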
